Our results are significant in five main respects. First, they empirically establish that both practicing radiologists and non-professional subjects commit significant errors in estimating the actual probability of cancer from probabilistic information. Moreover, both sets of subjects significantly overestimate the probability of cancer, and do so in similar ways. This indicates that the tendency to overestimate cancer probabilities is not idiosyncratic to either group of subjects.
These findings are significant in and of themselves, quite apart from what they mean for recall decisions (which we address further below). While clinicians are rarely, if ever, called upon to formally estimate probabilities, probabilistic reasoning in general does have profound clinical implications. Indeed, in his influential study that used the same probability estimation paradigm as ours, Eddy (1982) argued that “errors [in probabilistic reasoning] threaten the quality of medical care”, and that “the power of formal probabilistic reasoning provides great opportunities for improving the quality and effectiveness of medical care”. Many subsequent studies of CAD use have echoed this sentiment (de Hoop et al., 2010; Keen et al., 2018; Lehman et al., 2015; Regge & Halligan, 2013; Yanase & Triantaphyllou, 2019).
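To make the estimation problem concrete: given the disease prevalence, the system's hit rate, its false alarm rate, and its binary verdict, the normatively correct probability of cancer follows from Bayes' rule. The sketch below computes it; the parameter values are hypothetical, chosen to resemble Eddy's (1982) problem rather than our experimental materials.

```python
# Illustrative sketch (not the study's materials): the normatively correct
# probability of cancer given a CAD verdict, computed with Bayes' rule.
# All parameter values below are hypothetical, for illustration only.

def posterior_cancer(prevalence, hit_rate, false_alarm_rate, cad_positive):
    """P(cancer | CAD verdict) via Bayes' rule."""
    if cad_positive:
        p_verdict_given_cancer = hit_rate
        p_verdict_given_healthy = false_alarm_rate
    else:
        p_verdict_given_cancer = 1.0 - hit_rate            # miss rate
        p_verdict_given_healthy = 1.0 - false_alarm_rate   # correct rejection
    joint_cancer = p_verdict_given_cancer * prevalence
    joint_healthy = p_verdict_given_healthy * (1.0 - prevalence)
    return joint_cancer / (joint_cancer + joint_healthy)

# Eddy-style (hypothetical) numbers: 1% prevalence, 80% hit rate,
# 10% false alarm rate.
print(posterior_cancer(0.01, 0.80, 0.10, cad_positive=True))   # ~0.075
print(posterior_cancer(0.01, 0.80, 0.10, cad_positive=False))  # ~0.002
```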
Second, and even more importantly, we show that probability estimations closely track recall decisions, at least in the case of non-professional subjects. This is not necessarily to say that subjects explicitly base their recall decisions on their estimations of cancer probability; it is possible that both cognitive processes reflect a third, unknown process.
Note, parenthetically, that our study design made it possible to reject, at least for non-professional subjects, the hypothesis that recall decisions always reflect a reasonable, if hypervigilant, strategy: the subjects recalled patients even when the actual probability of cancer was exactly zero.
Our study also highlights the potential usefulness of studying selected aspects of the clinical decision-making process in non-professional subjects. For instance, our results make it reasonable to hypothesize that the close relationship between probability estimations and recall decisions that we demonstrate in non-professional subjects also holds for practicing radiologists. The design of Exp. 2 provides a straightforward template for testing this hypothesis.
Third, our results reveal three distinct, significant sources of radiologists' estimation errors: (1) neglect of the prevalence (i.e., base rate) of breast cancer, (2) overweighting of the CAD system's binary decision in each individual case (i.e., the 'individuating' information), and (3) a neglect of the false alarm rate that depends on the binary decision. Note that the latter two factors pertain to the CAD system per se, whereas the base rate is not a property of the CAD system. Somewhat surprisingly, non-professional subjects were better at accounting for the base rate, even though their overestimations were significantly worse. The reasons for these differences remain to be established.
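The following sketch illustrates, with hypothetical numbers and deliberately simplified forms of the first two distortions, how each inflates the estimate for a positive CAD verdict relative to the correct Bayes posterior (the third factor is illustrated separately below); the actual contribution of each factor in our data was, of course, estimated empirically.

```python
# Hypothetical illustration (not our experimental materials) of how the
# first two factors inflate the estimate for a positive CAD verdict.
# The simple functional forms of the distortions are assumptions made
# purely for illustration.

def bayes_positive(prevalence, hit_rate, false_alarm_rate):
    """Correct P(cancer | CAD positive) via Bayes' rule."""
    joint_cancer = hit_rate * prevalence
    joint_healthy = false_alarm_rate * (1.0 - prevalence)
    return joint_cancer / (joint_cancer + joint_healthy)

prevalence, hit_rate, far = 0.01, 0.80, 0.10   # hypothetical parameters

correct = bayes_positive(prevalence, hit_rate, far)      # ~0.075
# (1) Base rate neglect: the 1% prevalence is ignored (treated as 50/50).
base_rate_neglect = bayes_positive(0.5, hit_rate, far)   # ~0.89
# (2) Overweighting the binary verdict: the estimate tracks the verdict's
#     hit rate itself rather than the posterior.
verdict_overweight = hit_rate                            # 0.80

print(correct, base_rate_neglect, verdict_overweight)
```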
Importantly, our results empirically demonstrate that the observed estimation errors were not fully attributable to well-known effects such as base rate neglect or overweighting of the individuating information in either group of subjects. That is, while these effects did contribute to the observed cancer probability estimates, they did not fully account for them. Incidentally, this finding highlights the importance of having empirically measured these errors in the specific context of interpreting CAD results, because our results could not have been predicted by simply extrapolating from previous studies of the underlying estimation problem in other contexts (for a review, see Mandel, 2014).
Fourth, our study identifies a novel effect that significantly contributes to the estimation errors by both radiologists and non-professional subjects, namely the conditional neglect of false alarm rates. The neglect of the false alarm rate has been previously reported in the context of legal decision-making (Dahlman et al., 2016). However, to our knowledge, the present study represents the first report of a conditional neglect of the false alarm rate in any decision-making context.
This effect is intriguing, because it means that subjects take the system's false alarm rate into account if the system decides that the given mammogram is positive for cancer, and neglect it otherwise. One possible explanation is that when the CAD system reports a positive, the subjects consult the false alarm rate to help them determine whether the positive report is itself a false alarm. When the system reports a negative, by contrast, the subjects ignore the false alarm rate, presumably because the false positive rate seems moot when the finding is negative to begin with. There is a certain intuitive logic to this, but the intuition is mistaken, as the sketch below shows.
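Under Bayes' rule, the false alarm rate still enters the posterior for a negative verdict, through the correct rejection rate (1 − false alarm rate). A minimal sketch, with hypothetical numbers:

```python
# Why the false alarm rate is not moot for negative verdicts: it enters
# P(cancer | CAD negative) through the correct rejection rate, 1 - FAR.
# All numbers are hypothetical, for illustration only.

def p_cancer_given_negative(prevalence, hit_rate, false_alarm_rate):
    miss = (1.0 - hit_rate) * prevalence                          # cancer missed
    correct_rejection = (1.0 - false_alarm_rate) * (1.0 - prevalence)
    return miss / (miss + correct_rejection)

prevalence, hit_rate = 0.01, 0.80
print(p_cancer_given_negative(prevalence, hit_rate, 0.10))  # ~0.0022
print(p_cancer_given_negative(prevalence, hit_rate, 0.90))  # ~0.0198, ~9x higher
```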
Note, incidentally, that the fact that subjects do take false alarm rates into account does not necessarily mean that they attach the correct weight to those rates. Indeed, our results indicate that the radiologists do not: they overweight this factor. Further studies are needed to fully characterize this effect.
Finally, our study quantitatively estimates the effect of each of the aforementioned contributing factors in both the estimation and the recall tasks. These estimates confirm that CAD estimation errors reflect a unique weighted combination of the underlying contributing factors. This also suggests that no unifying explanation accounts a priori for all such phenomena. The closest one can come to a unifying framework is this: a finite set of potential contributing factors applies to all problems of this type, but the observed results in any given problem scenario depend on the relative weights of those factors in that scenario. Since there is no known way of predicting the weights a priori, the relative contributions of the various factors must be estimated empirically in each case. Again, this vindicates our empirical approach.
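For readers interested in how such relative weights can be estimated in principle, the sketch below fits a simple linear model to simulated subject estimates by least squares. The model, the predictors, and all numbers are assumptions made for illustration; this is not the actual model or data used in our analyses.

```python
import numpy as np

# Minimal sketch (not the study's actual model): estimating the relative
# contribution of each factor to subjects' probability estimates by linear
# regression. X columns are the hypothesized factors; y holds simulated
# subject estimates for each problem scenario.

rng = np.random.default_rng(0)
n = 200
base_rate = rng.uniform(0.001, 0.1, n)   # prevalence shown to the subject
verdict = rng.integers(0, 2, n)          # CAD binary verdict (1 = positive)
far = rng.uniform(0.01, 0.3, n)          # false alarm rate shown
# FAR enters only when the verdict is positive -- the "conditional neglect"
# pattern, encoded here as an interaction term.
X = np.column_stack([np.ones(n), base_rate, verdict, far * verdict])

true_w = np.array([0.05, 0.3, 0.5, -0.4])  # hypothetical generating weights
y = X @ true_w + rng.normal(0, 0.02, n)    # simulated subject estimates

w, *_ = np.linalg.lstsq(X, y, rcond=None)  # recovered relative weights
print(dict(zip(["intercept", "base_rate", "verdict", "far_x_verdict"], w)))
```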
A corollary of this is that it can be misleading and counterproductive to attribute such errors a priori to a well-known cause, such as base rate neglect, or overweighting of the individuating information.
Probability estimation errors in other contexts
Neglect of the base rate and overweighting of the individuating information have been shown to cause estimation errors in other contexts (Fischhoff & Bar-Hillel, 1984; Kahneman & Tversky, 1973; Mandel, 2014); our results empirically demonstrate these effects in the context of CAD result interpretation. As noted above, neglect of the false alarm rate has previously been reported in the context of legal decision-making (Dahlman et al., 2016), but only in its unconditional form.
Some previous studies in other contexts have shown that other approaches, such as appropriately modifying the statement of the problem and the decision workflow, can also reduce estimation errors (Hoffrage & Gigerenzer, 2004; Mandel, 2014; Wood, 1999). Our preliminary study did not address these admittedly important complexities, because it aimed simply to establish empirically the existence of the aforementioned estimation and decision errors and their sources, not to show how to reduce the errors.
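By way of illustration, one well-known restatement of this kind is the natural frequency format advocated by Hoffrage and Gigerenzer (2004), in which the same probabilistic information is expressed as counts in a reference population; the numbers below are hypothetical.

```python
# Hypothetical illustration of the natural-frequency restatement discussed
# by Hoffrage & Gigerenzer (2004): the same information expressed as counts
# in a reference population of 1000 patients rather than as probabilities.

population = 1000
prevalence, hit_rate, far = 0.01, 0.80, 0.10   # hypothetical parameters

with_cancer = round(population * prevalence)               # 10 patients
true_positives = round(with_cancer * hit_rate)             # 8 flagged
false_positives = round((population - with_cancer) * far)  # 99 flagged

# "Of 107 flagged patients, 8 actually have cancer" -- the posterior is
# immediate by inspection:
print(true_positives / (true_positives + false_positives))  # ~0.075
```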
Additional caveats and future directions
In addition to the various caveats specific to our study noted in context throughout this report, it is worth noting one general caveat that applies to our study and, more broadly, to all laboratory studies of clinical phenomena: such studies simplify, out of both practical and scientific necessity, the underlying clinical problems. We have noted throughout this report the various ways in which our study does this. Without glossing over these limitations, however, our results would have been essentially uninterpretable had we not simplified the study design by removing potential confounding variables. For instance, if we had presented the actual mammograms to the subjects, we could not have ascertained the contribution of the aforementioned four pieces of probabilistic information about the CAD system to the subjects' responses, because of the potentially confounding contributions of the mammogram itself and of how the subjects perceived it. In removing these confounds and presenting the results along with the applicable caveats, our study adopted the standard tradeoff: seek to clarify by simplifying.
Much future work is needed to address the many questions raised by our study. These include, but are not limited to, ascertaining whether and to what extent our results, especially from non-professional subjects, are generalizable to actual clinical conditions.
An important practical implication of our results is that training subjects to properly interpret the output of CAD systems may help improve the efficacy of CAD systems in mammography (Yanase & Triantaphyllou, 2019). It may also help if the systems explicitly provide an additional piece of information, namely the expected probability of cancer for each given mammogram.