New approaches to the analysis of eye movement behaviour across expertise while viewing brain MRIs

Crowe, Emily M.; Gilchrist, Iain D.; Kent, Christopher

doi:10.1186/s41235-018-0097-4

Original article
Open access
Published: 25 April 2018

Cognitive Research: Principles and Implications volume 3, Article number: 12 (2018)

Abstract

Brain tumour detection and diagnosis requires clinicians to inspect and analyse brain magnetic resonance images. Eye-tracking is commonly used to examine observers’ gaze behaviour during such medical image interpretation tasks, but analysis of eye movement sequences is limited. We therefore used ScanMatch, a novel technique that compares saccadic eye movement sequences, to examine the effect of expertise and diagnosis on the similarity of scanning patterns. Diagnostic accuracy was also recorded. Thirty-five participants were classified as Novices, Medics and Experts based on their level of expertise. Participants completed two brain tumour detection tasks. The first was a whole-brain task, which consisted of 60 consecutively presented slices from one patient; the second was an independent-slice detection task, which consisted of 32 independent slices from five different patients. Experts displayed the highest accuracy and sensitivity followed by Medics and then Novices in the independent-slice task. Experts showed the highest level of scanning pattern similarity, with Medics engaging in the least similar scanning patterns, for both the whole-brain and independent-slice tasks. In the independent-slice task, scanning patterns were the least similar for false negatives across all expertise levels and most similar for Experts when they responded correctly. These results demonstrate the value of using ScanMatch in the medical image perception literature. Future research adopting this tool could, for example, identify cases that yield low scanning similarity and so provide insight into why diagnostic errors occur and ultimately help in training radiologists.

Significance

According to the American Brain Tumor Association (2017), nearly 80,000 cases of primary brain tumour are expected to be diagnosed in 2017. Clearly, the successful detection of brain tumours is essential for diagnosis, patient monitoring, treatment planning and patient prognosis. Current best practice requires clinicians to inspect and analyse MRIs. Eye-tracking has commonly been used to examine the gaze behaviour of observers in this task, but limited research has examined the sequence of eye movements observers engage in when searching for abnormalities. We used a novel technique, ScanMatch, to compare saccadic eye movement sequences in a brain tumour detection task. This method utilises both temporal and spatial components of eye movement sequences and therefore enables a more detailed investigation into the search behaviour of observers. This research demonstrates the effective application of ScanMatch to the medical image perception literature thus offering a new approach to the analysis of eye movement behaviour.

Background

Medical imaging is a crucial tool when making diagnostic and treatment decisions. Clinicians inspect an image to first detect and then interpret any abnormalities in the context of a given medical problem. Approximately 5 billion diagnostic examinations are performed worldwide each year (Ciarrapico et al., 2017), with radiologic image perception and interpretation occurring at a rate of more than one per second in the United States (Beam, Krupinski, Kundel, Sickles, & Wagner, 2006). Despite advances in computer-aided detection (CAD), final medical decision-making resides with clinicians and so is constrained by their perceptual and cognitive abilities. A large body of eye-tracking research has been conducted to better understand how clinicians engage in these interrelated processes and so provide insight into the relationship between visual search and diagnostic decision-making (Reingold & Sheridan, 2011).

Novice-expert studies, which examine the effect of expertise on gaze behaviour, have revealed that experts have faster overall search times to detect and confirm the presence of an abnormality (Krupinski, 1996; Krupinski et al., 2006). Experts fixate on lesions, or other regions of interest, faster and for longer than Novices (Kundel, Nodine, Conant, & Weinstein, 2007; Nodine, Kundel, Lauver, & Toto, 1996) and, in clear images, spend more time fixating regions that are most likely to contain abnormalities (Kundel, 1974). Nodine et al. (1996) suggested that Novices engage in less efficient searches as indexed by their greater coverage of the medical image (Krupinski, 1996; Manning, Ethell, Donovan, & Crawford, 2006; Nodine et al., 1996). Such research demonstrates that gaze behaviour changes as a function of expertise and hints at the possibility that we may be able to identify what characterises expertise and how knowledge of this can be used to improve training practices, and efficiency, for future clinicians.

Clinicians make diagnostic mistakes, with estimates suggesting approximately 30% false-negative and false-positive rates in radiology (Krupinski, 2010). The prevalence effect reveals that observers often miss rare targets, which are not often encountered in daily medical screening, compared to more frequently encountered targets (Evans, Tambouret, Evered, Wilbur, & Wolfe, 2011; Wolfe et al., 2007). Berbaum et al. (2001) revealed that radiologists are susceptible to satisfaction of search, whereby detection of a first abnormality detracts from the detection of subsequent abnormalities. Drew, Võ, and Wolfe (2013) demonstrated inattentional blindness in radiologists who failed to report seeing a gorilla within a lung-nodule detection task. Eye-tracking has therefore been used to investigate why and where errors occur. Kundel, Nodine, and Carmody (1978) developed a nodule detection model based on the assumption that prolonged dwell times indicate intensive processing of visual data to enable classification of false-negative responses to pulmonary nodules into different types of error. Scanning errors reflect a failure to fixate the lesion areas and recognition errors occur when the lesion area has been fixated but an observer does not detect the lesion. Decision-making errors are those where the interpretation of a lesion is incorrect. Out of 20 false-negative diagnoses performed by four radiologists, 30%, 25% and 45% were classified as scanning, recognition and decision-making errors, respectively (Kundel et al., 1978). Several researchers have reported longer fixations for false negatives (Kundel & Nodine, 2004; Nodine, Mello-Thoms, Kundel, & Weinstein, 2002), indicative of prolonged visual attention. Krupinski (2005) examined the effect of lesion subtlety on gaze behaviour and found that when subtler lesions were detected, dwell time was longer than for both the more obvious lesions and false negatives. Taken together, these findings demonstrate a relationship between gaze behaviour and diagnostic accuracy, with certain behaviours characterising certain responses.

Scan paths, which can capture both temporal and spatial components of an individual’s search, have also been investigated in the medical image perception literature. Kundel and Nodine (2004) revealed that quantitative parameters derived from scan paths can be used to separate mammographers and trainee mammographers. Gandomkar, Tay, Brennan, and Mello-Thoms (2017) developed a model with 86.3% and 85.2% sensitivity and specificity, respectively, that distinguished expert and less experienced radiologists based on the spatial dynamics of their eye movements. Such research indicates that gaze behaviours characterise expertise in medical image interpretation tasks. Litchfield, Ball, Donovan, Manning, and Crawford (2008, 2010) revealed that viewing another person’s eye movements on a lung nodule detection task can improve performance, while Sridharan, Bailey, McNamara, and Grimm (2012) reported higher sensitivity and specificity when Novices used a subtle gaze direction (SGD) technique that actively guides Novices along the scan path of an expert. Taken together, these studies indicate that visual guidance aids medical image interpretation.

Research also suggests that scanning patterns are related to diagnostic accuracy. Davies et al. (2016) examined how practitioners perceived electrocardiograms (ECGs) and determined whether visual behaviour can indicate differences in interpretation accuracy. Their results demonstrated a difference in the gaze behaviour between correct and incorrect interpretations of various heart-related measurements (e.g. identifying hyperkalaemia, torsades de pointes and atrial flutters) and so highlighted this as a factor in interpretation accuracy. Voisin, Pinto, Morin-Ducote, Hudson, and Tourassi (2013) used machine learning to successfully predict radiologists’ errors during the diagnosis of mammographic lesions by merging their gaze behaviour and textural characteristics of the image. Tourassi, Mazurowski, Harrawood, and Krupinski (2010) examined the potential of a context-sensitive computer-assisted detection (CADe) system that is guided by the user’s focus of attention. The context-sensitive mode of the system, which analysed radiologists’ scanning patterns and diagnostic decisions while inspecting 20 mammograms, reduced radiologists’ perceptual and cognitive errors in the diagnostic interpretation of screening mammograms more effectively than the conventional CADe system. Taken together, the eye-tracking literature indicates that scanning patterns are related to diagnostic accuracy.

However, only a limited amount of research has investigated similarities between the scan paths of participants viewing medical images. Wooding, Roberts, and Phillips-Hughes (1999) reported that trainee radiologists (16.5 months experience) showed the least within-group consistency compared to laymen (0 months experience), novices (2.3 months experience) and radiologists (90 months experience). Trainee radiologists also showed the least amount of similarity to radiologists compared to all other comparison groups. The authors suggested that trainees go through a developmental phase characterised by idiosyncratic patterns of attention allocation and eye movements. Leong, Nicolaou, Emery, Darzi, and Yang (2007) examined whether experience improves the consistency of visual search behaviour in fracture identification in plain radiographs. Using Kullback-Leibler divergence and Gaussian mixture model fitting, these authors reported that experts exhibited higher consistency in their search patterns.

The present study will extend the limited literature examining the similarity of scan paths. More specifically, we use ScanMatch (Cristino, Mathôt, Theeuwes, & Gilchrist, 2010), a well-established approach to quantifying the similarity between scanning patterns of individuals. At its core, ScanMatch is based on the Needleman-Wunsch algorithm, used commonly for comparing DNA sequences. We chose this method because it accounts for the temporal, spatial and sequential components of fixations and so overcomes limitations of existing string edit methods (Cristino et al., 2010). Moreover, the substitution matrix allows researchers to encode information about the relationship between specific regions of interest and so can account for semantic information as well (Cristino et al., 2010). Madsen, Larson, Loschky, and Rebello (2012) used ScanMatch to test whether the eye movements of individuals answering physics questions differed between correct and incorrect answers. This method has also been used to examine differences in scanning patterns between Novices and Experts evaluating paintings (Pihko et al., 2011) and viewing surgical procedures (Kübler, Eivazi, & Kasneci, 2015), and in studies of problem-solving (Nyamsuren & Taatgen, 2013), face-processing (Chaby, Hupont, Avril, Luherne-du Boullay, & Chetouani, 2017) and decision-making (Zhou et al., 2016). Anderson, Anderson, Kingstone, and Bischof (2015) compared the ability of several scan path comparison methods to reveal similarities both within and between individuals looking at natural scenes and concluded that ScanMatch is a remarkable improvement on simpler string-edit and linear distance methods. Using this tool, we will investigate the effect of expertise and diagnosis on the similarity of scanning patterns in a brain tumour detection task using MRIs.
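To make the alignment step concrete, the following is a minimal Python sketch of a Needleman-Wunsch global alignment score over two sequences of region labels. It is an illustration only and not the ScanMatch toolbox code: the function names, the simple match/mismatch substitution rule and the gap penalty value are assumptions chosen for readability, whereas ScanMatch itself uses a substitution matrix that encodes the relationship between regions.

```python
def needleman_wunsch_score(seq_a, seq_b, substitution, gap_penalty=0.0):
    """Return the optimal global alignment score of two label sequences.

    `substitution(a, b)` gives the score for aligning label a with label b;
    `gap_penalty` is added whenever a label is aligned with a gap.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]

    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, rows):
        score[i][0] = score[i - 1][0] + gap_penalty
    for j in range(1, cols):
        score[0][j] = score[0][j - 1] + gap_penalty

    # Each cell takes the best of: align the two labels, or insert a gap
    # in one sequence or the other.
    for i in range(1, rows):
        for j in range(1, cols):
            aligned = score[i - 1][j - 1] + substitution(seq_a[i - 1], seq_b[j - 1])
            gap_in_b = score[i - 1][j] + gap_penalty
            gap_in_a = score[i][j - 1] + gap_penalty
            score[i][j] = max(aligned, gap_in_b, gap_in_a)

    return score[-1][-1]


def simple_substitution(a, b):
    """Hypothetical substitution rule: +1 for the same region, -1 otherwise."""
    return 1.0 if a == b else -1.0


# Example: identical sequences score their full length.
print(needleman_wunsch_score("ABCD", "ABCD", simple_substitution))
```

Replacing `simple_substitution` with a rule that scores nearby regions more favourably than distant ones is how spatial (or semantic) relationships between regions would enter the comparison.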

Medical image perception tasks include both static two-dimensional (2D) image viewing and dynamic stack viewing. Stack viewing involves a clinician quickly scrolling through a stack of 2D images to get a three-dimensional (3D) impression of the anatomical structure of an organ (Nakashima, Komori, Maeda, Yoshikawa, & Yokosawa, 2016). The shift from static to dynamic viewing has changed the task of medical image interpretation with a tiled set of 2D images containing less information than a volumetric image (Krupinski et al., 2012). Medical students tend to perform worse on volumetric images than on 2D images (Ravesloot, van der Gijp et al., 2015; Ravesloot, Van Der Schaaf et al., 2015). van der Gijp et al. (2015) showed that radiology clerks take more time, and engage in more and different cognitive processes, when interpreting volumetric images than 2D images with Stuijfzand et al. (2016) reporting an effect of image information (i.e. 2D or 3D) on self-reported mental effort used to index cognitive load. 3D volumetric image interpretation better reflects the clinical setting for inspecting brain MRI images in which the brain is separated into cross-sections or ‘slices’. Therefore, in this experiment, it is important to investigate eye-gaze behaviour in dynamic viewing and the visual search of clinicians viewing sequentially presented, dependent, medical images (Drew, Võ, Olwal et al., 2013; Nakashima et al., 2016). The dependence between sequential images from the same brain is important as the clinician may use information from previous slices to direct their attention on the current slice.

Despite the wealth of research into medical image interpretation, most studies have used either the chest or breast as stimuli (see Reingold & Sheridan, 2011, for a review). Ostrom et al. (2015) predicted that in 2016 approximately 77,670 primary brain and central nervous system tumours were expected to be diagnosed in the United States. Although eye-tracking-based research has assessed clinicians’ inspection of brain MRI images for glioma diagnosis (Cavaro-Ménard, Tanguy, & Le Callet, 2010) and the eye-gaze distribution of neurologists when viewing CT images of stroke patients (Matsumoto et al., 2011), there is significantly less work examining how clinicians and novices view MRI images of the brain. To start to address this gap, the current study uses MRI brain images as stimuli.

Here, we use a novice-expert design to examine how eye-gaze parameters change across three expertise levels (i.e. undergraduate students: Novices; third and fourth year undergraduate medical students: Medics; and medical professionals: Experts) in a brain tumour detection task using MRIs. This study is the first to apply ScanMatch to the medical image perception literature to better understand the temporal dynamics of image interpretation. Thirty-five participants completed both a whole-brain (Experiment 1a) and independent-slice (Experiment 1b) brain tumour detection task to further investigate a prevalent issue in the literature, namely the effect of viewing modality on visual search and performance. In the whole-brain task, eye-gaze data were recorded while participants sequentially viewed 60 slices of a patient’s brain MRI and in the independent-slice task, both eye-gaze data and performance measures were recorded when participants inspected 32 brain MRIs (16 tumorous; 16 healthy). This exploratory work examines the efficacy of a novel technique, ScanMatch, within the medical image perception literature. The application of this technique could have practical implications including the development of medical training and monitoring of students’ acquisition of expertise.

Experiment 1a: whole-brain

Method

Participants

Thirty-five participants were recruited into three groups based on their level of expertise in brain MRI interpretation. The Novice group consisted of 18 undergraduates at the University of Bristol studying any subject apart from medicine, dentistry or veterinary sciences. The Medic group were ten medical students from the University of Bristol in either their third (n = 8) or fourth (n = 2) year of study. Seven Experts were recruited from a National Health Service (NHS) hospital and consisted of trainee neuroradiologists (n = 3; mean experience 2.5 years), consultant neuroradiologists (n = 2; mean experience 8 years) and consultant neurologists (n = 2; mean experience 12 years). All participants had normal or corrected-to-normal vision and gave written informed consent in accordance with the Declaration of Helsinki (2008). Ethical approval was obtained from the Faculty of Science Human Research Ethics Committee at the University of Bristol.

Stimuli and apparatus

All brain images and diagnoses were obtained from a UK NHS hospital. Stimuli for the whole-brain task consisted of 60 T2 brain MRI images from one patient with a right medial temporal lobe intrinsic tumour (see Fig. 1 for example slices). T1 and T2 images are commonly acquired medical images used in clinical settings for inspecting brains. We used T2 images for the whole-brain task because these are most heavily relied upon for brain tumour detection. All stimuli were registered using SPM5 (Penny et al., 2001). Stimuli were presented using a custom-made programme written using MATLAB (The MathWorks, Inc., 2013) and the Psychophysics Toolbox (Psychtoolbox-3; Brainard, 1997; Kleiner, Brainard, & Pelli, 2007; Pelli, 1997). Eye-gaze data were recorded from the participant’s dominant eye using the EyeLink 1000 (SR Research, Mississauga, ON, Canada), an infrared tracking system that uses the pupil centre in conjunction with the corneal reflection to sample eye position at 1000 Hz. For each data sample, a dedicated parser algorithm (SR Research, Mississauga, ON, Canada) computes the instantaneous velocity and acceleration of the eye. These are then compared to threshold criteria for velocity (30°/s) and acceleration (8000°/s²). If either is above threshold, the eye movement is classified as a saccade. A MATLAB script (The MathWorks, Inc., 2013) was then used to extract all the saccades from the EyeLink Data File. Using a chin rest, participants viewed stimuli on a colour laptop monitor (1280 × 800 pixel resolution) from a distance of 50 cm in a darkened room.
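The threshold rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the SR Research parser, whose exact filtering and event-merging steps are not described here. The function name, the use of simple sample-to-sample differences and the input format (gaze position in degrees of visual angle) are assumptions; only the sampling rate and the two thresholds come from the text.

```python
import math


def classify_saccade_samples(positions_deg, sampling_rate_hz=1000.0,
                             velocity_threshold=30.0,
                             acceleration_threshold=8000.0):
    """Flag each gaze sample as saccadic (True) or not (False).

    `positions_deg` is a list of (x, y) gaze positions in degrees of
    visual angle. A sample is flagged when its velocity (deg/s) OR its
    acceleration (deg/s^2) exceeds the corresponding threshold.
    """
    dt = 1.0 / sampling_rate_hz

    # Velocity: distance travelled since the previous sample, per second.
    velocities = [0.0]
    for (x0, y0), (x1, y1) in zip(positions_deg, positions_deg[1:]):
        velocities.append(math.hypot(x1 - x0, y1 - y0) / dt)

    # Acceleration: absolute change in velocity since the previous sample.
    accelerations = [0.0]
    for v0, v1 in zip(velocities, velocities[1:]):
        accelerations.append(abs(v1 - v0) / dt)

    return [v > velocity_threshold or a > acceleration_threshold
            for v, a in zip(velocities, accelerations)]


# Example: a stationary eye that jumps 0.1 degrees between two samples.
samples = [(0.0, 0.0)] * 3 + [(0.1, 0.0)] * 3
print(classify_saccade_samples(samples))
```

Raw sample-to-sample differences amplify measurement noise, so a practical parser would normally smooth the signal before applying the thresholds.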

Design

This was a mixed design with Expertise Level (Novice, Medic, Expert) as the between-subject factor and Run Through (Run 1, Run 2) as the within-subjects factor. In order to assess intra-group similarity in scanning patterns, within-group comparison was included as a between-subject factor (i.e. Novice-Novice, Medic-Medic, Expert-Expert). To investigate between-group similarity, between-group comparison was included as a within-subject factor (e.g. Novice-Novice, Novice-Medic, Novice-Expert).

Procedure

A standard nine-point calibration and validation procedure was performed in which observers were asked to fixate on a black cross that appeared randomly on a 3 × 3 grid. Following this, participants were instructed to search the stack of brain MRI images for any tumorous tissue. Each slice was presented for 1500 ms and was immediately followed by the subsequent slice. This process was completed twice with a 2000-ms break between runs. Participants were told to freely inspect the brain during both runs. The stimuli were selected so that the tumour was very clear to all observers and therefore participants were not required to provide a diagnosis (see Fig. 1). The task lasted approximately 4 min.

Analysis

Our analysis focused on the fixation patterns. We measured the similarity of the sequences of fixations using ScanMatch (Cristino et al., 2010). A letter-based string sequence was generated for each participant that described their fixations. The sequences of different participants were then compared using ScanMatch. An alignment score was generated and then normalised to provide a ScanMatch similarity score, namely an index of scanning similarity. A similarity score of 1 indicates that the sequences are identical while a score of 0 indicates that there is no similarity. Figure 2 shows details of the ScanMatch method. More specifically, within-group comparisons and between-group comparisons were examined for Run 1 and Run 2. The sequences of fixations for Run 1 and Run 2 were those that were recorded during the first and second presentation of the 60 slices which constituted an entire brain, respectively. For each run, the sequence spanned all slices, because the first fixation location on a given slice would have depended on the final fixation location of the previous slice. The sequence and duration of fixations were used to generate a letter-string sequence that corresponded to spatial locations. We then compared this sequence with other participants’ sequences.
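As a rough illustration of how fixations can be turned into a sequence that reflects both location and duration, the Python sketch below bins each fixation into a spatial grid and repeats its region label in proportion to its duration. The grid dimensions, the 50-ms temporal bin and the use of integer region indices in place of letters are assumptions made for this example and are not the settings reported for this experiment; only the 1280 × 800 display size comes from the text.

```python
def fixations_to_sequence(fixations, width, height, n_cols, n_rows,
                          temporal_bin_ms=50):
    """Convert fixations into a duration-weighted sequence of region indices.

    `fixations` is a list of (x, y, duration_ms) in screen pixels.
    The screen is divided into n_cols x n_rows regions, numbered row by
    row from the top-left corner. Each fixation contributes its region
    index once per temporal bin (at least once).
    """
    sequence = []
    for x, y, duration_ms in fixations:
        # Clamp so that points on the right or bottom edge stay in the grid.
        col = min(int(x / width * n_cols), n_cols - 1)
        row = min(int(y / height * n_rows), n_rows - 1)
        region = row * n_cols + col

        repeats = max(1, round(duration_ms / temporal_bin_ms))
        sequence.extend([region] * repeats)
    return sequence


# Example: two fixations on a 1280 x 800 display with a hypothetical 4 x 4 grid.
example = [(100, 100, 100), (1200, 700, 50)]
print(fixations_to_sequence(example, 1280, 800, n_cols=4, n_rows=4))
```

Repeating a label for longer fixations is what lets an alignment-based comparison take fixation duration into account as well as fixation order.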

Figure 3 explains how to interpret different ScanMatch similarity scores in the context of this experiment. In order to gain a clearer insight into this, we used the experimental data as the reference sequence and compared this with a test sequence. The first test sequence had one fixation replaced with a randomly selected alternative fixation. After each comparison, another random replacement was made in the same manner. This analysis demonstrates the effect of the number of different fixations between two scanning sequences on the ScanMatch similarity score (see Fig. 3). Of course, the actual pattern is more complicated than this as ScanMatch takes into account temporal order, so it does not simply reflect the number of fixation differences (this is a proxy in order to illustrate what a difference may mean in terms of fixation differences).
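The replacement procedure can be sketched as follows in Python. This is a hedged illustration and not the analysis code used for Fig. 3. The similarity function is a simplified stand-in for ScanMatch (match = 1, mismatch = -1, no gap penalty, normalised by the length of the longer sequence), and the choices to replace each position at most once and to draw replacements uniformly from the other regions are assumptions.

```python
import random


def normalised_similarity(seq_a, seq_b, match=1.0, mismatch=-1.0, gap=0.0):
    """Global alignment score of two sequences, scaled so identical = 1."""
    cols = len(seq_b) + 1
    previous = [j * gap for j in range(cols)]
    for i, item_a in enumerate(seq_a, start=1):
        current = [i * gap] + [0.0] * (cols - 1)
        for j, item_b in enumerate(seq_b, start=1):
            substitution = match if item_a == item_b else mismatch
            current[j] = max(previous[j - 1] + substitution,
                             previous[j] + gap,
                             current[j - 1] + gap)
        previous = current
    return previous[-1] / (match * max(len(seq_a), len(seq_b)))


def degradation_curve(reference, n_regions, seed=0):
    """Similarity to the reference after 0, 1, 2, ... random replacements."""
    rng = random.Random(seed)
    test = list(reference)
    order = list(range(len(reference)))
    rng.shuffle(order)  # assumption: each position is replaced at most once

    scores = [normalised_similarity(reference, test)]
    for position in order:
        # Pick a region that differs from the one originally fixated.
        alternatives = [r for r in range(n_regions) if r != reference[position]]
        test[position] = rng.choice(alternatives)
        scores.append(normalised_similarity(reference, test))
    return scores


# Example: a ten-item reference sequence on a hypothetical 16-region grid.
curve = degradation_curve(list(range(10)), n_regions=16)
print(curve[0], curve[1])
```

Because the alignment can shift items relative to one another, the score after several replacements need not fall by the same amount each time, which is consistent with the caveat above that the similarity score does not simply count fixation differences.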

Results

The results reported below lie in the middle range of ScanMatch similarity scores (see Fig. 3). This indicates that scanning patterns are neither in high agreement nor entirely random. Moreover, while many of the reported differences between conditions are small, they are statistically reliable. A 3 × 2 mixed ANOVA revealed that scanning patterns were more similar for Run 1 (M = 0.56, 95% confidence interval [CI] = 0.55–0.57) than Run 2 (M = 0.52, 95% CI = 0.50–0.54), F(1,32) = 27.98, p < 0.001, η_p² = 0.466. An effect of within-group comparison on ScanMatch similarity scores was also found, F(2,32) = 6.86, p = 0.003, η_p² = 0.300. Experts displayed the most similar scanning patterns (M = 0.58, 95% CI = 0.55–0.61), followed by Novices (M = 0.54, 95% CI = 0.52–0.56) and then Medics (M = 0.50, 95% CI = 0.47–0.53). There was no interaction, F(2,32) = 1.72, p = 0.196, η_p² = 0.097.
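For readers who wish to check reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom using the standard identity η_p² = (F × df_effect) / (F × df_effect + df_error). The short Python sketch below applies this identity to the effect of Run Through and to the interaction reported above; the function name is an illustrative choice.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Effect of Run Through: F(1,32) = 27.98
print(round(partial_eta_squared(27.98, 1, 32), 3))  # 0.466

# Interaction: F(2,32) = 1.72
print(round(partial_eta_squared(1.72, 2, 32), 3))   # 0.097
```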

A similar qualitative pattern of results was observed for between-group comparisons (see Table 1). Across all between-group comparisons, scanning similarity was higher for Run 1 than Run 2 (all F > 6.34, p < 0.014). There was an effect of between-group comparison when comparing novices against all other expertise levels F(2,34) = 13.10, p < 0.001, η_p² = 0.435. Novice-Novice and Novice-Expert comparisons were more similar than Novice-Medic comparisons. There was also an interaction, F(2,34) = 8.56, p = 0.001, η_p² = 0.335, with all between-group comparisons revealing higher scanning similarity for Run 1 than Run 2.

Table 1 Means (95% CI) for each between-group comparison and run through for ScanMatch similarity score

Medic-Medic comparisons were less similar than Medic-Novice and Medic-Expert comparisons, F(2,18) = 7.34, p = 0.008, η_p² = 0.413. Expert-Expert and Expert-Novice comparisons were more similar than Expert-Medic comparisons, F(2,12) = 15.83, p < 0.001, η_p² = 0.725. Study participants did not consent to data sharing so supporting data cannot be made available.

Experiment 1b: independent-slices

Since most existing research uses 2D viewing modalities, we also examined the effect of expertise and diagnosis on eye-gaze behaviour and diagnostic accuracy in an independent-slice brain tumour detection task. This provides insight into the effect that viewing modality may have on gaze behaviour in such tasks.