Improved X-ray baggage screening sensitivity with ‘targetless’ search training

Muhl-Richardson, Alex; Parker, Maximilian G.; Recio, Sergio A.; Tortosa-Molina, Maria; Daffron, Jennifer L.; Davis, Greg J.

doi:10.1186/s41235-021-00295-0

Original article
Open access
Published: 14 April 2021

Cognitive Research: Principles and Implications volume 6, Article number: 33 (2021)

A Correction to this article was published on 03 September 2021

This article has been updated

Abstract

When searching for a known target, mental representations of target features, or templates, guide attention towards matching objects and facilitate recognition. When only distractor features are known, distractor templates allow irrelevant objects to be recognised and attention to be shifted away. This is particularly true in X-ray baggage search, a challenging real-world visual search task with implications for public safety, where targets may be unknown, difficult to predict and concealed by an adversary, but distractors are typically benign and easier to identify. In the present study, we draw on basic principles of distractor suppression and rejection to investigate a counterintuitive ‘targetless’ approach to training baggage search. In a simulated X-ray baggage search task, we observed significant benefits to target detection sensitivity (d′) for targetless relative to target-based training, but no effects of performance-contingent rewards or the inclusion of superordinate semantic categories during training. The benefits of targetless search training were most apparent for stimuli involving less spatial overlap (occlusion), which likely represents the difficulty and greater individual variation involved in searching more visually complex images. Together, these results demonstrate the effectiveness of a counterintuitive targetless approach to training target detection in X-ray baggage search, based on basic principles of distractor suppression and rejection, with potential for use as a real-world training tool.

X-ray baggage search, the process of visually searching X-ray images of baggage for threats and prohibited items, is an important component of airport security that helps to ensure the safety of aircraft and passengers. Compared to other real-world visual search tasks, such as medical screening, baggage search involves some unique challenges, both due to how the task is specified and the nature of the visual environment.

One of the challenges in baggage search arises due to how the task is typically specified, i.e. in terms of a wide range of visually diverse and somewhat unpredictable targets that must be detected. There is an extensive literature on dual-target (and multiple-target) costs in visual search in both basic and applied tasks (Barrett & Zobay, 2014; Godwin et al., 2010; Menneer et al., 2007, 2009; Stroud et al., 2012), and such a cost is inevitably involved in baggage search as screeners are required to detect the presence of any target from a long list of prohibited items. Furthermore, comparatively high variation within some target categories may mean that some items within these categories are more difficult to detect (Hout & Goldinger, 2015) and, while some target categories can be quite precisely specified due to high within-category homogeneity, for many, high within-category heterogeneity is much more likely.

Developing broad categorical target templates that represent features common to a category is difficult for heterogeneous categories. For example, pistols represent a comparatively homogeneous target category and their features are easily predictable (mostly metal, small range of possible shapes). On the other hand, improvised explosives are much more heterogeneous and their features can be difficult to predict. They may take any one of an almost unlimited number of forms and may share almost no features with other improvised explosives. Developing a categorical template that will support guidance towards and recognition of this type of target is difficult (Hout et al., 2017) and may limit the effectiveness of target-based training procedures.

The literature on distractor suppression in basic search tasks shows that, independently of target templates, searchers can use knowledge of distractors to build distractor templates (also called templates for rejection) to guide attention away from distractors (Arita et al., 2012; Daffron & Davis, 2015; Geng, 2014; Moher & Egeth, 2012). This voluntary process is distinct from the suppression of bottom-up attentional capture by irrelevant salient stimuli that has been the subject of recent work in this area (Chang & Egeth, 2019; Gaspelin et al., 2015). Sometimes this might involve initial attention towards and recognition of distractors, such that they might be rejected reactively, sometimes called a ‘search and destroy’ strategy, but alternatively, distractor rejection might be more proactive, with early attention towards distractors suppressed so that targets (known or unknown) might be located and recognised more readily. Regardless of the mechanism, distractor templates appear to facilitate the detection of novel search targets and can operate in the absence of a functional target template, which may be incredibly useful in tasks where it is difficult or impossible to specify or learn target features. Implicit forms of distractor suppression, for example, visual marking (Watson & Humphreys, 1997, 2000) and distractor-previewing effects (Ariga & Kawahara, 2004; Goolsby & Suzuki, 2001, 2002) may contribute to the development of distractor templates in addition to more explicit processes.

The present study focusses on distractor templates, which may be particularly important in baggage search for two reasons. Firstly, screeners almost exclusively encounter non-targets, as the vast majority of bags do not contain threats, and therefore experience a continuous stream of opportunities to learn about non-target features that can inform and support categorical distractor templates. Screeners might not experience a genuine target in their entire career, with simulated targets providing the only on-task learning opportunities. Secondly, the detection of novel or unusual targets (which often cannot be well specified in advance) would not rely on knowledge of target features, but rather the features of the much more familiar non-targets amongst which they appear.

Previous studies of distractor templates have involved basic laboratory search tasks, but it is difficult to predict how effective they may be in baggage search due to the unique combination of challenges to the human observer posed by this task. Perhaps the most fundamental of these is that baggage X-rays lack a regular structure (Donnelly et al., 2019). In laboratory tasks, everyday searches and medical imaging, the search environment involves some predictable structure, e.g. a radiographer examining a lung CT (computed tomography) scan knows how a human lung is typically arranged; in comparison, baggage screeners may have very few valid expectations about where items might appear within an image (McCarley et al., 2004).

Baggage X-rays typically follow a standard colour mapping, whereby objects are coloured based upon their density and absorption of X-ray radiation (Donnelly et al., 2019). By this mapping, metallic objects (higher density) are coloured blue and organic objects (lower density) are orange. Some items of moderate density are coloured green and extremely high-density objects appear as black. This colour mapping creates a novel search environment in which object appearance can differ significantly from expectations based on the normal visual world. Furthermore, all except extremely dense items have some degree of transparency, meaning that the appearance of almost every item will be influenced by spatially overlapping items, including the bag or container. Not only can overlap (occlusion) make it more difficult to segment images and identify object boundaries, but it also influences the colour of overlapping objects, which is typically based on averages of the relevant item properties (Godwin et al., 2017). These complexities and the lack of reliable cues that might be available in other visual scenes mean that search guidance in baggage X-rays is likely extremely limited (Vickery et al., 2005; J. Wolfe et al., 2008).

There is a large body of work on effects related to the extremely low prevalence of threats in baggage (Fleck & Mitroff, 2007; Godwin et al., 2010, 2015; Menneer et al., 2010; Mitroff & Biggs, 2014; Wolfe et al., 2007, 2013). In visual search tasks, low levels of target prevalence (i.e. when targets are rare) are associated with reduced hit rates (and false alarm rates) due to a conservative shift in response criterion. Eye movement studies went on to reveal that this criterion shift is associated with errors of perceptual selection, whereby targets are less likely to be fixated before the task is terminated, and of perceptual identification, whereby targets are less likely to be correctly identified following fixation (Godwin et al., 2015). While the effect of low prevalence in real-world baggage screening may be ameliorated through procedures that artificially increase threat prevalence (e.g. Threat Image Projection), it nonetheless remains a challenge for screeners (Donnelly et al., 2019). A related literature exists on satisfaction of search effects (also referred to as subsequent search misses; Adamo et al., 2018; Cain et al., 2013; Fleck et al., 2010), whereby searchers are less likely to detect subsequent targets after finding an initial target. However, real-world baggage search remains principally concerned with finding any single initial target, as this will always be sufficient to identify dangerous baggage.

While previous studies have examined X-ray baggage search from a human factors perspective (e.g. Buser et al., 2020; Hättenschwiler et al., 2019; Schwaninger, 2016), the present study takes a psychological approach by investigating the potential benefits of distractor templates in this context. We develop and test a targetless search training procedure for novice screeners, focussed on making use of broad categorical distractor templates in simulated X-ray baggage search tasks, incorporating some of the challenges discussed above. Acknowledging the severe limitations imposed on guidance when searching baggage X-rays, the focus of our targetless training procedure is not improving search guidance, but training the recognition (or identification) step of the search process, specifically with a focus on improving recognition of safe items rather than threats.

The current experiments build on unpublished pilot findings using the same rationale but with photographic stimuli, suggesting that a consequence of training distractor recognition is improved detection of challenging targets. The final experiment attempts to incorporate the preceding results to develop and test an enhanced targetless search training procedure. We examine performance both in terms of behavioural responses and Signal Detection Theory measures of sensitivity (d′) and bias (c).

Experiment 1

Experiment 1 aimed to test whether participants could be trained to make use of distractor templates, improving distractor recognition and, critically, aiding the detection of target objects (threat items that are prohibited in cabin baggage) in a simulated baggage search task. To do this, we developed a targetless search training procedure that focussed on training novice participants to recognise non-targets (safe items that are permitted in cabin baggage). We compared this targetless search training procedure with target-based search training (focussed exclusively on target recognition) and combined search training (including elements of target and non-target recognition). In order to avoid pre-existing biases for search strategies involving search for targets rather than rejecting non-targets and given a lack of baggage screening experience in our sample, we did not use a pre-/post-training test design and instead tested participants once following training. We expected that if participants naturally adopted a strategy based on searching for targets in a pre-training test, then this would potentially reduce the effectiveness of search training focussed on the rejection of non-targets.

We predicted that, relative to target-based search training, participants who received targetless search training would use distractor templates to recognise and exclude non-targets and therefore be better able to detect the presence of novel (untrained) targets in the simulated baggage search task. We also predicted that participants who received the combined search training would perform more similarly to those who received target-based training than targetless training due to a bias in favour of the target-based approach to search, i.e. searching for a target. Finally, we predicted that a conventional prevalence effect would be observed, such that low prevalence target categories would be associated with a lower hit rate than high prevalence target categories.

Method

Participants

Sixty participants (42 females, 18 males; M_age = 22.98 years, SD = 5.23) were recruited and randomly allocated to one of three training groups of equal size. Participants were recruited via the Department of Psychology Research Sign-up System and were reimbursed £10 for their time. All experiments presented in this manuscript were approved by the Ministry of Defence Research Ethics Committee and the Cambridge Psychology Research Ethics Committee.

Apparatus and stimuli

The experiment was programmed using PsychoPy 1.90.3 (Peirce, 2007, 2009) and presented on a 24″ Dell LCD monitor. Participants viewed the monitor from a distance of approximately 70 cm and responded using a keyboard.

The stimuli used for the familiarisation phase were taken from the CaSePIX X-ray image library, which we created using a Todd Research TR70 conveyor X-ray machine. These familiarisation phase stimuli consisted of a subset of 11 X-ray images of empty suitcases, with features such as zips and wheels labelled.

Further stimuli were created using SimFox (Renful Premier Technologies), web-based software used for real-world screener training. SimFox included an X-ray image library of baggage and items that can appear in baggage (including a range of threat and safe items) and we also used it to generate realistic composite images of baggage containing multiple items. Bag stimuli ranged in size from approximately 5.5° to 15.5° of visual angle in both dimensions and individual objects presented alone ranged in size from 0.5° to 16.8° of visual angle in both dimensions.
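As a rough illustration of what these visual angles mean on screen, the sketch below converts between degrees of visual angle and physical size using the standard relation size = 2 · distance · tan(angle / 2). It assumes the approximate 70 cm viewing distance reported above; the function names are hypothetical and are not taken from the experiment code.

```python
import math

VIEWING_DISTANCE_CM = 70.0  # approximate viewing distance reported in the text


def angle_to_size_cm(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """On-screen extent (cm) that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


def size_to_angle_deg(size_cm, distance_cm=VIEWING_DISTANCE_CM):
    """Visual angle (degrees) subtended by size_cm at distance_cm."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


# Range reported for bag stimuli: approximately 5.5 to 15.5 degrees
smallest_bag_cm = angle_to_size_cm(5.5)
largest_bag_cm = angle_to_size_cm(15.5)
print(f"{smallest_bag_cm:.1f} cm to {largest_bag_cm:.1f} cm")
```

Under this assumption, the bag stimuli would have spanned roughly 6.7 cm to 19.1 cm on screen; the exact values depend on the true viewing distance, which was only approximate.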

We began by defining two sets of object categories, 14 threat categories and 14 safe categories (see Additional file 1: Supplementary Tables S1–S5). The training phase stimuli were X-ray images of individual objects that fitted into these categories and in total consisted of 196 threat objects and 196 safe items (participants only ever viewed a subset of these stimuli that was dependent upon their training group, see Design and Procedure). The testing phase stimuli were composite X-ray images of bags and were made to contain six individual objects. In total 168 testing phase bags were generated, 84 containing a single object from one of the threat categories (plus five safe items) and 84 containing only safe items (i.e. 50% overall threat prevalence).

Threat objects that were used in the testing phase bags were not used in the training phase and were not repeated between testing phase bags. Safe items could appear in up to five different testing phase bags and some safe items in testing phase bags also appeared during training. We reserved some safe items for the test phase only and, for other items, limited the number of appearances in the training phase. Across all bag stimuli presented at test, this resulted in: 21 bags with entirely novel (unseen during training) safe items, 53 bags with only one safe item presented during training (five items entirely novel), 58 bags with two safe items presented during training, 27 bags with three safe items presented during training, seven bags with four safe items presented during training and two bags with five safe items presented during training.

While this approach treated threat and safe items differently across training and test phases, it not only allowed us to generate a large number of stimuli, but also meant that these stimuli better represented real-world baggage screening conditions, where categories of safe item are often highly homogeneous (compared to threat items), some of the same safe items may be found in many different bags (e.g. popular brands of mobile phone or laptop computer) and specific threats are difficult to predict or foresee.

While the overall prevalence of threat objects was held at 50%, the relative prevalence of specific categories of threat was manipulated. In all test phases for Experiments 1 to 4, there were seven high prevalence threat categories (brackets show percentage of trials with target present rounded to nearest integer): explosives (5%), firearm magazines/components (6%), large firearms (3%), small firearms (7%), knives/stabbing weapons (7%), liquids/gases (4%), snips/scissors/pliers (5%). There were also seven low prevalence threat categories: grenades (2%), ammunition (2%), blunt weapons/axes (2%), throwing stars/knuckles (2%), power tools (2%), shrapnel (2%), trowels/wrenches (2%). For all experiments, high prevalence threat categories at test also had high prevalence during training and low prevalence threat categories at test also had low prevalence at training, although precise prevalence levels varied due to the number of trials and the number of available stimuli in each category (see Additional file 1: Supplementary Tables S1–S5 for exact numbers of stimuli in each category in all phases/experiments).

Design and procedure

To minimise the likelihood of participants autonomously adopting a standard target-based approach to search prior to training (that could potentially persist after training), Experiment 1 did not utilise a pre-training test phase (see Fig. 1; the figure includes stimuli not used in the experiment, shown for illustrative purposes only). Search performance was instead assessed after training with a single test phase. Participants were randomly assigned to one of three training groups: one included only threat items (target-based search training), one included only safe items (standard targetless search training; sTST) and one included a combination of threat and safe items (combined search training; CST).

Participants initially completed a short familiarisation phase which involved passively viewing 11 X-ray images of empty suitcases (with some features labelled). These were viewed sequentially and viewing was self-paced. The rationale behind this phase was to familiarise participants both with the appearance of baggage X-ray images in general and with some of the specific features of suitcases, which form the ‘background’ of all stimuli used in the test phase.

Following completion of the familiarisation phase, participants began the training phase. On each trial of the training phase, a single X-ray image of an object (see Apparatus and Stimuli) was presented against a white background and participants were required to indicate the category to which the object belonged by using the mouse to click on one of 14 category labels presented in a list on the left of the screen (see Fig. 2; the category labels present depended on the training group and block). Following the click, feedback was provided which either stated that the selection was correct or indicated what the correct category was if the response was incorrect. The target-based search group completed 196 training trials categorising threat items, the sTST group completed 196 training trials categorising safe items and the CST group completed 98 training trials categorising threat items and 98 training trials categorising safe items. A single object was displayed on each trial and each object was displayed only once. These trials were organised into four equal sized blocks and for the CST group, threat and safe items were presented in separate blocks that were order counterbalanced. The target-based search group did not learn about the 14 safe item categories and the sTST group did not learn about the 14 threat object categories, but the CST group did learn about all 28 object categories.

Following training, participants completed a test phase to assess their performance. All participants were informed that the test phase would involve viewing bags and determining whether they were ‘safe’ or ‘dangerous’ and that bags should be considered safe if they contained only items that were typically allowed in aircraft cabin baggage and dangerous if they contained an item that was typically prohibited in this situation. All participants were instructed to use what they had learned in the training phase to help them complete the test phase. Specifically, the target-based search group was instructed to focus on identifying the dangerous items they learned about and to treat items they did not recognise as safe; the sTST group was instructed to focus on ignoring the safe items they learned about and to treat items they did not recognise as dangerous; and the CST group was instructed to focus on identifying the dangerous items and ignoring the safe items they learned about. No instructions were given about threat or safe item prevalence or frequency. On each trial of the test phase, a central fixation cross was presented for one second, followed by an X-ray image of a bag containing six items (see Apparatus and Stimuli) presented for five seconds. After this time the bag stimulus disappeared, and the participant was prompted to respond to indicate whether the bag was ‘dangerous’ or ‘safe’ by pressing either the ‘z’ or ‘m’ key on the keyboard. Following the experiment, participants were debriefed, informed of the aims of the study and given the opportunity to ask any questions. As a whole, the experiment lasted no more than one hour per participant.

Results

We first tested whether the benefits of sTST might reflect repetition of items from that group’s training. To do this, we identified the 74 test phase bags which included zero or one safe items presented during sTST training (‘low/no repeat’ bags) and a further 94 bags which included two or more safe items presented during sTST training (‘high repeat’ bags). Any benefit for the sTST group that derived from repetition of training items should be most evident in the high repeat set of bags. We calculated d′ separately for these two sets of test stimuli, across sTST and target-based training groups (the latter of which had not viewed any safe items prior to test), and conducted a two-way mixed ANOVA with training group (sTST, target-based) and item repetition (high repeat, low/no repeat) as factors. This yielded main effects of training group, F(1,38) = 22.03, p < 0.001, η_G² = 0.31, and of item repetition, F(1,38) = 28.60, p < 0.001, η_G² = 0.15. However, there was no interaction between training group and item repetition, F(1,38) = 2.06, p = 0.159. This provided no evidence that high repeat bags (M_sTST = 1.55, SD_sTST = 0.25, M_target-based = 1.17, SD_target-based = 0.22) conferred a specific advantage for the sTST group over the target-based group relative to low/no repeat bags (M_sTST = 1.34, SD_sTST = 0.25, M_target-based = 0.81, SD_target-based = 0.22). While item repetition did have a main effect on d′, this was consistent across training groups, despite all safe items being novel at test for the target-based group. This likely reflects a combination of stimulus specific effects and recognition benefits accumulated during the test phase for objects which appeared in multiple test phase bags. In any case, item repetition does not appear to explain the benefits conferred by sTST. Further analysis of these differences for bags that contained no repeated items and only a single repeated item is included in Additional file 1: Table S5.
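The 74/94 split follows directly from the bag counts given in the Apparatus and stimuli section. The short check below simply tallies those reported counts; the variable names are illustrative only.

```python
# Number of test phase bags, keyed by how many of their safe items had also
# appeared during sTST training (counts as reported in Apparatus and stimuli).
bags_by_repeated_items = {0: 21, 1: 53, 2: 58, 3: 27, 4: 7, 5: 2}

# 'Low/no repeat' bags contain zero or one trained safe item
low_or_no_repeat = sum(n for k, n in bags_by_repeated_items.items() if k <= 1)
# 'High repeat' bags contain two or more trained safe items
high_repeat = sum(n for k, n in bags_by_repeated_items.items() if k >= 2)
total_bags = low_or_no_repeat + high_repeat

print(low_or_no_repeat, high_repeat, total_bags)  # 74 94 168
```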

Analysis of sensitivity (d′) and criterion (c) revealed higher d′ scores for those who received sTST and CST, relative to target-based training, and criterion differences between the training groups, most notably between the target-based training and sTST groups, which were biased towards responding that bags contained the types of item they had viewed during training.

We conducted two between-subjects ANOVAs. There was a significant effect of training group on d′, F(2,57) = 12.99, p < 0.001, η_G² = 0.31. Planned comparisons revealed that d′ for the target-based search training group (M = 1.00, SD = 0.25) was significantly lower than both the sTST group (M = 1.46, SD = 0.38), t(38) = 4.56, p < 0.001, d = 1.44, and the CST group (M = 1.45, SD = 0.33), t(38) = 4.77, p < 0.001, d = 1.51. There was no significant difference between the sTST group and the CST group, t(38) = 0.17, p = 0.867.

There was a significant effect of training group on c, F(2,57) = 17.95, p < 0.001, η_G² = 0.39. Planned comparisons revealed that c was significantly lower for the target-based search training group (M = −0.18, SD = 0.39) than both the sTST group (M = 0.42, SD = 0.32), t(38) = 5.23, p < 0.001, d = 1.66, and the CST group (M = 0.19, SD = 0.20), t(38) = 3.65, p < 0.001, d = 1.15. The sTST group also had a significantly higher c than the CST group, t(38) = 2.73, p = 0.009, d = 0.86.
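To give a sense of the scale of these effects, the sketch below recomputes an effect size from the rounded means and standard deviations reported above. It assumes that d is Cohen's d based on a pooled standard deviation with equal group sizes; the source does not state which formula was used, so this is an approximation and not the authors' calculation.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d for two equal-sized groups, using the pooled SD (assumed formula)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd


# Rounded summary statistics from the text: sTST vs target-based training
d_sensitivity = cohens_d(1.46, 0.38, 1.00, 0.25)  # comparison on d'
d_criterion = cohens_d(0.42, 0.32, -0.18, 0.39)   # comparison on c

print(round(d_sensitivity, 2), round(d_criterion, 2))
```

With these inputs the sketch gives about 1.43 and 1.68, close to the reported 1.44 and 1.66. The small gaps would be expected if the published values were computed from unrounded data.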

Combined results from all experiments are shown in Table 1 and for Experiment 1 in Figs. 3 and 4. In these results, a negative value of c indicates a more liberal response criterion (lower threshold to respond target-present) and a positive value of c indicates a more conservative response criterion (higher threshold to respond target-present).
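For readers less familiar with these measures, the following sketch shows how d′ and c are conventionally derived from a hit rate and a false alarm rate under an equal-variance Gaussian signal detection model. The rates used are hypothetical and are not taken from Table 1, and the function name is illustrative. The source does not say how rates of exactly 0 or 1 were handled, so this sketch simply rejects them.

```python
from statistics import NormalDist


def sdt_measures(hit_rate, fa_rate):
    """Return (d_prime, c) under an equal-variance Gaussian SDT model."""
    if not (0.0 < hit_rate < 1.0 and 0.0 < fa_rate < 1.0):
        raise ValueError("rates must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    z_hit, z_fa = z(hit_rate), z(fa_rate)
    d_prime = z_hit - z_fa             # sensitivity
    criterion = -0.5 * (z_hit + z_fa)  # negative = liberal, positive = conservative
    return d_prime, criterion


# Hypothetical observers with the same sensitivity but opposite biases
liberal = sdt_measures(0.80, 0.40)       # says 'dangerous' readily
conservative = sdt_measures(0.60, 0.20)  # says 'dangerous' reluctantly
```

In this example both observers have the same d′, but the first has a negative c and the second a positive c, which mirrors the kind of criterion difference described for the target-based and sTST groups.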

Table 1 Mean Hit Rates and False Alarm Rates for the Target-based Search Training, Standard Targetless Search Training (sTST) and Combined Search Training (CST) Groups in Experiment 1, Semantic and Alphabetic Training Groups in Experiment 2, Fixed Reward (FR) and Performance-contingent Reward (PR) Groups in Experiment 3 and the Enhanced Targetless Search Training (ETST) and Practice Only (PO) Groups in Experiment 4 (overlap shown where relevant for Experiments 3 and 4; standard deviations shown in brackets)

To examine how the effects of training group changed over time during the test phase, we split the test phase into four equal blocks of 42 trials and plotted d′ and c for each of these (Fig. 5). Visual inspection of this plot shows that both measures remained generally consistent over time, with d′ levelling off after the first block for all three groups, suggesting that participants adapted to the task demands relatively rapidly and the effects of training did not shift over time. This interpretation was borne out in statistical analysis (two separate two-way mixed ANOVAs with test phase block [1,2,3,4] and training group [target-based, CST, sTST] as factors and d′ and c as dependent variables), which showed effects of training group on both d′, F(2,57) = 13.36, p < 0.001, η_G² = 0.15, and c, F(2,57) = 17.46, p < 0.001, η_G² = 0.30, and an effect of test phase block on d′, F(3,177) = 3.84, p_G-Gcorrected = 0.012, η_G² = 0.04. There were no statistically significant interactions between test phase block and training group, Fs < 0.90, suggesting that training effects remained consistent over time and that group differences were not due to strategy carryover effects.
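The block-wise analysis can be sketched as follows for a single participant. The trial sequence here is synthetic, and the log-linear correction for extreme rates is an assumption, since the source does not describe how extreme hit or false alarm rates within a block were treated. The function name is hypothetical.

```python
from statistics import NormalDist


def blockwise_dprime(trials, n_blocks=4):
    """Compute d' per block from (target_present, said_present) pairs in trial order."""
    if len(trials) % n_blocks:
        raise ValueError("trial count must divide evenly into blocks")
    size = len(trials) // n_blocks
    z = NormalDist().inv_cdf
    results = []
    for b in range(n_blocks):
        block = trials[b * size:(b + 1) * size]
        n_signal = sum(1 for present, _ in block if present)
        n_noise = len(block) - n_signal
        hits = sum(1 for present, said in block if present and said)
        false_alarms = sum(1 for present, said in block if not present and said)
        # Log-linear correction keeps rates away from 0 and 1 (assumed, not from the source)
        hit_rate = (hits + 0.5) / (n_signal + 1)
        fa_rate = (false_alarms + 0.5) / (n_noise + 1)
        results.append(z(hit_rate) - z(fa_rate))
    return results


# Synthetic participant: 168 trials, 50% target-present, an error on every fifth trial
synthetic_trials = []
for i in range(168):
    present = i % 2 == 0
    error = i % 5 == 0
    synthetic_trials.append((present, present != error))

per_block = blockwise_dprime(synthetic_trials)  # four values, one per block of 42 trials
```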

To further characterise task performance between groups, we analysed hit rate and false alarm rate using binomial generalised linear mixed-effects models (GLMMs, see Table 2). All responses were entered into the models as binary values indicating whether the response was a hit or a miss (for hit rate) or a false alarm or a correct rejection (for false alarm rate). Standard ‘treatment’ group comparisons were used such that the sTST and CST groups were each compared with the target-based search training group. All models included participant as a random factor, and in all cases, model fitting started with a full set of interactions and iterated through progressively simpler variants until reaching the best-fitting model (any models that failed to converge were excluded). The results of the models show that there were significant effects of training group on hit rate and false alarm rate, specifically that the target-based training group had a higher hit rate than the sTST group and both the sTST and CST groups had lower false alarm rates than the target-based training group. These results also show that low prevalence target categories were associated with a lower hit rate than high prevalence categories.

Table 2 Generalised linear mixed-effects models comparing hit rate and false alarm rate in the standard (sTST) and combined search training (CST) groups with the target-based search training group (standard errors in brackets)

Discussion

Our analysis supported our prediction that participants who received sTST training would be better able to detect targets in the baggage search task (at least in terms of d′), but further analysis revealed more nuanced differences among the three training conditions. The higher d′ scores for those who received sTST and CST training, relative to target-based training, demonstrated that learning about non-targets (at least as much as about targets) did benefit target detection. However, our analysis of criterion differences between the training groups revealed that both the target-based training group and the sTST group were biased towards responding that bags contained the type of item they had viewed during training (i.e. the target-based group were biased towards responding target-present and the sTST group target-absent). These differences can also be characterised in terms of the hit rate and false alarm data. While participants who received sTST demonstrated significantly lower hit and false alarm rates than participants who received target-based training, participants who received CST demonstrated a hit rate that did not differ significantly from that of target-based training participants, but also a significantly lower false alarm rate. As predicted, we also observed a prevalence effect in the typical direction, that is to say that lower prevalence target categories were associated with a lower hit rate than higher prevalence target categories.

Together these findings suggest that training that focuses on non-target recognition facilitates the detection of novel targets in a simulated baggage screening task. Contrary to our predictions, rather than pushing participants towards a target-based search strategy, training that focused equally on target and non-target recognition reduced the bias present in the other training conditions for responding consistent with training (we explore the potential benefits of equal threat and safe focus in training further in Experiment 4). Experiment 1 provides evidence that training safe item recognition is an effective approach and identifies important limitations involved in this. Experiment 2 builds on these findings, and previous studies of distractor templates, by aiming to determine whether grouping training stimuli together into superordinate semantic categories can improve target detection.

Experiment 2

Previous studies of distractor templates have indicated that non-targets are primarily represented in terms of their semantic features rather than their visual features (Daffron & Davis, 2015, 2016); this is in contrast to target templates, where visual features are prioritised (Godwin et al., 2014). In Experiment 2 we investigated whether semantic grouping of safe items into superordinate categories during training could benefit target detection relative to an arbitrary alphabetic grouping.

We predicted that semantic grouping of safe item categories would allow participants to more effectively reject safe items by developing distractor templates that included broad semantic features common to multiple specific subordinate categories. We anticipated that a benefit might result from the need to maintain fewer distractor templates if safe items from multiple subordinate categories can be effectively rejected according to a single superordinate template for rejection. Finally, we again predicted that we would observe a standard prevalence effect on hit rate as in Experiment 1.