Information from non-visible parts of the electromagnetic spectrum is beneficial for determining different types of environmental information in many operational settings (Hall & Llinas, 1997). For example, long-wave infrared (LWIR) emissions are useful for detecting heat information (e.g., occluded heat-producing objects, such as a person behind a bush), and short-wave infrared (SWIR; e.g., night vision) can pick up detail in conditions with low illumination. Together, infrared and visible sensors may supply the operator with complementary information and aid in a task such as determining a target's location (e.g., a person) relative to an object in the scene (Toet, IJspeert, Waxman, & Aguilar, 1997).
There are several alternative ways to present an observer with multiple sensor images simultaneously. A common family of approaches, which we refer to as algorithmic fusion, is to combine relevant information from two sensor images into one composite image (Burt & Kolczynski, 1993). Alternatively, information from each sensor could be displayed in two separate images. Presenting all available information moves the choice of relevant information to the operator rather than relying on an algorithm to detect useful sensor information.
Algorithmic fusion has been the focus of much of the research on presenting multi-spectral information. This is due to two potential benefits of the technique: (1) algorithmic fusion restricts the number of sources of visual information to which the operator must attend and (2) the resultant image may possess emergent features not found in either single image alone (Krebs & Sinai, 2002). A potential downside to algorithmic fusion is that some information from the individual sensors must be filtered out in the process of creating a single image (Hall & Steinberg, 2000). There are many options for algorithmic fusion, and the choice of algorithm does offer some freedom in determining what information is lost, but information is necessarily lost.
In some domains, giving complete information to an operator, particularly expert operators, leads to advantages (cf. Klein, Moon, & Hoffman, 2006). In the image fusion literature, the process of an operator using information from multiple separate images for a task is often referred to as cognitive fusion (cf. Blasch & Plano, 2005) because any potential integration of the two images must take place cognitively. Cognitive fusion is a moniker we will adopt for the rest of this paper. Note that cognitive fusion refers to performance using separate images, not necessarily a particular form of cognitive or perceptual process.
In this paper, we suggest a cognitive-theory-driven, performance-based approach, systems factorial technology (SFT), for evaluating image fusion approaches, particularly for comparing algorithmic to cognitive fusion. This approach allows both for more theoretically meaningful measures than raw accuracy or response time (RT) and for insight into the particular aspects of the cognitive process that may have led to better or worse performance. We will begin by briefly reviewing the existing approaches to evaluating image fusion. Next, we review SFT, then apply the methodology to compare algorithmic fusion (in this case Laplacian pyramid fusion, which we describe below) to cognitive fusion (side-by-side image presentation).
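Although the full description of Laplacian pyramid fusion comes later, a minimal sketch may help fix ideas. The following Python/NumPy code is an illustrative simplification, not the implementation used in the experiments: the block-average downsampling, pixel-replication upsampling, pyramid depth, and max-magnitude selection rule are all assumptions made for brevity.

```python
# A minimal sketch of Laplacian pyramid fusion in the spirit of
# Burt & Kolczynski (1993). All design choices here are illustrative.
import numpy as np

def down(img):
    """Downsample by averaging 2x2 blocks (stand-in for blur + decimation)."""
    return img.reshape(img.shape[0] // 2, 2, img.shape[1] // 2, 2).mean(axis=(1, 3))

def up(img):
    """Upsample by pixel replication (stand-in for interpolation)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def laplacian_pyramid(img, levels):
    pyr = []
    for _ in range(levels):
        smaller = down(img)
        pyr.append(img - up(smaller))    # band-pass detail at this scale
        img = smaller
    pyr.append(img)                       # low-pass residual
    return pyr

def fuse(a, b, levels=4):
    pa, pb = laplacian_pyramid(a, levels), laplacian_pyramid(b, levels)
    fused = [np.where(np.abs(la) >= np.abs(lb), la, lb)   # keep stronger detail
             for la, lb in zip(pa[:-1], pb[:-1])]
    fused.append((pa[-1] + pb[-1]) / 2)   # average the low-pass residuals
    out = fused[-1]
    for band in reversed(fused[:-1]):
        out = up(out) + band              # collapse the pyramid
    return out

# Example: fuse a visible and an LWIR image of matching, power-of-two size
visible = np.random.rand(256, 256)
lwir = np.random.rand(256, 256)
composite = fuse(visible, lwir)
```

Selecting the larger-magnitude Laplacian coefficient at each scale is what lets salient detail from either sensor survive into the composite, while the averaged low-pass residual blends overall luminance; it is also why some information from each individual sensor is necessarily discarded.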
Fusion assessment
Image fusion is mostly studied within the field of computer vision, hence the vast majority of metrics of fusion quality are based on computational principles. One of the more common measures is the preservation of edge information, at the individual pixel level (Xydeas & Petrović, 2000); the local, 8×8 pixel grid level (Piella & Heijmans, 2003); or the global image level (Petrović & Xydeas, 2004; Qu, Zhang, & Yan, 2002). These image-level metrics are valuable in that they provide an objective assessment of the amount and quality of information from each single sensor that is represented in the composite image at minimal cost. Two major deficits of limiting assessment to image quality metrics are that they do not account for task-relevant information and are not always predictive of human performance (Smeelen, Schwering, Toet, & Loog, 2014).
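To make the idea concrete, here is a deliberately simplified, hypothetical edge-preservation score in Python. It is not the metric of Xydeas & Petrović (2000), which compares gradient strength and orientation; it only conveys the general approach of checking how much of each source's edge content survives in the composite.

```python
# A simplified, hypothetical edge-preservation score -- NOT the exact
# Xydeas & Petrovic (2000) metric, just an illustration of the idea.
import numpy as np

def gradient_magnitude(img):
    """Finite-difference gradient magnitude as a crude edge map."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def edge_preservation(source, fused, eps=1e-8):
    """How much of the source's edge strength reappears in the fused image."""
    gs, gf = gradient_magnitude(source), gradient_magnitude(fused)
    ratio = np.minimum(gf, gs) / (np.maximum(gf, gs) + eps)  # 1 = fully kept
    # weight by source edge strength so strong source edges dominate the score
    return float((ratio * gs).sum() / (gs.sum() + eps))

# Stand-in images; trivial averaging serves as a placeholder composite
visible = np.random.rand(256, 256)
lwir = np.random.rand(256, 256)
composite = 0.5 * (visible + lwir)

score = 0.5 * (edge_preservation(visible, composite)
               + edge_preservation(lwir, composite))
```

Scores near 1 would indicate that most edge content from both sources is represented in the composite; critically, nothing in such a metric reflects whether the preserved edges are relevant to the operator's task.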
To address the shortcomings of computer-based image quality metrics, subjective user experience questionnaires (asking, for example, about overall image preference, comfort, etc.) are used (Krishnamoorthy & Soman, 2010; Petrović, 2007). This approach offers a partial solution, but subjective quality assessments can also fail to predict variation in performance. Furthermore, when they are used, user experience assessments serve only for outcome assessment and do not directly inform the design process (Toet et al., 2010). Hence, while subjective quality assessments yield some benefit, understanding which design aspects lead to better decision-making and human performance, and informing the design of new fusion approaches, requires directly measuring human performance on a specific task (cf. Blum, 2006; Dixon et al., 2006; Dong, Zhuang, Huang, & Fu, 2009).
Although the literature is relatively limited, human performance with fused imagery has been assessed across a range of basic visual tasks including detection (Krebs et al., 1999), discrimination (e.g., whether a global scene is upright or vertically inverted; Krebs & Sinai, 2002; Toet et al., 1997), recognition (Ryan & Tinkler, 1995; Sinai, McCarley, & Krebs, 1999; Toet & Franken, 2003), and visual search (Neriani, Pinkus, & Dommett, 2008). This research has been conducted in contexts including aviation (Ryan & Tinkler, 1995; Steele & Perconti, 1997) and surveillance (Neriani et al., 2008; Toet & Franken, 2003; Toet et al., 1997). Among these applications, there is a wide range of reported results and overall conclusions. Such discrepancies are potentially due to methodological variation (Ahumada & Krebs, 2000; Essock, Sinai, McCarley, Krebs, & DeFord, 1999; Steele & Perconti, 1997), differences in task descriptions (Krebs & Sinai, 2002; McCarley & Krebs, 2000), and variation in fusion algorithms or sensor combinations (McCarley & Krebs, 2000; Neriani et al., 2008). Additional manipulations often cited in the literature are task type and difficulty, image scene, sensors, and fusion algorithms (Krebs & Sinai, 2002; McCarley & Krebs, 2000). Thus far there is no standard way to compare across manipulations that controls for the amount and type of information provided by each component image.
In many of these studies, performance with composite images was compared to performance with an individual sensor (e.g., LWIR plus visible compared to visible alone). Unfortunately, this comparison confounds whether image fusion enhances performance because of the fusion method implemented or simply because it supplies more information to the observer. We are concerned with answering the question of whether the observer is processing each sensor image as efficiently in a multi-sensor context as when it is presented in isolation. To answer this question effectively, we must compare performance with multiple sensors against a prediction of how well observers should perform given their performance with each individual sensor image alone.
When an observer is provided with two sensor images, regardless of the display type, they have redundant information to inform the correct decision, thereby suggesting an overall faster response. Although it may seem intuitive to equate a redundant-signals performance gain with facilitatory processing, parallel processes with no facilitation can predict significant redundancy gains (Duncan, 1980; Kahneman, 1973; Miller, 1982; Raab, 1962; Townsend & Wenger, 2004). Furthermore, performance decrements may still be observed relative to single-source imagery due to the cost to the perceptual system of dealing with multiple pieces of information (cf. Townsend & Ashby, 1983; Townsend & Wenger, 2004). Thus, it is important to use an appropriate baseline for assessing the gain (or loss) due to an added signal. The capacity coefficient, a measure from SFT that we describe in detail in the next section, addresses this issue because it uses individual source performance to predict what performance would be in a multi-signal context under a baseline model assumption.
By using SFT, we go beyond the simple better/worse distinctions that are possible with the previously applied metrics. SFT allows us to examine the reason for observed performance differences including the differential effects of increasing the amount of available information (i.e., processing efficiency), facilitation or inhibition between the perception of each source of information, whether processing one image source is sufficient or both sources must be processed, and the temporal organization of the perception (i.e., serial versus parallel).
Systems factorial technology
To examine the basic perceptual processing of cognitively and algorithmically fused imagery, we applied SFT. The SFT framework supplies information about important cognitive properties including workload capacity, independence, architecture, and the stopping rule. Workload capacity refers to the change in the processing rate of information from an individual sensor when going from single- to multi-sensor presentation. Independence is the degree to which the processing of each type of sensor information influences the processing of the other. Architecture refers to whether processing is simultaneous (parallel processing), sequential (serial processing), or pooled (coactive processing). The stopping rule refers to whether a response can be made once one sensor has finished processing or only when both have (OR and AND processing, respectively).
These SFT constructs are measured using two statistics. The capacity coefficient is used to examine workload capacity and independence. Thus, it is useful for examining how the cognitive processes involved for each source of information (e.g., each sensor image) speed up or slow down as more sources are simultaneously presented (e.g., multiple sensors). The survivor interaction contrast (SIC) is used to examine architecture and the stopping rule; i.e., the SIC is useful for examining the temporal organization of information processing and the extent to which one or both sensors are processed to completion.
Capacity coefficient
The capacity coefficient is the ratio of observed performance with multi-sensor information to a model-based prediction of performance. The model prediction is unique to each individual and task and is based on an individual’s performance with single-sensor images. To predict performance, the model assumes unlimited capacity, and independent and parallel processing (UCIP). The unlimited capacity assumption means that the processing rate of the individual sensor images is the same whether they are presented in isolation or with the other source (cognitively or algorithmically fused). Independent processing indicates that the distribution of processing times for one source does not change based on processing of the other source. Parallel processing indicates that all sensor information is processed simultaneously.
The formal prediction of the UCIP model for OR processing can be stated in terms of the integrated hazard function, H(t), which indicates the amount of processing completed up to a given time (t). For an OR process, the integrated hazard function of the UCIP model is the sum of the integrated hazard functions for each individual process that operates in the parallel system, i.e.,
$$H_{\text{multi-sensor}}^{\text{UCIP}}(t) = H_{\text{visible}}(t) + H_{\text{LWIR}}(t). $$
By using an individual participant's performance on the visible-only trials to estimate their $H_{\text{visible}}(t)$, and likewise for $H_{\text{LWIR}}(t)$, we arrive at an individualized estimate of what $H_{\text{multi-sensor}}(t)$ would be if that participant were using a UCIP strategy.
The capacity coefficient is the ratio of a participant's actual hazard function when both sources of information are available to their predicted performance if their processing met the UCIP assumptions:
$$ C_{\text{OR}}(t) = \frac{H_{\text{multi-sensor}}(t)}{H_{\text{multi-sensor}}^{\text{UCIP}}(t)}. \tag{1} $$
The numerator of Eq. 1 is the integrated hazard function for multiple sources of information presented simultaneously and the denominator is the summation of the integrated hazard functions of performance for each single source presented in isolation. If C(t)=1, capacity is classified as unlimited, which occurs if all the UCIP assumptions are met. Deviation from one occurs if one or more assumptions of the UCIP model are violated. C(t) less than 1, referred to as limited capacity, can occur if processing each source is slower with more sources present (e.g., due to limited attentional capacity), if there is inhibition among the processes, or if processing is serial rather than parallel. C(t) greater than 1 (super-capacity) implies better performance than a UCIP model and can be due to facilitation between processes including coactive processing.
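As a concrete illustration, the following Python sketch estimates $C_{\text{OR}}(t)$ from response times. The Nelson–Aalen estimator is one standard nonparametric choice for the integrated hazard, but the simulated shifted-exponential RTs, sample sizes, and analysis window are all hypothetical stand-ins for real data.

```python
# A rough sketch of estimating C_OR(t) (Eq. 1) from correct response times.
import numpy as np

def cumulative_hazard(rts, grid):
    """Nelson-Aalen estimate of H(t), evaluated on a common time grid."""
    rts = np.sort(np.asarray(rts))
    n = len(rts)
    increments = 1.0 / (n - np.arange(n))       # 1 / (number still at risk)
    H_at_rts = np.cumsum(increments)
    idx = np.searchsorted(rts, grid, side="right")
    return np.where(idx > 0, H_at_rts[np.maximum(idx - 1, 0)], 0.0)

rng = np.random.default_rng(1)
# Simulated shifted-exponential RTs (seconds) for each single-sensor condition
rt_visible = 0.3 + rng.exponential(0.4, 200)
rt_lwir = 0.3 + rng.exponential(0.5, 200)
# A UCIP (parallel-OR) observer responds with the faster of two independent channels
rt_multi = 0.3 + np.minimum(rng.exponential(0.4, 200), rng.exponential(0.5, 200))

grid = np.linspace(0.4, 2.0, 100)   # window in which responses have occurred
C_or = cumulative_hazard(rt_multi, grid) / (
    cumulative_hazard(rt_visible, grid) + cumulative_hazard(rt_lwir, grid))
# Because rt_multi was generated by an actual UCIP process, C_or should
# hover near 1 here; values < 1 indicate limited, > 1 super capacity.
```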
For inferences regarding the capacity coefficient, we used the standard normal scale statistic (z) derived in Houpt & Townsend (2012) to test individual-level deviation from the UCIP model. For a group-level assessment, we applied either t-tests or ANOVA to the individual-level z scores as appropriate to the hypothesis.
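For instance, given individual-level capacity z scores (the values below are hypothetical), the group-level test reduces to a one-sample t-test against zero:

```python
# Group-level inference on individual capacity z scores. Under the UCIP null
# hypothesis the z scores have mean zero, so a one-sample t-test applies.
import numpy as np
from scipy import stats

z_scores = np.array([-0.4, 1.2, 0.8, 2.1, 0.3, 1.5])   # one per participant
t_stat, p_value = stats.ttest_1samp(z_scores, popmean=0.0)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```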
Survivor interaction contrast
The SIC is used to examine whether multiple sources of information are processed serially, in parallel, or pooled together (coactive) and whether one (OR processing) or both (AND processing) sensors are processed in their entirety. Inference based on SICs is done by examining the interaction between slowing down and speeding up cognitive processing of each individual source. We use $S(t)$ for the survivor function (i.e., the probability that a participant has not responded by a given time) and indicate the level of the salience manipulation by the subscript of $S(t)$. High salience conditions are denoted H and low salience conditions are denoted L. Throughout this paper, the first subscript indicates the level of the LWIR signal and the second subscript indicates the level of the visible sensor. For example, the survivor function of the RTs when LWIR is high salience and visible is low salience is denoted $S_{\text{HL}}(t)$. Using this notation, the SIC is defined as:
$$ \text{SIC}(t)= \left[S_{\text{LL}}(t)-S_{\text{LH}}(t)\right]-\left[S_{\text{HL}}(t)-S_{\text{HH}}(t)\right]. \tag{2} $$
The manipulations that speed up or slow down processing, known as the salience manipulations, must affect only the speed of processing for the respective source of information, a property known as selective influence (Ashby & Townsend, 1980; Dzhafarov, 2003). If the manipulation is effective and selective influence holds, the fastest responses are made when both sources have high salience and the slowest when both sources have low salience. If effective selective influence manipulations are used, each of the five classes of models predicts a unique SIC shape (see Fig. 1; Dzhafarov, Schweickert, & Sung, 2004; Houpt & Townsend, 2011; Townsend & Nozawa, 1995; Zhang & Dzhafarov, 2015).
Positive and negative SIC deviations from zero are tested using the Houpt–Townsend statistic (Houpt & Townsend, 2010) and are used to reject candidate processing models. Specifically, the statistic tests for significant deviations from zero of both the largest positive ($D^{+}$) and largest negative ($D^{-}$) value of the SIC curve. If the cognitive process follows a serial-OR rule, the predicted SIC is flat and hence neither $D^{+}$ nor $D^{-}$ should be significant. A parallel-AND model implies an all-negative SIC, which should lead to a significant $D^{-}$ but non-significant $D^{+}$. A parallel-OR model implies an all-positive SIC, hence a significant $D^{+}$ but non-significant $D^{-}$. Both the serial-AND and coactive models result in an SIC that is first negative then positive, so both $D^{+}$ and $D^{-}$ should be significant.
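The following Python sketch shows how the empirical SIC and the $D^{+}$/$D^{-}$ summaries can be computed. The parallel-OR simulation, channel speeds, and time grid are illustrative assumptions; the actual hypothesis test relies on the null distribution derived by Houpt & Townsend (2010), which is not reproduced here.

```python
# A sketch of the empirical SIC (Eq. 2) and the D+ / D- summary values.
import numpy as np

def survivor(rts, grid):
    """Empirical survivor function S(t) = P(RT > t) on a common grid."""
    return np.array([(np.asarray(rts) > t).mean() for t in grid])

rng = np.random.default_rng(7)
mean_time = {"H": 0.35, "L": 0.6}   # high salience -> faster channel (seconds)

def simulate(lwir, vis, n=500):
    """Parallel-OR observer: respond when the faster channel finishes."""
    return 0.25 + np.minimum(rng.exponential(mean_time[lwir], n),
                             rng.exponential(mean_time[vis], n))

grid = np.linspace(0.25, 2.5, 200)
# First subscript: LWIR salience; second: visible salience
S = {c: survivor(simulate(c[0], c[1]), grid) for c in ("LL", "LH", "HL", "HH")}
sic = (S["LL"] - S["LH"]) - (S["HL"] - S["HH"])

D_plus, D_minus = sic.max(), -sic.min()   # largest positive/negative deviations
# For this parallel-OR simulation the SIC is all-positive, so D+ is large
# and D- is near zero; the significance test itself is not shown here.
```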
Rather than using the traditional conservative cutoff for statistical significance (α=0.05), we use α=0.33 for our applications of the Houpt–Townsend statistic. Typically, α is set to be biased towards indicating a non-significant effect to limit Type I errors. The null hypothesis for the Houpt–Townsend statistic is SIC(t)=0 for all t and hence conservative α levels bias the tests toward indicating a serial-OR signature (flat SIC). While this approach has worked well for model recovery in simulated data (Houpt, 2014), we also applied a recently developed hierarchical Bayesian analysis to the mean interaction contrast (MIC), which we introduce next, to corroborate conclusions from the Houpt–Townsend statistics.
Mean interaction contrast
The Houpt–Townsend tests of SIC deviations (Houpt & Townsend, 2010) are used to classify the processing model; however, because they target distribution-level properties, these tests can be less statistically powerful than mean-level tests. Hence, in some cases it is advantageous to analyze the mean interaction contrast:
$$ \text{MIC}= \left[M_{\text{LL}}-M_{\text{LH}}\right]-\left[M_{\text{HL}}-M_{\text{HH}}\right], \tag{3} $$
where $M$ denotes the mean RT in the corresponding salience condition. Note that, unlike the SIC, the MIC is a single number rather than a function of time.
MIC predictions for each class of models can be easily derived from the SIC predictions by noting that the integral of the survivor function of a positive random variable is equal to its mean. This implies the area under the curve of the SIC is the MIC. Thus, if processing is parallel (all positive SIC or all negative SIC) then the MIC is nonzero (positive for parallel-OR and negative for parallel-AND). The coactive SIC has both positive and negative ranges, but the positive region is larger, hence the predicted MIC is positive. In contrast, both serial models predict an MIC equal to zero: The serial-OR model has a flat SIC, so the area under the curve is zero. The serial-AND model has both positive and negative regions of the SIC, but they are equal in area so the area under the curve is zero.
The MIC is useful in distinguishing between serial-AND and coactive processes. While both processes imply positive and negative regions of the SIC curve (and, hence, significant $D^{+}$ and $D^{-}$), the coactive model predicts MIC > 0, while a serial-AND model implies MIC = 0.
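Continuing the simulation sketch above (reusing `simulate`, `sic`, and `grid`), the MIC needs only the four condition means, and the area-under-the-SIC identity offers a quick consistency check. Again, all values come from the illustrative parallel-OR simulation, not real data.

```python
# MIC (Eq. 3) from condition mean RTs, reusing simulate() from the SIC sketch.
rt = {c: simulate(c[0], c[1]) for c in ("LL", "LH", "HL", "HH")}
mic = (rt["LL"].mean() - rt["LH"].mean()) - (rt["HL"].mean() - rt["HH"].mean())

# Consistency check: the MIC equals the area under the SIC curve, so a
# crude Riemann sum over the estimated SIC should land close to `mic`.
mic_from_sic = (sic[:-1] * (grid[1] - grid[0])).sum()
# For this parallel-OR simulation both values should be positive (MIC > 0);
# a serial-AND process would instead give MIC ~ 0 despite its S-shaped SIC.
```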
A hierarchical Bayesian analysis can estimate a full posterior distribution for both group- and individual-level inferences regarding the MIC (Houpt & Fifić, 2013). Furthermore, this analysis allows for direct comparison between a zero MIC and a positive/negative MIC instead of relying on null-hypothesis significance testing. In this analysis, we used a prior distribution over models in which MIC = 0 was most likely (50%) while MIC > 0 and MIC < 0 were less likely, with equal probability (25% each). This prior was based on the assumption that the possible classes of models were equally likely: serial-OR and serial-AND each imply MIC = 0, parallel-OR implies MIC > 0, and parallel-AND implies MIC < 0. From these analyses, we will report the group-level posterior probability of MIC = 0, which we denote \(\hat {p}_{\text {posterior}}^{0}\); MIC > 0, which we denote \(\hat {p}_{\text {posterior}}^{+}\); and MIC < 0, indicated by \(\hat {p}_{\text {posterior}}^{-}\). We also report the range of individual-level posterior probabilities for each classification of MIC results: positive, negative, or zero.
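Schematically, the prior and a Bayes-rule update look like the snippet below. This is only a conceptual illustration with hypothetical marginal likelihoods; the actual analysis estimates these quantities within the full hierarchical model of Houpt & Fifić (2013).

```python
# Conceptual illustration of the prior over MIC classes and a Bayes-rule
# update. The likelihoods are hypothetical placeholders, not real estimates.
prior = {"MIC=0": 0.50, "MIC>0": 0.25, "MIC<0": 0.25}
likelihood = {"MIC=0": 0.02, "MIC>0": 0.10, "MIC<0": 0.001}  # p(data | class)

evidence = sum(prior[m] * likelihood[m] for m in prior)
posterior = {m: prior[m] * likelihood[m] / evidence for m in prior}
# Here posterior["MIC>0"] ~ 0.71, which would favor parallel-OR or coactive
```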
Although the hierarchical Bayesian approach offers advantages over the Houpt–Townsend statistic, because it focuses on the MIC it cannot detect the features of the SIC that discriminate between serial-OR and serial-AND (MIC = 0 for both) or between parallel-OR and coactive processing (MIC > 0 for both). Hence, we report both the Houpt–Townsend statistics and the results of the hierarchical Bayesian MIC analysis below.
Hypotheses
The use of SFT allows us to examine the underlying processes to help explain why we may see performance benefits of a particular operator display. Each variation in processing structure may indicate the cause of a particular pattern of performance. If participants are presented with task-relevant yet redundant information across sensors, they may adopt a processing strategy in which information from only one sensor is used to make the decision (i.e., OR, or first-terminating, processing). OR processing may combine with either a parallel- or serial-processing structure: either information from both sensors is processed simultaneously but only the fastest to finish is used to make the discrimination (parallel-OR), or information from one sensor is processed and used for the decision while the alternative sensor is not processed (serial-OR). Alternatively, individual sensor images may each contribute unique complementary information, forcing participants to process both sensors entirely to make a correct decision (AND processing). AND processing may also combine with either a parallel- or serial-processing structure: both sensors are processed simultaneously and the slowest to finish determines the discrimination (parallel-AND), or both sensors are fully processed, first one, then the other (serial-AND). Fusion also allows for a single percept in which all information is processed in parallel and pooled to make a decision (coactive processing).
Here we discuss, at a more conceptual level, what particular processing mechanisms would suggest about visual cognition for each presentation type: algorithmic and cognitive fusion.
For algorithmically fused images, standard serial and parallel architectures are possible, although they are a priori unlikely. An interpretation of such a finding would be that participants can selectively attend to particular spatial frequency information based on distinctive features to complete the task (Morrison & Schyns, 2001). Alternatively, if observers are unable to extract information selectively from each perceptual dimension, as indicated by McCarley & Krebs (2006), then a coactive or interactive parallel process is more likely (cf. Eidels, Houpt, Pei, Altieri, & Townsend, 2011). For algorithm-fused imagery, we hypothesize: (1) individuals' efficiency will be at least as high as their respective UCIP predictions (i.e., unlimited capacity) across all discrimination stimuli and (2) individuals will use a highly interactive parallel mechanism for processing the multi-sensor information.
When images are presented beside one another (i.e., cognitive fusion), people may process each sensor image in series or in parallel. If processing both images requires visual attention shifts between the two images, then it may be more likely that the images are processed in series. This mechanism limits performance by the constraints of mental integration across several samples of information (Irwin, 1991; Rayner, McConkie, & Zola, 1980). However, serial processes can lead to efficient processing if information from only one image is sufficient for adequate judgments and the additional image is redundant and potentially unnecessary (Neriani et al., 2008).
Alternatively, people may process and potentially integrate the two images in parallel, leaving the opportunity for facilitation in judgment performance due to pictorial redundancy speed-ups (Pollatsek, Rayner, & Collins, 1984), which would imply facilitatory parallel or coactive processing. In contrast, if processing the information across two images is a larger drain on attentional resources, degrading performance with each image (Rousselet, Fabre-Thorpe, & Thorpe, 2002; Scharff, Palmer, & Moore, 2011), inhibitory parallel processing would be observed. For cognitive fusion, we hypothesize:
1. Performance will be no worse than with algorithmic fusion; therefore, individuals' efficiency will be at least as high as their respective UCIP predictions (i.e., unlimited capacity) across all discrimination stimuli.
2. Individuals will use efficient parallel mechanisms for processing the multi-sensor information.
The cognitive processes involved in utilizing information from multiple sensors may differ from those involved in processing a single sensor image. A cognitively motivated baseline model encodes a specific set of processes so that systematic deviations from the baseline provide evidence for how the processes have changed. Furthermore, using a standardized method to assess deviations of actual performance from the performance predicted from the individual parts yields a flexible approach for comparing human processing across experimental manipulations such as alternative sensors, stimuli, and fusion methods.