Materials and methods
Participants
A total of 37 participants were recruited for the experiment, five of whom were excluded from the final analysis. One participant was withdrawn due to low acuity, three participants were withdrawn for mean reaction times in excess of 1000 ms, and one participant was excluded because we had achieved the needed gender and age distribution. All other participants had normal or corrected-to-normal acuity, as assessed using both the Federal Aviation Administration’s test for near acuity (Form 8500-1) and the Snellen Eye Chart for distance acuity. All data reported were from a final set of 32 participants (16 men). The sample was additionally divided into older and younger cohorts (16 participants in each; 8 men, 8 women), with the younger participants in the age range of 20–29 years (mean age, 24.1 years) and the older participants in the age range of 60–69 years (mean age, 64.4 years). All participants provided informed consent prior to data collection in accordance with the requirements of MIT’s Committee on the Use of Humans as Experimental Subjects (COUHES) and the Declaration of Helsinki.
Apparatus, stimuli, and procedure
Apparatus. All stimuli were presented using PsychoPy (Peirce, 2007, 2008) on a Mac Mini (Apple Computer, Cupertino, CA, USA). Stimuli were displayed on a 68 cm Acer LCD display (Model B276HI) at a resolution of 1920 × 1200 pixels with a refresh rate of 60 Hz and a viewing distance of 70 cm. Head position was unconstrained, allowing for a degree of positional variability likely to be encountered in real-world viewing scenarios. Participants performed the task in a dimly lit (~10 lux) room.
Stimuli. All stimuli were six-letter words or non-words, as used by Dobres et al. (2016), with the words originally selected from the MCWord database of unique wordforms by Medler and Binder (2005). Stimuli in the experiment were generated in the humanist sans serif typeface Frutiger, for comparability with previous work by the co-authors (cf. Dobres et al., 2016; Dobres, Chahine, Reimer, Mehler, & Coughlin, 2014), and rendered at 4 mm (0.33°) capital letter height onscreen. While we use capital letter height as the measure of optical size, in accordance with previous work in this area, all stimuli consisted of lowercase letters. Non-degraded stimuli consisted of white text (223 cd/m²) on a black (0.34 cd/m²) background (negative polarity), measured at the display surface with a Gossen Mavo-Monitor luminance meter. Negative polarity was used to maximize observed differences between conditions, based on previous work with this typeface by the authors. Negative polarity is commonly used for in-vehicle displays under low ambient illumination conditions.
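The relation between the physical letter height, the viewing distance, and the visual angle quoted above can be checked with a few lines of trigonometry. The following is a minimal sketch; the function names are illustrative and are not taken from the experiment code.

```python
import math

VIEWING_DISTANCE_MM = 700.0  # 70 cm viewing distance, from the Apparatus description


def visual_angle_deg(size_mm, distance_mm):
    """Visual angle (degrees) subtended by an object of the given size."""
    return math.degrees(2 * math.atan(size_mm / (2 * distance_mm)))


def size_mm_for_angle(angle_deg, distance_mm):
    """Physical size (mm) that subtends the given visual angle."""
    return 2 * distance_mm * math.tan(math.radians(angle_deg) / 2)


# 4 mm capital letter height viewed from 70 cm
cap_height_deg = visual_angle_deg(4.0, VIEWING_DISTANCE_MM)
print(round(cap_height_deg, 2))  # 0.33
```

Because head position was unconstrained, the effective angle for any given participant would vary somewhat around this nominal value.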
To assess the differential impacts of blur and noise, respectively, on legibility for older and younger participants, we used two independent degradation conditions in our experiments. To simulate the reduced sensitivity to high spatial frequencies, on some trials we blurred our stimuli. On other trials, to approximate the effects of broadening of neuronal tuning, we presented our lexical stimuli in a field of noise (see Fig. 1) to diminish their discriminability (Damera-Venkata, Kite, Geisler, Evans, & Bovik, 2000; Michel, Chen, Geisler, & Seidemann, 2013). While these degradations are imperfect representations of the effects of aging, these transformations allow us to examine specific facets of age and legibility. We note that the gradual nature of aging means that our older participants may have developed compensatory strategies for similar changes in their visual systems; however, the synthetic nature of our degradations should reduce the effectiveness of any compensatory strategies.
In the trials where we blurred the stimuli, this was accomplished by convolving full contrast text images with a Gaussian kernel of different sizes to achieve different levels of blur. The standard deviations of the Gaussian blur kernels used in this experiment were 4.3, 5.8, 8.7, and 11.5 arcmin (for our 70 cm viewing distance), based on pilot testing. Increasing the standard deviation increases the image blur and decreases the available resolution. In our noise trials, we added a field of 1/f noise to the text image (Fig. 1a) at different levels of noise contrast. Noise contrast levels were chosen based on pilot testing to assess a full range of performance, from ceiling to chance. Noise patches were 2.4° high and 4.8° wide and had one of four contrast levels: 50, 65, 80, and 95%. The contrast of the full image (noise with text) was maintained at 100% for each noise contrast condition. Both the blur and noise conditions also included a no-degradation condition (0 arcmin of blur, 0% noise contrast) as a baseline for a total of five levels in each condition.
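To make the blur manipulation concrete, the sketch below converts the kernel standard deviations from arcminutes to pixels and builds a normalized Gaussian kernel. It is an illustration under stated assumptions, not the authors' rendering code: the pixel pitch is derived from the reported 68 cm diagonal and 1920 × 1200 resolution assuming square pixels, the kernel is truncated at three standard deviations, and all function names are hypothetical.

```python
import math

DIAGONAL_MM = 680.0            # 68 cm display diagonal (reported)
RES_X, RES_Y = 1920, 1200      # reported resolution
VIEWING_DISTANCE_MM = 700.0    # reported viewing distance
BLUR_SD_ARCMIN = (4.3, 5.8, 8.7, 11.5)


def pixel_pitch_mm():
    """Pixel pitch, assuming square pixels across the full diagonal."""
    return DIAGONAL_MM / math.hypot(RES_X, RES_Y)


def arcmin_to_pixels(arcmin):
    """Convert a visual angle in arcminutes to an on-screen extent in pixels."""
    size_mm = 2 * VIEWING_DISTANCE_MM * math.tan(math.radians(arcmin / 60) / 2)
    return size_mm / pixel_pitch_mm()


def gaussian_kernel_1d(sigma_px):
    """Normalized 1-D Gaussian kernel, truncated at 3 sigma (an assumption)."""
    radius = max(1, math.ceil(3 * sigma_px))
    weights = [math.exp(-(i * i) / (2 * sigma_px ** 2))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row, kernel):
    """Convolve one image row with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(row)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * row[j]
        out.append(acc)
    return out


sigmas_px = [arcmin_to_pixels(a) for a in BLUR_SD_ARCMIN]
```

A Gaussian blur is separable, so a full image can be blurred by applying `blur_row` to every row and then to every column of the result. Under these assumptions the smallest kernel (4.3 arcmin) corresponds to a standard deviation of roughly three pixels.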
Procedure. Each trial consisted of the following sequence (Fig. 1b). First, a precue was presented for 1000 ms at the center of the screen to indicate the region where the lexical stimulus would be presented. The precue consisted of four “L” shapes (0.48° on a side) rotated and positioned to form the corners of a rectangle subtending 4.8° horizontally and 2.4° vertically. No stimuli were presented outside the region indicated by the rectangular cue. This was followed by a 200 ms screen-centered mask consisting of a string of eight random punctuation characters (selected with replacement from: =, ^, <, >, and |). Following this mask, participants were shown a set of letters that had an equal probability of forming a word or a non-word. Six-letter words and non-words were selected randomly without replacement from separate lists of 299 and 291 alternatives, respectively, and had a randomly selected level of either blur or noise. All word and non-word stimuli were presented for 250 ms, immediately followed by a different random punctuation mask, presented for 200 ms.
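The trial sequence described above can be summarized as a simple timeline. The sketch below is illustrative only (it is not the PsychoPy script used in the experiment, and the event names are hypothetical); it shows that every reported duration corresponds to a whole number of frames at the reported 60 Hz refresh rate, and how a pair of distinct punctuation masks might be drawn.

```python
import random

REFRESH_HZ = 60          # reported refresh rate
MASK_CHARS = "=^<>|"     # punctuation set listed in the Procedure
MASK_LENGTH = 8

# (event name, duration in ms), in presentation order
TRIAL_SEQUENCE_MS = [
    ("precue", 1000),
    ("forward_mask", 200),
    ("stimulus", 250),
    ("backward_mask", 200),
]


def ms_to_frames(duration_ms, refresh_hz=REFRESH_HZ):
    """Convert a duration to frames, rejecting durations that do not fit exactly."""
    if (duration_ms * refresh_hz) % 1000 != 0:
        raise ValueError(f"{duration_ms} ms is not a whole number of frames")
    return duration_ms * refresh_hz // 1000


def make_mask(rng):
    """Eight punctuation characters drawn with replacement."""
    return "".join(rng.choice(MASK_CHARS) for _ in range(MASK_LENGTH))


def make_mask_pair(rng):
    """Forward and backward masks; redraw until the second differs from the first."""
    first = make_mask(rng)
    second = make_mask(rng)
    while second == first:
        second = make_mask(rng)
    return first, second


frames = {name: ms_to_frames(ms) for name, ms in TRIAL_SEQUENCE_MS}
print(frames)  # {'precue': 60, 'forward_mask': 12, 'stimulus': 15, 'backward_mask': 12}
```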
Following the final mask, participants were instructed to respond as to whether the lexical stimulus was a word or a non-word by pressing a key on the keyboard. They were given a warning display if they took longer than 5000 ms to respond. Trials in which reaction times exceeded 5000 ms were excluded from the analysis (0.016% of all trials). There were 20 trials for every unique combination of stimulus category (word versus non-word), degradation type (noise versus blur), and degradation level (five levels), for a total of 400 trials per participant. Trial order was randomized for each participant and the experiment was divided into eight blocks of 50 trials, with breaks between blocks.
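The factorial design above (2 stimulus categories × 2 degradation types × 5 levels × 20 repetitions = 400 trials, in eight blocks of 50) can be sketched as follows. This is one plausible way to build such a trial list, not the authors' code; stimuli are represented by hypothetical integer indices into the 299-word and 291-non-word lists rather than by the actual letter strings.

```python
import random

CATEGORIES = ("word", "non-word")
LIST_SIZES = {"word": 299, "non-word": 291}   # reported list lengths
DEGRADATIONS = {
    "blur": (0.0, 4.3, 5.8, 8.7, 11.5),       # Gaussian SD in arcmin
    "noise": (0, 50, 65, 80, 95),             # noise contrast in percent
}
REPS = 20
N_BLOCKS = 8


def build_trials(rng):
    """Balanced, fully shuffled trial list with items drawn without replacement."""
    conditions = [
        (cat, deg, lvl)
        for cat in CATEGORIES
        for deg, levels in DEGRADATIONS.items()
        for lvl in levels
        for _ in range(REPS)
    ]
    n_per_category = len(conditions) // len(CATEGORIES)
    # Sample item indices without replacement, separately for each list
    item_ids = {cat: rng.sample(range(LIST_SIZES[cat]), n_per_category)
                for cat in CATEGORIES}
    trials = [
        {"category": cat, "degradation": deg, "level": lvl,
         "item_id": item_ids[cat].pop()}
        for cat, deg, lvl in conditions
    ]
    rng.shuffle(trials)
    return trials


def split_into_blocks(trials, n_blocks=N_BLOCKS):
    """Split the shuffled trial list into equally sized consecutive blocks."""
    size = len(trials) // n_blocks
    return [trials[i * size:(i + 1) * size] for i in range(n_blocks)]


trials = build_trials(random.Random(0))
blocks = split_into_blocks(trials)
```

Each participant would receive a different seed, so that trial order differs across participants while the design remains balanced.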
Prior to the start of the experiment, participants performed a small set of practice trials until they had correctly completed five consecutive trials. In these practice trials, the lexical stimuli were presented without any blur or noise, generated in the typeface Georgia, and presented for 1000 ms, rather than the 250 ms in the main experiment. Participants also received visual feedback regarding their accuracy on each trial during the practice phase of the experiment. No feedback regarding accuracy was provided during the main experiment.
Analysis. For each participant and type of degradation (blur and noise), we used maximum likelihood estimation to fit a two-parameter psychometric function, a cumulative Normal, to the lexical decision accuracy as a function of degradation level:
$$ \Phi(x) = \frac{1}{2} + \frac{1}{2\sigma\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{(t-\mu)^2}{2\sigma^2}} \, dt \tag{1} $$
where μ represents the mean (horizontal shift) and σ represents the standard deviation (slope). Mean goodness of fit, averaged across participants, was R² = 0.93 for the blur condition and R² = 0.81 for the noise condition. The critical question is how the performance curves differ for older versus younger participants. To this end, differences between age cohorts were tested with two-tailed unpaired Welch’s t-tests and effect size was determined using Cohen’s d.
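As an illustration of the fitting step, the sketch below performs a maximum likelihood fit of a two-parameter cumulative Normal with a 50% guessing floor by brute-force grid search. Several points are assumptions rather than details from the source: because accuracy falls as degradation increases, the sketch uses the falling form 0.5 + 0.5 · (1 − F((x − μ)/σ)), where F is the standard Normal CDF, so that μ is the 75% correct threshold; the optimizer (a grid search) is chosen for transparency and is not necessarily what was used; and the response counts are invented purely to exercise the code.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def p_correct(x, mu, sigma):
    """Falling psychometric function: near 1.0 at low degradation, 0.5 (chance) at high."""
    return 0.5 + 0.5 * (1.0 - _STD_NORMAL.cdf((x - mu) / sigma))


def log_likelihood(mu, sigma, levels, n_correct, n_trials):
    """Binomial log-likelihood of the observed counts under (mu, sigma)."""
    total = 0.0
    for x, k, n in zip(levels, n_correct, n_trials):
        # Clamp to avoid log(0) when the model predicts exactly 0 or 1
        p = min(max(p_correct(x, mu, sigma), 1e-9), 1.0 - 1e-9)
        total += k * math.log(p) + (n - k) * math.log(1.0 - p)
    return total


def fit_mle(levels, n_correct, n_trials, mu_grid, sigma_grid):
    """Return the (mu, sigma) pair on the grid with the highest likelihood."""
    best = max(
        (log_likelihood(m, s, levels, n_correct, n_trials), m, s)
        for m in mu_grid
        for s in sigma_grid
    )
    return best[1], best[2]


# Hypothetical data for one participant in the blur condition:
# 40 trials per level (20 words + 20 non-words); counts are made up.
levels = [0.0, 4.3, 5.8, 8.7, 11.5]
n_trials = [40] * 5
n_correct = [39, 30, 26, 21, 20]

mu_grid = [i * 0.05 for i in range(241)]           # 0 to 12 arcmin
sigma_grid = [0.2 + i * 0.05 for i in range(117)]  # 0.2 to 6 arcmin

mu_hat, sigma_hat = fit_mle(levels, n_correct, n_trials, mu_grid, sigma_grid)
```

In practice a gradient-based or simplex optimizer would replace the grid, but the likelihood being maximized is the same.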
In addition to this fit-based analysis, we also performed an accuracy-based analysis for both the blur and noise conditions, in which we compared percent correct performance between the two age groups, at each level of degradation, as an additional verification of our findings in the fit-based analysis. We performed two separate mixed-model ANOVAs, one for each degradation type (noise and blur), with age group as a between-subjects factor and the five degradation levels (either noise contrast or blur in arcminutes) as a within-subjects factor. Effect sizes are reported as eta-squared.
Finally, each comparison includes an estimate of the corresponding Bayes factor of the alternative hypothesis (H1) against the null (H0), reported as BF₁₀, and calculated using the Jeffreys-Zellner-Siow prior (Zellner & Siow, 1980). Values of BF₁₀ that are greater than 1 indicate that the observed data are more likely under the alternative than the null. The converse is true for values of BF₁₀ that are less than 1 (i.e. the observed result is more likely under the null).
Results
Analysis of psychometric functions for older versus younger adults
While both types of degraded trials were interleaved in our experiment, we will discuss them separately for clarity, as they are two entirely independent stimulus manipulations.
In the blur condition, we find a significant shift in the psychometric function between older and younger observers (t(28.9) = 3.57, p = 0.001, d = 1.26, BF₁₀ = 25.89). Specifically, it is useful to consider the midpoint of the psychometric function, the 75% correct threshold. Compared to younger observers, accuracy for older observers dropped to 75% correct at a lower level of blur (2.95 versus 4.43 arcmin; Fig. 2a; see Fig. 2b for threshold by age group and Fig. 2d for exemplar individual participant data). Similarly, in the noise condition, older observers had lower 75% thresholds (i.e. worse performance) than younger observers (t(27.2) = 3.70, p < 0.001, d = 1.31, BF₁₀ = 34.34). Accuracy for older observers dropped to 75% at a lower noise contrast level than it did for younger observers (58.8 versus 70.3% contrast). Therefore, in order to equate performance between younger and older participants in the blur condition, the Gaussian kernel SD would need to be increased by 1.48 arcmin for the stimuli presented to the 20–29 age group. To do the same in the noise condition, an additional 11.5% noise contrast would need to be added. The group thresholds are visualized in Fig. 2c.
To determine whether there was any difference in how steeply performance declines (as blur or noise increases), we compared the fitted slope parameters (σ) between the age groups. There was no difference between the 20–29 age group and the 60–69 age group for either the blur condition (2.72 versus 2.28, t(29.9) = 1.11, p = 0.28, d = 0.39, BF₁₀ = 0.54) or the noise condition (0.25 versus 0.34, t(27.9) = −1.50, p = 0.15, d = 0.53, BF₁₀ = 0.78). Therefore, any differences between the age groups are best summarized as a lateral shift in the psychometric function, without a difference in slope.
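The fractional degrees of freedom in the comparisons above (e.g. t(29.9)) arise from the Welch-Satterthwaite approximation used by Welch's t-test. The sketch below computes the Welch statistic, its degrees of freedom, and Cohen's d with a pooled standard deviation. The pooled-SD definition of d is an assumption, although it is consistent with the reported values: for two groups of equal size n, t = d · √(n/2), and 1.26 × √8 ≈ 3.56, in line with the reported t(28.9) = 3.57. The data are synthetic and the p-value is omitted because the t distribution is not available in the Python standard library.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's unequal-variance t statistic and Welch-Satterthwaite df."""
    va = variance(a) / len(a)   # squared standard error of each group mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Synthetic thresholds for two groups of 16 (not the study's data)
rng = random.Random(0)
younger = [rng.gauss(4.4, 1.2) for _ in range(16)]
older = [rng.gauss(3.0, 1.1) for _ in range(16)]

t_stat, df = welch_t(younger, older)
d = cohens_d(younger, older)
```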
As we will discuss later, knowing the shift of the psychometric function with age is particularly useful for providing design intuitions, because it provides a single value that describes the differences between older and younger participants. One can, of course, also look at other points on the curve, if that is of relevance for a particular research question, e.g. how would we expect older adults to respond to a slightly blurred user interface compared to younger adults. If the psychometric functions were perfect cumulative Normal functions with no change in slope or asymptotic performance, we would observe the same shift in 90% thresholds as we observe in 75% thresholds, but of course none of these assumptions holds exactly. In the blur condition, we observed a trending difference in the 90% threshold between the two age groups, with older observers’ performance dropping to 90% at a lower level of blur compared to younger observers (1.04 versus 2.16 arcmin, t(29.28) = 2.0, p = 0.06, d = 0.71, BF₁₀ = 1.47). In the noise condition, the difference between the two age groups was significant, t(27.05) = 2.48, p = 0.02, d = 0.88, BF₁₀ = 3.13. Compared to younger observers, older observers required a lower level of noise in order for performance to drop to 90% (30 versus 49% contrast).
Accuracy analysis
While the differences between the psychometric functions detailed in the previous section are highly informative, it is also valuable to verify those results using a complementary method. In the blur condition, an ANOVA on percent correct responses showed a significant main effect of age group (F(1,30) = 12.13, p = 0.002, η² = 0.29, BF₁₀ = 10.63), with lower accuracy in the 60–69 age group than in the 20–29 age group (63.0 versus 67.8%, respectively). As expected, there was a significant main effect of blur level (F(4,120) = 307.31, p < 0.001, η² = 0.89, BF₁₀ = 4.10 × 10⁶²), with performance decreasing as the level of blur increased. This result would be expected regardless of any differences between the two age groups and indicates that our blur manipulation reduced accuracy in the lexical decision task. The interaction between age group and blur level was also significant, F(4,120) = 9.02, p < 0.001, η² = 0.03, BF₁₀ = 1.34 × 10⁴. Comparisons between the two age groups at each level of blur (using a Šidák-corrected alpha of 0.01) showed a significant effect at 4.3 arcminutes (t(30) = 4.14, p < 0.001, d = 1.47, BF₁₀ = 95.48). The difference between the two age groups was not significant at any of the remaining blur levels, including the no-blur condition (all p values > 0.08, BF₁₀ < 1.15).
An ANOVA on participants’ performance in the noise condition yielded similar results. There was a significant main effect of age group (F(1,30) = 19.12, p < 0.001, η² = 0.39, BF₁₀ = 85.19), with lower overall accuracy in the 60–69 age group than in the 20–29 age group (73.4 versus 78.4%). The effect of noise level was also significant (F(4,120) = 532.07, p < 0.001, η² = 0.94, BF₁₀ = 1.49 × 10⁸²), indicating that the noise manipulation reduced participants’ performance. Finally, we observed a significant interaction between age group and noise level, F(4,120) = 4.55, p = 0.002, η² = 0.008, BF₁₀ = 23.43. The difference between the 20–29 age group and the 60–69 age group was significant at both the 65% (t(30) = 3.98, d = 1.41, p < 0.001, BF₁₀ = 64.61) and the 80% noise contrast levels (t(30) = 3.53, d = 1.24, p = 0.001, BF₁₀ = 23.98). The difference between the age groups was not significant at the remaining contrast levels (all p values > 0.09, BF₁₀ < 1.07), including the no-noise (0% contrast) level.
Together, the results from the accuracy analysis are consistent with the psychometric fitting results. In both the blur and the noise conditions, we see a significant difference between the two age groups only at intermediate levels of blur (or noise) and not at the extremes (i.e. the lowest and highest levels of blur or noise). This pattern of results is consistent with a lateral (horizontal) shift of a sigmoid function, which produces larger differences in the y-values (percentage correct) at intermediate x-values (e.g. intermediate levels of blur) and a smaller difference at the extremes.
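The claim that a purely lateral shift produces the largest accuracy gap at intermediate degradation levels can be illustrated numerically. The sketch below evaluates two falling cumulative Normal curves at the five blur levels, using the reported 75% thresholds (4.43 and 2.95 arcmin) and, as a simplifying assumption, a single shared slope of 2.5 arcmin (between the two reported group slopes). The resulting values are model predictions for illustration, not the measured accuracies.

```python
from statistics import NormalDist

BLUR_LEVELS_ARCMIN = (0.0, 4.3, 5.8, 8.7, 11.5)
MU_YOUNGER, MU_OLDER = 4.43, 2.95   # reported 75% thresholds (arcmin)
SIGMA_SHARED = 2.5                  # assumed common slope (arcmin)


def p_correct(x, mu, sigma):
    """Falling psychometric function with a 50% guessing floor."""
    return 0.5 + 0.5 * (1.0 - NormalDist().cdf((x - mu) / sigma))


gaps = [
    p_correct(x, MU_YOUNGER, SIGMA_SHARED) - p_correct(x, MU_OLDER, SIGMA_SHARED)
    for x in BLUR_LEVELS_ARCMIN
]

for level, gap in zip(BLUR_LEVELS_ARCMIN, gaps):
    print(f"{level:5.1f} arcmin: predicted accuracy gap = {100 * gap:.1f} points")
```

With these parameters the predicted gap peaks at the 4.3 arcmin level and shrinks toward both ends of the range, which matches the level at which the accuracy analysis found a significant group difference in the blur condition.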
Reaction time
Finally, we analyzed observers’ mean reaction times using a separate 5 (degradation level) × 2 (age group) mixed-model ANOVA for each degradation type. In the blur condition, there was a significant main effect of age group (F(1,30) = 15.51, p < 0.001, η² = 0.34, BF₁₀ = 47.97), with older participants responding more slowly than younger observers (620.8 ms and 399.6 ms, respectively). Neither the main effect of blur level (F(4,120) = 0.81, p = 0.52, η² = 0.03, BF₁₀ = 0.06) nor the interaction between blur level and age group reached significance (F(4,120) = 0.89, p = 0.47, η² = 0.03, BF₁₀ = 0.15).
In the noise condition, the main effect of age group was also significant (F(1,30) = 18.94, p < 0.001, η² = 0.39, BF₁₀ = 137.61), with slower mean reaction times in the 60–69 age group (619.7 ms) than the 20–29 age group (392.0 ms). Unlike the blur condition, there was a significant main effect of degradation level, F(4,120) = 3.64, p = 0.008, η² = 0.11, BF₁₀ = 4.36. A trend analysis showed a significant linear trend, indicating that reaction times increased with increasing noise contrast (F(1,30) = 5.23, p = 0.029, η² = 0.15), and pairwise comparisons (with a Šidák-corrected alpha of 0.005) showed significantly slower reaction times in the 65% noise condition (517.6 ms) compared to the 50% noise condition (472.0 ms), t(31) = −3.29, p = 0.003, BF₁₀ = 14.46. All other pairwise comparisons did not reach significance (p > 0.01). Finally, the age group × noise level interaction was not significant (F(4,120) = 0.79, p = 0.53, η² = 0.02, BF₁₀ = 0.13).
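The two Šidák-corrected alphas quoted in the Results (0.01 for the per-level group comparisons and 0.005 for the pairwise comparisons among noise levels) follow from the standard Šidák formula. The sketch below assumes a family-wise alpha of 0.05, which is not stated explicitly in the text but reproduces both values: five comparisons (one per degradation level) and ten comparisons (all pairs of five levels).

```python
import math


def sidak_alpha(family_alpha, n_comparisons):
    """Per-comparison alpha that keeps the family-wise error rate at family_alpha."""
    return 1.0 - (1.0 - family_alpha) ** (1.0 / n_comparisons)


per_level = sidak_alpha(0.05, 5)                # one group comparison per level
pairwise = sidak_alpha(0.05, math.comb(5, 2))   # all 10 pairs of the 5 levels

print(round(per_level, 4), round(pairwise, 4))  # 0.0102 0.0051
```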
Together, these results point to fast lexical decision judgments (with a mean reaction time across age groups of 508.0 ms), with older adults responding more slowly than younger adults in both conditions by more than 200 ms on average. In addition, we observe longer reaction times with increasing noise contrast, indicating that, at least in some cases, reaction times were modulated by task difficulty.
Discussion
Two findings stand out from this experiment. First, in the absence of degradation, older and younger participants are both capable of performing our lexical decision task at a high level of accuracy, even if older participants are slower to do so. Second, and much more interestingly, degraded stimuli, whether blurred or presented with added noise, have a greater detrimental effect on legibility for older participants than for younger participants, and this change is best and most simply described as a horizontal shift of the function used to fit the data. This horizontal shift means that it is entirely possible, based on data collected under the low ambient illumination conditions used in this experiment, to simulate the difficulty an older observer has performing the task with a given stimulus and to give a younger observer an intuitive appreciation of the differences in their respective perceptions.