Method
Our complete method for both experiments, including the sampling plan and the reported statistical analyses, was preregistered at the Open Science Framework (OSF; https://osf.io/h7u68/). We analyzed our data using Bayesian t-tests (Rouder, Speckman, Sun, Morey, & Iverson, 2009). One advantage of Bayesian analyses is that data collection can be stopped as soon as the evidence reaches a predetermined threshold, without inflating error rates (Rouder, 2014; for a mathematical proof, see Deng, Lu, & Chen, 2016). We therefore planned to collect data in increments of 40 people, stopping when either 1) the Bayes factor supported the null or alternative hypothesis by a factor of 3 or greater or 2) we had collected data from 200 people.
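For concreteness, the sketch below implements this sequential stopping rule in Python. It computes the JZS Bayes factor for a paired t-test by numerical integration, following Rouder et al. (2009); it assumes the commonly used default Cauchy prior scale of r = 0.707, and the function and variable names are ours rather than part of the preregistration.

```python
import numpy as np
from scipy import integrate

def jzs_bf10(t, n, r=0.707):
    """JZS Bayes factor (BF10) for a paired/one-sample t-test,
    following Rouder, Speckman, Sun, Morey, & Iverson (2009)."""
    nu = n - 1
    # Marginal likelihood of the data under H0 (standardized effect = 0)
    h0 = (1 + t**2 / nu) ** (-(nu + 1) / 2)
    # Marginal likelihood under H1: Cauchy(0, r) prior on effect size,
    # written as a scale mixture of normals and integrated over g
    def integrand(g):
        return ((1 + n * g) ** -0.5
                * (1 + t**2 / ((1 + n * g) * nu)) ** (-(nu + 1) / 2)
                * r / np.sqrt(2 * np.pi) * g**-1.5 * np.exp(-r**2 / (2 * g)))
    h1, _ = integrate.quad(integrand, 0, np.inf)
    return h1 / h0

def keep_sampling(diff_scores, bound=3.0, n_max=200):
    """Sequential rule: after each batch of 40 participants, continue
    only if BF10 is still between 1/3 and 3 and n is below 200."""
    n = len(diff_scores)
    t = np.mean(diff_scores) / (np.std(diff_scores, ddof=1) / np.sqrt(n))
    bf = jzs_bf10(t, n)
    return not (bf >= bound or bf <= 1 / bound or n >= n_max)
```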
Participants
We initially collected data from 120 people recruited through Amazon's Mechanical Turk service. We excluded participants who 1) did not complete every phase of the experiment, 2) started the experiment multiple times, 3) reported experiencing technical problems, 4) did not indicate that they were fluent in English, or 5) reported having seen our stimuli before, leaving a final sample of 97 participants.
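Operationally, these exclusions amount to a conjunction of Boolean filters over the raw data. A minimal pandas sketch is given below; the file and column names are hypothetical placeholders, not the names used in our actual data files.

```python
import pandas as pd

# Hypothetical file and column names, for illustration only
df = pd.read_csv("experiment1_raw.csv")
keep = (df["completed_all_phases"]        # 1) finished every phase
        & (df["n_starts"] == 1)           # 2) started only once
        & ~df["technical_problems"]       # 3) no technical problems
        & df["fluent_in_english"]         # 4) fluent in English
        & ~df["seen_stimuli_before"])     # 5) stimuli were novel
df_clean = df[keep]
```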
Design
We used a within-subjects design with a single two-level factor: AV quality (fluent vs. disfluent).
Stimuli
Stimuli were four simulated video interviews, each featuring a different actor. All actors were filmed in the same location. The actors were a Caucasian female, an Indian male, an Asian female, and an African-American male. We made two versions of each video: a fluent version, kept at maximum AV quality, and a disfluent version, edited in Final Cut Pro X so that both the visual and the sound quality were degraded (these videos are also available at https://osf.io/h7u68/). Visual quality was degraded by adding freeze frames, which simulated the picture freezing during the interview, and by applying a filter that distorted the light balance. Sound quality was degraded with a high-pass audio filter with a cutoff frequency of 6,900 Hz and a resonance of 0. (In-video volume was increased to partially counteract the volume difference between the fluent and disfluent videos.) The audio feed never paused, so participants were able to hear every word spoken in the video, but there was background static noise. The durations of the videos were 105, 116, 156, and 173 s. Most actual interviews are not this brief, but impressions formed in a few seconds often closely match impressions formed over the course of hours (Ambady & Rosenthal, 1992). The fluent and disfluent versions of each video were identical in duration.
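The audio edit was performed in Final Cut Pro X, but readers who want to reproduce a comparable degradation programmatically could use something like the sketch below. It applies a Butterworth high-pass filter at 6,900 Hz (a resonance of 0 implies no gain peak at the cutoff, which a Butterworth design approximates) followed by a gain boost; the file paths and gain value are placeholders, not the settings used in our editing software.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import butter, sosfilt

def degrade_audio(in_path, out_path, cutoff_hz=6900.0, gain=2.0, order=4):
    """Rough analogue of the Final Cut Pro X edit: high-pass filter
    the track at cutoff_hz, then boost volume to partially offset
    the loudness lost to filtering."""
    rate, audio = wavfile.read(in_path)
    sos = butter(order, cutoff_hz, btype="highpass", fs=rate, output="sos")
    filtered = sosfilt(sos, audio.astype(np.float64), axis=0)
    filtered *= gain  # placeholder compensation, tuned by ear in practice
    np.clip(filtered, -32768, 32767, out=filtered)
    wavfile.write(out_path, rate, filtered.astype(np.int16))
```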
Procedure
Participants were told that they would watch segments from four interviews for a legal secretary position and that they would rate the candidates after watching all of the videos. They were not told that AV quality would vary between videos. The videos were presented in the same order for every participant. The fluency of the videos was randomly assigned to one of two predetermined arrangements: 1) the first and last videos were disfluent or 2) the middle two videos were disfluent.
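Concretely, this assignment reduces to a single random choice between two fixed patterns per participant, as in the brief sketch below (labels are ours, for illustration).

```python
import random

# The two predetermined fluency arrangements; video order is fixed
ARRANGEMENTS = {
    "outer_disfluent": ["disfluent", "fluent", "fluent", "disfluent"],
    "inner_disfluent": ["fluent", "disfluent", "disfluent", "fluent"],
}

def assign_fluency():
    """Randomly pick one of the two arrangements for a participant."""
    return ARRANGEMENTS[random.choice(list(ARRANGEMENTS))]
```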
We sought to ensure that participants were paying attention in two ways. First, a button labeled "Press me now" periodically appeared onscreen as the videos played; participants were instructed to click this button as quickly as possible. Second, immediately after each video, participants answered three basic questions about the candidate's responses (e.g., "Where did the candidate say they attended college?").
After all of the videos had been viewed, participants rated how hirable each candidate was on a scale from 1 (I would never hire this person) to 10 (I would certainly hire this person). Ratings were made in the same order in which the interviews had been viewed. Participants then cycled through all of the candidates again, rating each on likeability from 1 (not at all likeable) to 10 (extremely likeable).
Results and Discussion
As noted previously, we analyzed our data using Bayesian t-tests (Rouder et al., 2009). We report Bayes factors in terms of support for the alternative hypothesis (BF10): a BF10 greater than 1 indicates support for the alternative hypothesis, and a value less than 1 indicates support for the null. We consider values greater than or equal to 3 (or less than or equal to 0.33) as offering convincing evidence for the alternative (or null) hypothesis. In our analyses, a BF10 ≥ 3 always corresponds to a p < 0.05.
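In code, this evidence rule is a simple three-way classification; the sketch below applies it to the Bayes factors reported in the next paragraph (the function name is ours).

```python
def interpret_bf10(bf10, bound=3.0):
    """Classify a Bayes factor under the evidence thresholds used here."""
    if bf10 >= bound:
        return "convincing evidence for the alternative hypothesis"
    if bf10 <= 1 / bound:
        return "convincing evidence for the null hypothesis"
    return "inconclusive"

print(interpret_bf10(5.62))  # hirability result -> alternative
print(interpret_bf10(0.17))  # likeability result -> null
```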
Employability and likeability ratings in each condition are presented in Fig. 1a and b, respectively. Candidates in fluent videos were rated as more hirable (M = 6.91, SD = 1.46) than candidates in disfluent videos (M = 6.31, SD = 1.69), BF10 = 5.62, Cohen's d = 0.42. Responses to the likeability question for fluent videos (M = 6.95, SD = 1.60) compared with disfluent videos (M = 6.77, SD = 1.74) supported the null hypothesis, BF10 = 0.17. In short, Experiment 1 demonstrated an AV quality bias: candidates in disfluent videos were rated as less hirable.