How the wisdom of crowds, and of the crowd within, are affected by expertise

Fiechter, Joshua L.; Kornell, Nate

doi:10.1186/s41235-021-00273-6

Brief report
Open access
Published: 05 February 2021

Cognitive Research: Principles and Implications volume 6, Article number: 5 (2021)

Abstract

We investigated the effect of expertise on the wisdom of crowds. Participants completed 60 trials of a numerical estimation task, during which they saw 50–100 asterisks and were asked to estimate how many stars they had just seen. Experiment 1 established that both inner- and outer-crowd wisdom extended to our novel task: Single responses alone were less accurate than responses aggregated across a single participant (showing inner-crowd wisdom) and responses aggregated across different participants were even more accurate (showing outer-crowd wisdom). In Experiment 2, prior to beginning the critical trials, participants did 12 practice trials with feedback, which greatly increased their accuracy. There was a benefit of outer-crowd wisdom relative to a single estimate. There was no inner-crowd wisdom effect, however; with high accuracy came highly restricted variance, and aggregating insufficiently varying responses is not beneficial. Our data suggest that experts give almost the same answer every time they are asked and so they should consult the outer crowd rather than solicit multiple estimates from themselves.

The average value of multiple estimates tends to be more accurate than any one single estimate; this phenomenon is known as the wisdom of the crowd (Surowiecki 2004). Galton (1907) published the first demonstration of the wisdom of the crowd. He analyzed responses from a weight-estimation game wherein people were trying to estimate the weight of an ox “after being slaughtered and dressed.” The mean estimate of all participants was 1197 lb; a re-analysis of Galton's notes showed that the correct weight of the ox was 1197 lb, meaning the crowd had perfectly assessed the weight (Wallis 2014).

Subsequent work has extended wisdom of the crowd to geopolitical forecasts (Mellers et al. 2014, 2016, 2017; Turner et al. 2014), probability estimates (Ariely et al. 2000; Lee and Danileiko 2014), ordering problems (e.g., the order of U.S. Presidents; Steyvers et al. 2009), forced-choice questions (Bennett et al. 2018), and tasks involving the coordination of multiple pieces of information, such as picking the most efficient path through a predetermined ordering of points (Yi et al. 2012). Furthermore, crowd wisdom has been observed in populations whose cognitive abilities are more limited than those of human adults, including young adolescents (Ioannou et al. 2018) and nonhuman animals (Ioannou 2017).

Remarkably, the benefits of averaging estimates hold even when those estimates come from the same person; this effect is called the wisdom of the inner crowd (see Herzog and Hertwig 2014a, for a review; see Ariely et al. 2000, for boundary conditions on the inner crowd). For example, Vul and Pashler (2008) asked participants eight general knowledge questions, all of which required an estimate of a percentage (e.g., What percentage of the world's airports are in the United States?). Participants were then unexpectedly asked all eight questions again, either immediately or three weeks later. The average of both guesses was more accurate than either the first or second guess alone, especially for the participants who waited three weeks between guesses.

The wisdom of the inner crowd has been observed with percentage estimation (Fraundorf and Benjamin 2014; Herzog and Hertwig 2014b; Hourihan and Benjamin 2010; Müller-Trede 2011; Steegen et al. 2014), numerical general knowledge estimation (Rauhut and Lorenz 2011; but see Müller-Trede 2011), date estimation (Herzog and Hertwig 2009; Müller-Trede 2011), and quantity estimation (i.e., guessing the number of objects in a container; van Dolder and van den Assem 2017). The benefits of delaying a subsequent guess have also replicated (Steegen et al. 2014; van Dolder and van den Assem 2017).

Crowd variance and crowd wisdom

Following previous studies (e.g., Page 2007; Rauhut and Lorenz 2011; van Dolder and van den Assem 2017) we will focus on three derived values to assess crowd wisdom: (1) bias, or the squared distance from the crowd’s mean to the true value; (2) mean squared error (MSE), or the average squared distance between each estimate and the true value; and (3) variance, or the average squared distance between each estimate and the crowd’s mean (see Table 1 for an example of how these values are calculated). Bias indicates the error of a crowd and MSE indicates the error of an average individual estimate; thus, crowd wisdom can be defined as MSE − bias.^{Footnote 1} Page (2007) demonstrated that variance = MSE − bias. He called this fact the diversity prediction theorem: the wisdom of a crowd is determined by the variance of its responses.
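
The three values defined above can be checked numerically. The following is a minimal Python sketch, not the authors' code; the function name `crowd_stats` and the three example estimates are hypothetical and chosen only to illustrate the identity variance = MSE − bias.

```python
from statistics import fmean

def crowd_stats(estimates, truth):
    """Return (bias, mse, variance) using the definitions in the text."""
    crowd_mean = fmean(estimates)
    # Bias: squared distance from the crowd's mean to the true value.
    bias = (crowd_mean - truth) ** 2
    # MSE: average squared distance between each estimate and the true value.
    mse = fmean([(x - truth) ** 2 for x in estimates])
    # Variance: average squared distance between each estimate and the crowd's mean.
    variance = fmean([(x - crowd_mean) ** 2 for x in estimates])
    return bias, mse, variance

# Hypothetical crowd of three estimates of a true value of 50.
bias, mse, variance = crowd_stats([40, 50, 45], truth=50)
print(bias, mse, variance, mse - bias)
```

For this hypothetical crowd the mean is 45, so the bias is 25, and the variance equals MSE − bias up to floating-point rounding, as the diversity prediction theorem requires.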

Table 1 Example of two crowds, each composed of three estimates

The diversity prediction theorem (Page 2007) provides a convenient conceptualization of the findings discussed so far. First, inner- and outer-crowd wisdom will be evident so long as estimates vary to a sufficiently large degree. Second, the benefit of spacing estimates from the inner crowd (e.g., Vul and Pashler 2008) arises from the fact that estimates will be less correlated, and therefore more varied, when more time has passed between those estimates. We tested an additional implication of the diversity prediction theorem that has received no previous empirical testing (but see Hong and Page (2004), for relevant simulations): crowd wisdom might suffer under conditions in which people have expertise. The reasoning behind this claim is that experts may tend to rely on the same information, either between or within individuals, and therefore will produce an insufficiently varied set of estimates.

The present experiments evaluated expertise and the wisdom of the inner and outer crowds in a novel numerosity estimation task (adapted from Kornell and Hausman 2017). We chose this task for multiple reasons: First, people tend to produce inaccurate estimates in such tasks (Minturn and Reese 1951), primarily underestimating the number of items displayed (Indow and Ida 1977; Izard and Dehaene 2008; Krueger 1982, 1984); second, people can be quickly trained to calibrate their estimates (Izard and Dehaene 2008; Krueger 1984; Lipton and Spelke 2005); third, regarding the inner crowd, it is possible to ask the same question multiple times without a long delay by showing the same number of items but arranging them in different configurations. We hoped that these properties would enhance our prospects of observing the effect of expertise on the inner and outer crowd. In Experiment 1 we evaluated whether the wisdom of the inner and outer crowd extended to our novel task. In Experiment 2 we asked whether crowd wisdom persisted after we made our participants experts at the task via training trials.

Experiment 1

The goal of Experiment 1 was to replicate the wisdom of the inner and outer crowd in our novel numerosity estimation task.

Methods

Participants

Participants were 63 people recruited from Amazon's Mechanical Turk service. All participants were paid $1.00 to complete the experiment; pay did not reflect performance on the task. Previous attempts at replicating laboratory findings on Mechanical Turk have generally been successful (e.g., Crump et al. 2013) and so we felt that our participants would be motivated to perform well even with a flat pay rate. We collected data from 70 people, anticipating that we would obtain usable data from approximately 60 of them.^{Footnote 2} We did not analyze data from participants who began the experiment multiple times, did not report being fluent in English, reported experiencing technical difficulties, or reported having seen our stimuli before.

Procedure

Participants viewed a box containing asterisks (*) on a computer screen (see Fig. 1). They completed 60 trials in total, ten trials each of six different set sizes (50, 60, 70, 80, 90, or 100 stars). The order of these 60 trials was randomly determined for each participant and the positioning of the stars was randomly determined on each trial. Participants viewed the star-filled box for 2 s; the box was then removed from the screen and participants were asked to estimate the number of stars present in the box.

Dependent variable

Response accuracy is typically measured using mean squared error (i.e., MSE) and squared error (i.e., bias). In this study, however, errors tended to be larger for larger set sizes. To eliminate this noise we converted the error of each estimate into a proportion of the true value being estimated (Rauhut and Lorenz 2011; van Dolder and van den Assem 2017). For example, a response of 55 or 45 when there were 50 stars was coded as 0.10 or −0.10, respectively. We calculated our dependent variable, mean squared proportional-error (MSE_P),^{Footnote 3} based on these proportions. The MSE_P would be 0.01 for both 0.10 and −0.10. Unlike previous studies of numerical estimates (Rauhut and Lorenz 2011; van Dolder and van den Assem 2017), we did not log-transform participants' responses. This decision arose from the fact that participants' responses were skewed in Experiment 1 but not in Experiment 2; we therefore elected not to transform responses for either experiment in order to keep the results from our experiments comparable with one another.
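
The worked example in this paragraph can be reproduced with a short Python sketch. The function names `proportional_error` and `mse_p` are hypothetical, and averaging over a plain list of estimates is a simplifying assumption made for illustration.

```python
def proportional_error(estimate, truth):
    """Signed error expressed as a proportion of the true value."""
    return (estimate - truth) / truth

def mse_p(estimates, truth):
    """Mean of the squared proportional errors of a set of estimates."""
    squared = [proportional_error(e, truth) ** 2 for e in estimates]
    return sum(squared) / len(squared)

# The example from the text: responses of 55 and 45 to 50 stars.
print(proportional_error(55, 50), proportional_error(45, 50))
print(mse_p([55], 50), mse_p([45], 50))
```

Because the error is squared, over- and underestimates of the same proportional size contribute equally to MSE_P.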

Constructing the inner crowds

Inner crowds were compiled by aggregating across the first through tenth estimates, separately for each set size, within each individual.

Constructing the outer crowds

Outer crowds were generated by selecting each of our 63 participants and randomly grouping them with 9 other participants. This process gave us 63 crowds of 10 people. The estimates were aggregated by set size within each crowd; only first estimates from each set size were used in the outer crowds. The order in which estimates were added to the aggregate was randomly determined, with the constraint that each participant served as the first guess in the one crowd to which they were systematically assigned; this random order was consistent across set sizes for each crowd. (We chose to aggregate in a random order because there was no principled means of ordering participants; by contrast, for the inner crowd, responses were aggregated in chronological order.)
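
The crowd construction described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' procedure: the names `build_outer_crowds` and `running_means`, the integer participant IDs, and the fixed random seed are all hypothetical.

```python
import random

def build_outer_crowds(participant_ids, crowd_size=10, seed=0):
    """Build one crowd per participant: that participant first, then
    randomly chosen others in a random (but fixed) order."""
    rng = random.Random(seed)  # assumption: seeded only for reproducibility
    crowds = []
    for pid in participant_ids:
        others = [p for p in participant_ids if p != pid]
        # rng.sample draws without replacement and returns a random order.
        companions = rng.sample(others, crowd_size - 1)
        crowds.append([pid] + companions)
    return crowds

def running_means(estimates):
    """Aggregate estimates one at a time, returning the mean after each."""
    means, total = [], 0.0
    for t, value in enumerate(estimates, start=1):
        total += value
        means.append(total / t)
    return means

crowds = build_outer_crowds(list(range(63)))
print(len(crowds), crowds[0])
print(running_means([40, 60, 50]))
```

The same `running_means` helper would apply to an inner crowd by passing one participant's estimates for a set size in chronological order, and to an outer crowd by passing the members' first estimates in the crowd's stored order.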

Data analysis

Rather than calculate values of variance and bias directly from the data, as is done in Table 1, we instead estimated those values by fitting a mixed-effects nonlinear model to each crowd type and assessing the resulting parameter estimates. Specifically, we fit the hyperbolic function a/t + b to our observed values of MSE_P (see Rauhut and Lorenz 2011), where t is the number of estimates being aggregated (which is also the trial number for the inner crowd), a is the estimated variance of a set of responses, and b is the estimated bias (i.e., the asymptotic performance) of a set of responses. Note that in our analyses estimates of a and b are in terms of squared proportional deviance because that is the scale of our dependent variable, MSE_P. Because crowd wisdom is equal to the variance of a crowd's responses, a also serves as an estimate of crowd wisdom (Page 2007). For both parameters, we included group-level effects for each participant (or outer crowd) and each set size. We furthermore allowed for a nested structure between set sizes and participants (or outer crowds) to reflect the fact that each set size was estimated multiple times by a given person (or outer crowd).
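
The shape of the fitted function follows from a standard result: if the t aggregated estimates were independent draws with variance a and squared bias b, the expected squared error of their mean would be a/t + b. The sketch below recovers a and b from a curve by ordinary least squares on 1/t. It is a deliberately simplified stand-in, not the Bayesian mixed-effects model reported here: it has no group-level effects, no priors, and no constraint keeping a and b non-negative. The function name `fit_curve` and the noise-free input values are hypothetical.

```python
def fit_curve(mse_by_t):
    """Least-squares fit of y = a * (1/t) + b, where mse_by_t[0] is t = 1."""
    xs = [1.0 / t for t in range(1, len(mse_by_t) + 1)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(mse_by_t) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, mse_by_t))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    a = sxy / sxx          # slope on 1/t: the variance-like parameter
    b = y_bar - a * x_bar  # intercept: the asymptote as t grows
    return a, b

# Hypothetical noise-free curve generated from a = 0.04 and b = 0.01.
observed = [0.04 / t + 0.01 for t in range(1, 11)]
a_hat, b_hat = fit_curve(observed)
print(a_hat, b_hat)
```

Because the model is linear in 1/t, the fit reduces to simple linear regression, which is why a is read as the benefit available from aggregation and b as the error that remains no matter how many estimates are averaged.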

We fit these models using Bayesian parameter estimation in the "brms" package in R statistical software (Bürkner 2019). We placed a half-normal prior with a mean of 0 and a standard deviation of 0.5 on the a and b parameters. We used bounded priors because neither parameter would be interpretable if it were estimated to be negative.

We used Bayesian hypothesis testing to analyze the population-level parameter estimates from our model. Specifically, we obtained Bayes factors by calculating a Savage-Dickey ratio, which is the ratio of the zero-point-densities of the posterior and prior distributions for a given parameter (Wagenmakers et al. 2010). We will report Bayes factors in terms of the alternative hypothesis, BF₁₀. We will consider evidence convincing when values are either 3 or greater (for the alternative) or 0.33 or less (for the null).
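
As a rough illustration of the Savage-Dickey ratio, the sketch below computes BF₁₀ as the prior density at zero divided by the posterior density at zero. The prior matches the half-normal described above (standard deviation 0.5). The posterior, however, is approximated by a normal distribution truncated at zero purely as an assumption for this example; in practice the posterior density would be estimated from the model's posterior samples. All function names and the posterior mean and standard deviation are hypothetical.

```python
from statistics import NormalDist

def half_normal_density_at_zero(sd):
    """Density at 0 of a half-normal prior: twice the normal density,
    because the negative half is folded onto the positive half."""
    return 2.0 * NormalDist(0.0, sd).pdf(0.0)

def truncated_normal_density_at_zero(mean, sd):
    """ASSUMPTION: posterior approximated by a normal truncated to [0, inf)."""
    dist = NormalDist(mean, sd)
    return dist.pdf(0.0) / (1.0 - dist.cdf(0.0))

def savage_dickey_bf10(prior_at_zero, posterior_at_zero):
    """Evidence for the alternative: prior density over posterior density at 0."""
    return prior_at_zero / posterior_at_zero

prior0 = half_normal_density_at_zero(0.5)
# Hypothetical posterior summary, not a value from this study.
posterior0 = truncated_normal_density_at_zero(mean=0.05, sd=0.01)
print(savage_dickey_bf10(prior0, posterior0))
```

When the posterior places little mass near zero, its density at zero falls below the prior's and BF₁₀ exceeds 1; when the data pull the posterior toward zero, the ratio drops below 1 and favors the null.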

Results

Before analyzing the data from Experiment 1, we removed all estimates that were at least one order of magnitude greater or smaller than the correct answer (e.g., an estimate of 5 or 500 when viewing 50 asterisks), because they seemed more likely to be typos than sincere estimates. Only 12 of 3780 estimates were removed.
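
The exclusion rule above can be expressed as a simple filter. This is a minimal Python sketch; the function name `is_plausible` and the sample responses are hypothetical.

```python
def is_plausible(estimate, truth):
    """Keep an estimate only if it is strictly within one order of
    magnitude of the true value (so truth/10 and truth*10 are removed)."""
    return truth / 10 < estimate < truth * 10

# Hypothetical responses to a display of 50 asterisks.
responses = [5, 42, 55, 500, 61]
kept = [r for r in responses if is_plausible(r, 50)]
print(kept)
```

The strict inequalities matter: the text excludes estimates that are "at least" one order of magnitude off, so the boundary values 5 and 500 are dropped for a display of 50.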

The population-level hyperbolas estimated by our mixed-effects models are presented in Fig. 2a. Parameter estimates from the models are presented in Table 2. We found very strong evidence in favor of nonzero values of a and b for both the inner and outer crowd. These findings mean, respectively, that both crowd types benefitted from response aggregation and both crowd types were biased away from the true values being estimated.

Table 2 Means and standard deviations of posterior distributions from our two-parameter nonlinear model

Discussion

Experiment 1 demonstrated that the wisdom of the inner and outer crowd extended to our numerosity estimation task. To our knowledge, this experiment is the first demonstration of inner-crowd wisdom for numerosity estimation in the context of an experimentally controlled design (see van Dolder and van den Assem 2017, for an observational study). It appears that individuals rely on a process akin to sampling from an inner distribution when making numerosity judgments, thereby allowing those judgments to benefit from estimate aggregation—in this case, even without a long delay between estimates (see Vul and Pashler 2008).

Experiment 2

In Experiment 2, we sought to extend our findings by assessing the impact of expertise on crowd wisdom. We did so by giving our participants a short set of training trials prior to beginning the critical trials. Numerosity estimation tends to become substantially more accurate after training (e.g., Izard and Dehaene 2008). However, training might also overly constrain the variance of estimates. This restricted variance could result in redundant information being added to the aggregate, in which case crowd wisdom would subsequently suffer.