Does the “surprisingly popular” method yield accurate crowdsourced predictions?

Rutchick, Abraham M.; Ross, Bryan J.; Calvillo, Dustin P.; Mesick, Catherine C.

doi:10.1186/s41235-020-00256-z

Brief report
Open access
Published: 11 November 2020

Cognitive Research: Principles and Implications volume 5, Article number: 57 (2020)

Abstract

The “surprisingly popular” method (SP) of aggregating individual judgments has shown promise in overcoming a weakness of other crowdsourcing methods—situations in which the majority is incorrect. This method relies on participants’ estimates of other participants’ judgments; when an option is chosen more often than the average metacognitive judgments of that option, it is “surprisingly popular” and is selected by the method. Although SP has been shown to improve group decision making about factual propositions (e.g., state capitals), its application to future outcomes has been limited. In three preregistered studies, we compared SP to other methods of aggregating individual predictions about future events. Study 1 examined predictions of football games, Study 2 examined predictions of the 2018 US midterm elections, and Study 3 examined predictions of basketball games. When applied to judgments made by objectively assessed experts, SP performed slightly better than other aggregation methods. Although there is still more to learn about the conditions under which SP is effective, it shows promise as a means of crowdsourcing predictions of future outcomes.

Significance statement

When judgments are combined, the result is often more accurate than the individual judgments by themselves, a phenomenon known as the “wisdom of the crowd.” For example, the average estimate (in a classic case, of the weight of an ox) is more accurate than most individual estimates, and the Las Vegas point spread, which is driven by gamblers’ collective decisions, often accurately predicts the winners of games. Sometimes, however, the majority is wrong. For example, most people erroneously believe that Los Angeles, California, is west of Reno, Nevada, and most people incorrectly predicted the 2016 US Presidential election. A recently developed approach, the “surprisingly popular” method, constructs group predictions such that minority opinions can influence the collective choice. When people are correct but in the minority, they often know that many others do not know the correct answer. This knowledge can be leveraged by asking people to make one additional judgment: the percentage of other participants who will make the same judgment they did. Integrating this judgment into the group decision allows the minority choice to sometimes be selected. However, this method has usually been applied to factual judgments, such as knowledge of state capitals, and has only rarely been applied to predictions of future events. We applied this method to predictions of football games, US elections, and basketball games. We found that the “surprisingly popular” method indeed yielded the most accurate collective predictions, but only when the people making the predictions were knowledgeable about the subject.

Introduction

Aggregated judgments often outperform the judgments of individuals. This phenomenon, often termed “the wisdom of crowds” (Surowiecki 2005), has been shown in many decision and prediction contexts, including mathematical problems (Yi et al. 2012), game shows (Lee et al. 2011), and elections (Gaissmaier and Marewski 2011). However, most crowdsourcing methods have an important limitation: they cannot detect cases in which the majority is wrong.

One recently developed aggregation approach, the “surprisingly popular” method (Prelec et al. 2017; henceforth “SP”), has shown promise in overcoming this weakness. The SP method leverages metacognitive awareness: people who are correct, but in the minority, often know that their response is rare. Participants answer one additional question: the percentage of other participants who will make the same judgment they did. These estimates are then compared to participants’ actual judgments. When an option is chosen more often than the average metacognitive judgments of that option, it is “surprisingly popular” and is selected by the method.

For example, suppose participants are asked whether Reno, Nevada, is east of Los Angeles, California. Because most of Nevada is east of most of California, people often respond that Reno is east of Los Angeles. This is incorrect; Reno is some 86 miles west of Los Angeles. Suppose that 30% of people know this. They also—importantly—know that this knowledge is rare, and estimate, on average, that 15% of others are also correct. Now consider the 70% of people who are incorrect; suppose that they believe, on average, that 90% of others agree with their answer. Thus, although the average metajudgment was that only 11.5%^{Footnote 1} of people believe that Reno is west of Los Angeles, that answer was actually given by 30% of respondents, making it “surprisingly popular.”
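The arithmetic in this example can be checked with a short Python sketch. The variable names are illustrative and do not come from the original study. The sketch assumes that an incorrect respondent who expects 90% of others to agree implicitly expects the remaining 10% to answer “west.”

```python
# Reno / Los Angeles example, using only the proportions given in the text.
# Variable names are illustrative assumptions, not part of the original study.

share_west = 0.30  # respondents answering "west" (correct)
share_east = 0.70  # respondents answering "east" (incorrect)

# Each group's mean estimate of the share of others answering "west".
west_group_estimate = 0.15        # "west" respondents expect 15% agreement
east_group_estimate = 1.0 - 0.90  # "east" respondents expect 90% agreement

# Average metajudgment about "west", weighted by the size of each group.
predicted_west = (share_west * west_group_estimate
                  + share_east * east_group_estimate)

# An option is "surprisingly popular" when its actual share
# exceeds the share that respondents predicted for it.
west_is_surprisingly_popular = share_west > predicted_west

print(f"predicted share for 'west': {predicted_west:.3f}")
print(f"actual share for 'west':    {share_west:.3f}")
print(f"surprisingly popular:       {west_is_surprisingly_popular}")
```

Because the question is binary, a surplus for “west” implies an equal shortfall for “east,” so only one option needs to be checked.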

Most demonstrations of the SP method have examined judgments in which the correct answer is known. Although improving the accuracy of such judgments may inform understanding of judgments about as-yet-unsolved questions, it does not necessarily follow that improvements in problem solving imply improvements in prediction. Leveraging the SP method to improve prediction of future events is a particularly exciting potential application of this approach.

Lee et al. (2018) provided the first test of whether the SP method can improve collective judgments of unknown events—that is, future outcomes. Lee et al. (2018) had participants predict the winners of National Football League (NFL) games in the 2017 season. They found that, among participants who indicated that they were “extremely knowledgeable” about football, the SP method yielded better predictions than many NFL media figures, an alternative aggregative method (confidence-weighted judgments), and a prominent algorithmic approach to prediction (by fivethirtyeight.com). However, SP was inferior to the democratic method (the modal judgment). Given these mixed results, Lee et al. (2018) were appropriately cautious in their conclusions. First, they noted that participants were capable of easily making metacognitive judgments about future events, as they are in the case of factual judgments. Second, they emphasized the importance of expertise in yielding accurate predictions using the SP method. However, several important questions remain unanswered.

First, does the SP method actually yield more accurate predictions than other aggregation methods? Examining Lee et al. (2018), the most straightforward implication is that SP does not clearly outperform other approaches. Nevertheless, it may be that the particular NFL season examined by Lee et al. (2018) is not representative of future events, sporting events, or even NFL seasons. Thus, it remains useful to provide additional tests of the SP method.

Second, does the SP method perform better when it aggregates judgments made by experts? Prelec et al. (2017) did not find systematic differences in the effectiveness of the SP method based on expertise. In contrast, Lee et al. (2018) found that the SP method was more effective when it aggregated only judgments made by self-assessed experts. However, this selection decision was exploratory (Lee et al. 2018, p. 326), and moreover, self-assessments of expertise are not always accurate (Kruger and Dunning 1999).

To examine these questions, we conducted three studies in which participants predicted future outcomes. We compared the SP method to other methods of aggregating crowdsourced judgments and also assessed expertise by testing domain knowledge. Study 1 examined predictions of NFL games made by students, Study 2 examined predictions of the 2018 midterm elections made by mTurk workers, and Study 3 examined predictions of NBA games made by members of the /r/NBA and /r/sportsbook subreddits and students in a sport psychology course. We hypothesized that the SP method, when applied to judgments made by experts, would yield more accurate forecasts than other crowdsourcing approaches.

Study 1

In Study 1, we replicated Lee et al. (2018), with two important refinements. First, we preregistered our procedure for selecting experts and our decision to analyze experts separately (https://osf.io/u9k72/; preregistrations, materials, and data for all studies can be found there). Second, we included an objective method of assessing expertise.

Method

Participants

Participants were recruited from a psychology course and were compensated with extra credit. All participants (N = 227) completed a survey at the outset of the NFL season; 205 made at least one prediction. The maximum number of participants in a week was 161; the minimum was 121.

Materials and procedure

All participants completed a survey with demographics, a self-evaluation of NFL knowledge on a 5-point scale (from “not knowledgeable at all,” to “extremely knowledgeable,” per Lee et al. 2018), and a 31-question NFL knowledge questionnaire (based on Van Overschelde et al. 2005); the questionnaire is available on the OSF page.

Each Tuesday during the NFL season, participants received a survey presenting that week’s NFL games in chronological order. Participants predicted the winner of each game, indicated their confidence in that prediction on a 1 (guess) to 5 (very high confidence) scale, and estimated the percentage of other participants who agreed with their prediction.

Differences from Lee et al. (2018)

As noted above, Study 1 included an objective measure of expertise. Study 1 also differed from Lee et al. (2018) in two other ways. First, it used a student sample rather than mTurk workers. Second, it covered only weeks 1 through 15 of the 2018 NFL season, because student participants were unavailable after the semester ended, whereas Lee et al. (2018) covered the entire 17-week 2017 NFL season. For eight games, the incorrect team was listed as the home team; these games were excluded, leaving 216 games for analysis.

Results and discussion

Aggregation approaches

Predictions were aggregated using three methods (following Lee et al. 2018). First, the democratic method selected the team that was predicted to win by the most participants. Continuing the example discussed previously, because 70% of people believed that Reno is east of Los Angeles, the democratic method would select this (incorrect) response. Second, the confidence-weighted method multiplied each prediction by its associated confidence and selected the team with the highest weighted prediction count. Suppose that, in the Reno/Los Angeles example, the 30% of people correctly responding that Reno was west of Los Angeles had a mean confidence of 4.5 in their choice, whereas the remaining 70% had a mean confidence of 3.5. Here, the confidence-weighted method would select the incorrect response.^{Footnote 2} Last, the surprisingly popular (SP) method, described previously, identified the option chosen more often by participants than it was estimated to be chosen. When a method resulted in a tie, 0.5 correct predictions were awarded.
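The three aggregation rules and the tie-scoring convention can be illustrated with a minimal Python sketch for a single two-option question. This is not the authors' analysis code: the `Response` record, the function names, and the 20 hypothetical respondents are assumptions chosen to reproduce the proportions and mean confidences of the Reno/Los Angeles example.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    choice: str       # option the respondent selected
    confidence: int   # 1 (guess) to 5 (very high confidence)
    agreement: float  # estimated share of others making the same choice (0 to 1)


def _pick(score_a: float, score_b: float, a: str, b: str) -> Optional[str]:
    """Return the option with the higher score, or None for a tie."""
    if math.isclose(score_a, score_b, abs_tol=1e-12):
        return None
    return a if score_a > score_b else b


def aggregate(responses: list[Response], a: str, b: str) -> dict:
    n = len(responses)

    # Democratic: the option chosen by the most respondents.
    votes_a = sum(r.choice == a for r in responses)
    democratic = _pick(votes_a, n - votes_a, a, b)

    # Confidence-weighted: each choice counts in proportion to its confidence.
    conf_a = sum(r.confidence for r in responses if r.choice == a)
    conf_b = sum(r.confidence for r in responses if r.choice == b)
    confidence_weighted = _pick(conf_a, conf_b, a, b)

    # Surprisingly popular: compare the actual share of option a with the
    # mean predicted share of option a. A respondent who chose b and expects
    # agreement p is assumed to expect a share of 1 - p for option a.
    actual_a = votes_a / n
    predicted_a = sum(
        r.agreement if r.choice == a else 1.0 - r.agreement for r in responses
    ) / n
    surprisingly_popular = _pick(actual_a, predicted_a, a, b)

    return {
        "democratic": democratic,
        "confidence_weighted": confidence_weighted,
        "surprisingly_popular": surprisingly_popular,
    }


def score(pick: Optional[str], outcome: str) -> float:
    """1 for a correct pick, 0.5 for a tie, 0 otherwise."""
    if pick is None:
        return 0.5
    return 1.0 if pick == outcome else 0.0


# Hypothetical respondents: 30% answer "west" (mean confidence 4.5, expecting
# 15% agreement); 70% answer "east" (mean confidence 3.5, expecting 90%).
responses = (
    [Response("west", c, 0.15) for c in (4, 4, 4, 5, 5, 5)]
    + [Response("east", c, 0.90) for c in (3,) * 7 + (4,) * 7]
)

picks = aggregate(responses, "west", "east")
scores = {method: score(pick, "west") for method, pick in picks.items()}
print(picks)
print(scores)
```

Representing a tie as `None` keeps the scoring rule separate from the aggregation rule, so the same `score` function applies to all three methods.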

In addition to the aggregation approaches, we collected the predictions made by members of the media (as recorded by nflpickwatch.com). The algorithmically generated predictions of fivethirtyeight.com served as an additional point of comparison.

Whole-sample analyses

Individual participants’ predictions were correct 52.8% of the time. The democratic method outperformed individual predictions only narrowly (53.9% correct; Table 1); the confidence-weighted method performed somewhat better (55.6%), and the SP method performed worse than individual predictions (51.4%). As shown in Fig. 1, the three aggregative methods made quite similar predictions, deviating little from one another. In contrast, the algorithmically generated predictions made by fivethirtyeight.com differed strikingly from the crowdsourced methods and were more accurate (62.0%).

Table 1 Performance (correct predictions, % correct) across aggregation methods

Self-assessed experts

Following the approach used by Lee et al. (2018), which we preregistered, participants who rated their knowledge of NFL football as “extremely knowledgeable” were considered self-assessed experts. There were only 6 such participants (2.6% of the sample); the number who made predictions ranged from 3 to 5. Because there were too few self-assessed experts to produce reliable crowdsourced predictions, we report results for this sample in Additional file 1.

Objectively assessed experts

Per the preregistration, participants scoring in the top quintile of the knowledge quiz were considered objectively assessed experts; 46 participants scored above the 80th percentile (22/30), with the number making predictions each week ranging from 27 to 37. Aggregations of objectively assessed experts outperformed the overall sample (and the subsample of self-assessed experts) regardless of the method used. This improvement in performance ranged from 9 games (4.2%) for the confidence-weighted method to 19 games (8.8%) for the SP method. Comparing across aggregative methods, the SP method performed best, although only narrowly. As shown in Fig. 2, the SP method made predictions that were relatively distinct from those made by the other two crowdsourced methods (which were quite similar to one another). As was the case in the whole sample, fivethirtyeight.com’s algorithmic predictions differed sharply from all three crowdsourced methods. Although the SP method was the most successful crowdsourced method, it was also the most distinct from the algorithmic method.
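The top-quintile selection rule described at the start of this subsection could be sketched as follows. The quiz scores below are made-up placeholder values, not study data, and the percentile definition (the `inclusive` method of Python's `statistics.quantiles`) is an assumption; the text does not say how the 80th percentile was computed.

```python
import statistics

# Hypothetical quiz scores keyed by participant ID (placeholder values only).
quiz_scores = {f"p{i}": i for i in range(1, 11)}

# quantiles(..., n=5) returns the four cut points between quintiles;
# the last one is the 80th percentile under the chosen method.
cutoff = statistics.quantiles(quiz_scores.values(), n=5, method="inclusive")[-1]

# Participants scoring strictly above the cutoff are treated as experts.
experts = sorted(pid for pid, s in quiz_scores.items() if s > cutoff)

print(f"80th-percentile cutoff: {cutoff}")
print(f"experts: {experts}")
```

With tied scores at the cutoff, a strict comparison can select fewer than 20% of participants and an inclusive one more, so the choice of comparison matters in small samples.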

In sum, the SP method did not perform well when used to aggregate the judgments made by the total sample. When applied to an objectively assessed expert subsample, SP was the best-performing method. All of these methods were outperformed by the predictions of media members covering the NFL, the democratic aggregation of those predictions, and the modeling-based approach of fivethirtyeight.com.

Study 2

Study 2 examined the SP method in another domain: US elections. As in Study 1, we conducted parallel analyses on subjectively and objectively defined expert subsamples. The study was preregistered.