Mathematically aggregating experts’ predictions of possible futures

Structured protocols offer a transparent and systematic way to elicit and aggregate probabilistic predictions from multiple experts. These judgements can be aggregated behaviourally or mathematically to derive a final group prediction. Mathematical rules (e.g., weighted linear combinations of judgements) provide an objective approach to aggregation. The quality of this aggregation can be defined in terms of accuracy, calibration and informativeness. These measures can be used to compare different aggregation approaches and help decide which aggregation produces the “best” final prediction. When experts’ performance can be scored on similar questions ahead of time, these scores can be translated into performance-based weights, and a performance-based weighted aggregation can then be used. When this is not possible, however, several other aggregation methods, informed by measurable proxies for good performance, can be formulated and compared. Here, we develop a suite of aggregation methods, informed by previous experience and the available literature. We differentially weight our experts’ estimates by measures of reasoning, engagement, openness to changing their mind, informativeness, prior knowledge, and the extremity, asymmetry or granularity of estimates. Next, we investigate the relative performance of these aggregation methods using three datasets. The main goal of this research is to explore how measures of individuals’ knowledge and behaviour can be leveraged to produce a better performing combined group judgement. Although the accuracy, calibration, and informativeness of the majority of methods are very similar, a couple of the aggregation methods consistently distinguish themselves as among the best or the worst. Moreover, the majority of methods outperform the usual benchmarks provided by the simple average or the median of estimates.
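As a minimal sketch of the performance-weighted linear pooling described in the abstract (the function name, weights, and probabilities below are hypothetical, for illustration only):

```python
def weighted_pool(probs, weights):
    """Weighted linear combination of experts' probability estimates,
    with the weights normalized to sum to one."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probs)) / total

# Hypothetical example: two experts, the first given three times the
# weight of the second on the basis of prior performance scores.
print(weighted_pool([0.8, 0.4], [3.0, 1.0]))  # ~0.7
```

With equal weights this reduces to the simple average used as a benchmark in the paper.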

The authors consider the problem of aggregating experts' probability predictions of a future binary outcome. They consider many different aggregators derived from the weighted average, differing in how the weights are calculated. They apply the aggregators to three real-world datasets and compare their performance in terms of several well-established criteria. Even though no single aggregator emerges as a clear winner, the beta-transformed arithmetic mean (BetaArMean) performs very well.
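For readers unfamiliar with the BetaArMean: it takes the simple average of the experts' probabilities and pushes it through a Beta CDF. A minimal sketch, assuming illustrative integer shape parameters a = b = 2 (the function names are hypothetical; the paper would fit these parameters to data):

```python
from math import comb

def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b) for integer
    a, b >= 1, via the binomial-sum identity (no scipy needed)."""
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))

def beta_ar_mean(probs, a=2, b=2):
    """Beta-transformed arithmetic mean: average the experts'
    probabilities, then apply a Beta(a, b) CDF. With a = b > 1 this
    extremizes the average away from 0.5."""
    p_bar = sum(probs) / len(probs)
    return beta_cdf(p_bar, a, b)

print(beta_ar_mean([0.7, 0.8, 0.9]))  # pushes the 0.8 average toward 1 (~0.896)
```

Note that with a = b the transform leaves 0.5 fixed, so only confident averages are sharpened.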
Overall, I enjoyed reading the paper, and I believe the literature would benefit from an extensive comparison of different weighting schemes. However, the paper needs work before being published. I have listed below both general and specific comments. I hope they help the authors to improve this paper.

General Comments
1. Many of the weighting schemes are stated without providing any rationale or evidence from previous literature that such a scheme could be reasonable. The authors should put in more effort to explain the intuition behind each scheme and, whenever possible, cite past work that has argued for such a scheme.
2. The framing needs work. Currently the text is 33 pages long and feels like a long enumeration of different weighting schemes. Furthermore, the discussion and results only take up around 5 pages. Perhaps the following could help: (a) Introduction: Include a paragraph or two clearly stating what the goal of this study is. One idea is to state that unequal weights based on seed variables have been found to improve averaging in the context of point forecasts. This article now studies whether the result holds under probability forecasts. This is roughly what you already state in the beginning of Section 3. This statement, however, should come much earlier in the paper.
(b) Discussion: What is the main take-away or discovery in this paper? If you decide to use the study goal proposed above, then you should link the results and discussion to that. For instance, can you say anything about why different weighting schemes make little difference in the context of probability predictions? What makes this inherently different from point predictions? Can you propose a solution for making the probability forecasting context more like the point forecasting context?

Specific Comments
1. Line 75: The authors could consider the work on the Wisdom of the Inner Crowd that has found improved accuracy from averaging multiple predictions from the same individual. See, e.g., Herzog and Hertwig (2014).
2. Line 151: There is a typo: "outcomr" should be "outcome".
3. Line 151: The authors could give context by stating that a score of 0 implies perfect accuracy and that a score of 0.5 is guaranteed by constantly predicting 0.5 for all future events.
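To make the suggested context concrete, here is a minimal sketch assuming the original two-category form of the Brier score, under which the 0.5 figure for a constant forecast holds (the half-Brier convention would give 0.25 instead; the function name is hypothetical):

```python
def brier_score(preds, outcomes):
    """Two-category Brier score for binary events, averaged over claims:
    2 * (p - y)^2 per claim. 0 means perfect accuracy; a constant
    forecast of 0.5 scores exactly 0.5."""
    return sum(2 * (p - y) ** 2 for p, y in zip(preds, outcomes)) / len(preds)

outcomes = [1, 0, 1, 1]
print(brier_score([1, 0, 1, 1], outcomes))  # 0.0: perfect accuracy
print(brier_score([0.5] * 4, outcomes))     # 0.5: uninformative constant forecast
```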
4. Lines 174-175: The authors mention "arbitrary probability threshold." Can you explain what this means? The AUC requires a threshold (e.g., 0.5) above which the event is taken to occur and below which it is taken to not occur. How is this done exactly in your case?
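One way the authors could make the threshold question concrete is via the rank-based formulation of the AUC, which implicitly sweeps over all thresholds: it equals the probability that a randomly chosen resolved-true claim received a higher predicted probability than a randomly chosen resolved-false claim. A minimal sketch (function name hypothetical):

```python
def auc(preds, outcomes):
    """Rank-based AUC for binary outcomes: the fraction of
    (positive, negative) claim pairs in which the positive claim
    received the higher predicted probability (ties count half)."""
    pos = [p for p, y in zip(preds, outcomes) if y == 1]
    neg = [p for p, y in zip(preds, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 1.0: perfect ranking
```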
6. Lines 251-255: The authors claim that more seed variables are needed when probabilities are elicited. Can you provide intuition for this? My intuition says the opposite: by asking the experts for their full distribution (not just a point estimate), I gain more information about their true ability per prediction. This suggests that I would need fewer, not more, seed variables.
7. Line 269: The phrase "In formulae where we single out claim c, we will have claims indexed by d, for d = 1, . . . , C" is confusing. First you index claims by c, but then you also mention d. Perhaps a simple (one-sentence) example can help to clarify this.
8. Line 275: The authors should consider adding a table with summary statistics of the data. For instance, this could include the number of claims, the number of experts, the number of predictions per claim, and the number of claims that were true (i.e., the base rate).
9. Section 2.9: The weighted average of probability predictions is known to lack both calibration and sharpness even when the individual probabilities are calibrated. Ranjan and Gneiting (2010) show this for fixed weights; Satopää (2017) generalizes the result to random weights. This sub-optimality then led to the development of extremizing algorithms (see, e.g., Satopää et al. 2014; Baron et al. 2014). The authors should, in particular, review Baron et al. (2014) because it can help them motivate the median and some of the other aggregators. Now, these sub-optimality results are theoretical, and they do not imply that the weighted average can never be useful. There are papers arguing that the equally weighted average is "robust" in the sense that it does not rely on unstable parameter estimation (e.g., Jose and Winkler 2008). As a result, it can be useful. Finally, it is interesting that the authors do not cite Ranjan and Gneiting (2010) even though, I believe, this is where the BetaArMean aggregator was originally proposed.
23. Move Section 4 before Section 2. This way you can use the data as a motivation, and the descriptions of the aggregators would make more sense as well.
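As a sketch of the extremizing idea referenced in the comment above: average the experts' probabilities on the log-odds scale and inflate the result by a coefficient a > 1 before mapping back (this is the logit-aggregator family of Satopää et al. 2014, up to notation; the coefficient a = 2 and function name here are illustrative, not fitted):

```python
from math import exp, log

def extremized_mean(probs, a=2.0):
    """Average the experts' probabilities on the log-odds scale, then
    multiply by an extremizing coefficient a > 1 before mapping back
    through the logistic function. a = 1 gives the unextremized
    geometric mean of odds."""
    logits = [log(p / (1 - p)) for p in probs]
    z = a * sum(logits) / len(logits)
    return 1 / (1 + exp(-z))

print(extremized_mean([0.6, 0.7, 0.8]))  # more extreme than the plain mean of 0.7
```

This counteracts exactly the underconfidence of the linear pool that Ranjan and Gneiting (2010) document.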
24. Have the authors made the repliCATS and ACE-IDEA datasets publicly available? I did not see a mention of this in the paper.
25. Line 622: The GJP also considered teams (both trained and untrained) and superteams. Each team had around 5 individuals. In the publicly available GJP dataset, the untrained teams are indicated by condition 4a, the trained teams by condition 4b, and the superteams by condition 5. Perhaps it makes more sense to analyze these groups rather than "random groups" of 10 individuals.
26. In the ACE tournament the experts were allowed to make and update their predictions until the events resolved. Of course, predicting an event 3 months from now is much harder than something that will happen tomorrow. How did the authors take into account the different time horizons of the predictions? As a suggestion, they may consider using predictions made 30 days before the event resolutions, as was done in Satopää et al. (2021).
27. Figure 1: Could the authors comment on the statistical significance of these differences? For instance, the repliCATS dataset is very small; therefore, it is likely that none of these differences are statistically significant.
28. There are some inconsistencies in the discussion. For instance, on line 673 the authors state that "there are very few reasons to proclaim one aggregation method better than another." But then on line 690 the BetaArMean is declared a "clear winner."
29. Lines 712-720: I liked this discussion of the median. There are several papers arguing that the median performs well; consider, for instance, Hora et al. (2013) and Han and Budescu (2019). Perhaps these references can help the authors strengthen this result.
30. An interesting future study could compare the different types of augmented elicitation schemes where the decision-maker asks the experts to provide more than just the prediction of the outcome.