Characterising Uncertainty in Expert Assessments: Encoding Heavily Skewed Judgements

When limited or no observed data are available, it is often useful to obtain expert knowledge about parameters of interest, including point estimates and the uncertainty around these values. However, it is vital to elicit this information appropriately in order to obtain valid estimates. This is particularly important when the experts’ uncertainty about these estimates is strongly skewed, for instance when their best estimate is the same as the lowest value they consider possible. Also this is important when interest is in the aggregation of elicited values. In this paper, we compare alternative distributions for describing such estimates. The distributions considered include the lognormal, mirror lognormal, Normal and scaled Beta. The case study presented here involves estimation of the number of species in coral reefs, which requires eliciting counts within broader taxonomic groups, with highly skewed uncertainty estimates. This paper shows substantial gain in using the scaled Beta distribution, compared with Normal or lognormal distributions. We demonstrate that, for this case study on counting species, applying the novel encoding methodology developed in this paper can facilitate the acquisition of more rigorous estimates of (hierarchical) count data and credible bounds. The approach can also be applied to the more general case of enumerating a sampling frame via elicitation.


Introduction
This paper addresses the problem of eliciting from an expert both point and interval estimates of a parameter of interest, where the expert data is heavily skewed. The use of information elicited from experts to complement and inform statistical analyses is growing in popularity, particularly with the development of rigorous methods for obtaining such information. Elicited estimates of variables of interest may be useful, for example, when data gaps exist, such as in the post-Normal science situation in which inference is required before the data are available The known component may need to be disaggregated into sub-components K jk across several categories, which may reflect separate database records for sub-species or research studies. The discovered component can be disaggregated into D ji , i = 1, 2, 3 representing (1) cryptic, (2) genetically distinct, and (3) morpho-species in that taxon. The unknown sub-components U jℓ , i = 1, 2, 3 may arise due to: (1) sampling inefficiency, when species that have not been collected in areas previously sampled; or species that remain undiscovered because (2) particular locations or (3) habitats that remain relatively or completely unsampled (see [7] for further details).
In addition to point estimates of D ji and U jℓ , it is also important to capture interval estimates, reflecting the expert assessment of the plausible range of values for each component. Skewness in the interval assessments may be induced by the hierarchical structure of the counts. Some researchers are considered 'splitters' and tend to over-differentiate species, which at some later date may be resolved via aggregation, hence reducing the apparent richness. Thus the current reported count may provide an upper bound on the number of species in the taxon, leading to elicitation of positive skewness in the plausible range of species richness. Conversely, other researchers are considered 'lumpers' and may under-differentiate species, which will later lead to an apparent increase in richness. So current reported counts may provide a lower bound on the number of species in the taxon, leading to negative skewness in the plausible range of richness. Negative skewness could also arise, for example when individual species within a large taxon such as Family or Order have been assigned more than one name by different taxonomists. More extreme skewness (positive or negative) could also arise if the best estimate given by a taxonomist coincides with the known published number [11].
There are many analogues of this type of problem. For example, estimation of the population size of a country involves estimation of subgroups, some of which are more difficult to enumerate using standard census techniques, such as the number of homeless people e.g. [14,15] or minority people in remote areas [16]. Although some methods exist for obtaining ballpark values for these subgroups e.g. [17], experts could also provide informative estimates of the required enumerations. With continued divisions of the subgroups, the associated expressions of uncertainty would be expected in at least some situations to become more skewed.
This setup also extends more generally to that of estimating the size of a sampling frame in which the population is hierarchically stratified. The above decomposition into components, which are estimated with different levels of uncertainty, provides an obvious mechanism for compiling an overall estimate of the size of the sampling frame. As indicated above, the hierarchical structure may impose constraints on the aggregation of components. For instance, in a landscape ecology context, ecoregions are hierarchically defined, and since these form a nonoverlapping partition of the space, the number of regional ecosystem units within an ecoregion must 'add up'. In the species richness case study, this hierarchy is an inherent feature of taxonomic classes of classification, since all species belong to a series of higher taxonomic groups, from genus to kingdom.
Positive skewness in elicited values has been captured using the lognormal e.g. [18,19], Poisson [20], negative binomial [21] and Beta e.g. [22,23] distributions, as well as the quantiledefined Davies distributions [24]. Although they can be easily generalised to other data types, the above problems involve the elicitation and aggregation of point and interval estimates of counts, so a natural choice might be a distribution such as the Poisson, Negative Binomial or Multinomial. Single parameter distributions such as the Poisson make it difficult to separate the expert's best estimate from their uncertainty. Indeed, in our experience there are difficulties in using any of these distributions for elicitation of highly skewed intervals because of the nonindependence of the mean and variance [25]. Thus in this paper we consider the more flexible lognormal and mirror lognormal distributions to describe both left and right asymmetry, or alternatively the normal distribution for (reasonably) symmetric estimates. Despite the relationship between the mean and variance, the Beta distribution provides a very flexible shape that is capable of capturing high skewness. For this reason, we also compare the set of distributions based on the normal with the scaled Beta distribution. To our knowledge, the mirror lognormal distribution has not previously been used in the context of elicitation. We explore their potential to encode highly skewed expressions of the plausible range of values expressed by experts.
Eliciting expert opinion and encoding these values using a Beta distribution is a common approach; see, for example, the literature reviews in [22] and [23]. There are various software packages and routines that perform this encoding. We discuss here three examples, the R prevalence package, Elicitator [26] and the MATCH Uncertainty Elicitation Tool [27].
The functions betaExpert and betaPERT within the R prevalence package can fit 2and 4-parameter Beta distributions respectively. The expert provides a minimum value, most likely estimate (mode) or mean and maximum value. For the betaExpert function the lower and upper numbers must lie between 0 and 1, whereas for betaPERT the range can be anywhere on the real line. The betaExpert function uses least squares to estimate α and β. When the best guess is set to zero or one (and thus the expert only provides either an upper or lower estimate), then the corresponding α or β is set to one and least squares is used to find the unknown parameter. For the 4-parameter Beta distribution (betaPERT function) the α and β are calculated using the simple estimates proposed by [28]; see also [29]. Both of these functions allow the elicited best guess to be equal to the minimum or maximum value.
Morris et al. [27] developed a web-based elicitation tool called the MATCH Uncertainty Elicitation Tool. This tool provides five elicitation methods to estimate the expert's probability distribution for a variable X, namely the roulette method (based on the proportion of chips placed in bins spanning the range of X), the quartile method (based on eliciting the median, lower quartile and upper quartile of X), the tertile method (based on eliciting the tertiles of X), the probability method (based on eliciting three probabilities such as P(0 < X < 0.25), P(0.75 < X < 1), P(0 < X < 0.5) if X is between 0 and 1) and the hybrid method (based on eliciting the median and two probabilities). Six distributions are available for describing the elicited values, namely the normal, t, beta (scaled if the limits are not 0 and 1) gamma, log normal and log t. The software reports which distribution best fits the judgement; highly skewed judgements can lead to choice of the (scaled) Beta distribution. In the methodology underlying the MATCH tool, it was assumed that the elicited median would not equal either the elicited minimum or maximum, and hence does not allow for extreme skewness.
James et al [26] developed an elicitation software called Elicitator, which quantifies expert knowledge and incorporates this into a prior model in Bayesian regression. This tool can be applied to the problem of asking the expert about the probability of presence at a number of sites/observations with known covariate values (such as habitat and environmental variables). For each covariate, the expert provides the lower and upper bounds (99 or 95% credible interval), then specifies another interval such as the mode. These values are encoded into the Beta distribution using iteratively reweighted least squares. This paper will describe a methodology for capturing highly positively or negatively skewed counts. In our previous work [7], Normal and lognormal distributions were encoded when the expert provided answers as counts or multiplicative factors, and Beta distributions were encoded to elicited percentages. The details of encoding these distributions for elicited data were not specified in Fisher et al [7]. In this paper, we provide these details and also present an alternative distribution, the scaled Beta, which can be encoded to all elicited numbers. In addition, we address the overall problem of aggregating elicited counts across strata while respecting the laws of conservation of counts within hierarchies. More detailed evaluation and comparison of different modelling choices is provided in this paper, which could not be examined in detail in Fisher et al [7]. This comparative analysis will investigate the choice of statistical distributions for the type of extreme cases encountered in the case study data. The methods are applied to the reef biodiversity case study, in which we are estimating the total number of species over a broad range of taxonomic groups.

Elicitation protocol
The choice of summary statistics used to describe the distribution and the order in which these statistics are elicited are both important considerations [3,30,31]. Following the recommendations of these authors, we adopted an 'outside-in' protocol for elicitation, which starts by eliciting a plausible interval of values, and ends by eliciting a point estimate. Under our protocol, the following information was sequentially obtained: (i) extrema, typically as absolute lower and upper limits, (ii) quantiles at fixed cumulative probabilities, as realistic lower and upper limits, and (iii) the mode, as the most plausible value. So for an initial estimate, the summary statistics elicited were, in order [7]: (i) minimum (M L ) and maximum (M U ), so that the expert is 100% sure that the number is within this range; (ii) more realistic lower (L) and upper (U) bounds, and the uncertainty/sureness (P) around these bounds, elicited by asking how sure they are that the real number lies within these bounds; and (iii) their best estimate (B), the number they thought was most plausible. Following [24], B was encoded as a mode, since this value was elicited using a phrase such as, "what is the most likely value?" A further statistical decision is the unit of elicitation. In the case study, we allowed the expert to choose to provide answers using one of three units. For an absolute estimate, a count would indicate the number of actual additional species that would add to the previous hierarchy's estimate of the total number of species on coral reefs, whereas a relative estimate could be provided via either a percentage (of the number of named species) or a multiplicative factor (e.g. twice as many as the named species). This was designed to allow the expert to provide information in units with which they are most comfortable [32].

Statistical encoding of the elicited information
When encoding the elicited values into statistical distributions, it is useful to be able to capture both uncertainty as well as best estimates. This requires that a distribution has at least two parameters [24]. For example, in the case study, depending on the taxonomic group, regardless of the best estimate of the number of species in a taxon, the variance of the elicited number of species can range from very small (the total number of species is known and there is little uncertainty around this number) to very large (the total number of species is unknown and the uncertainty around how small/large this number could be is large).
In the elicitation software developed by [7], the Normal and lognormal distributions were fitted when the expert provided answers as counts or multiplicative factors. However, when the expert answered as a percentage, a Beta distribution was fitted to the corresponding proportion. An alternative and more flexible approach not considered by Fisher et al. [7] is to fit a scaled Beta distribution to all numbers elicited from the expert. This provides a common platform for encoding the full range of cases, from symmetric to extreme left and right skewed elicited values, and has the added advantage of being applicable across a wide range of units of elicitation including the three used in the case study. The upper number is a suitably large real value in practice, where the expert considers that there is negligible chance that it will be exceeded. While Normal/lognormal distributions can be fitted to percentages, these distributions need to be truncated so that the numbers range between zero and 100 percent, or else the percentages would have to be transformed, for example via a log odds transformation.
Below is a description of how the elicited numbers are fitted to Normal, lognormal, Beta and scaled Beta distributions.
Skewness. We consider statistical distributions with two parameters to encode expert assessments comprising three summary statistics (B, L, U) for a specified cumulative probability P. A distribution with a single parameter would not be flexible enough to capture spread, reflecting expert uncertainty, as well as location, representing the best estimate. In contrast, in most cases a distribution with three parameters would fit the limited elicited data too closely, leaving no room for elicitation error. Obvious exceptions to this include the case in which an expert provides a symmetric interval about the mode, in which case three elicited judgements can be fitted perfectly with a two-parameter distribution. This leaves several choices for twoparameter distributions, which can all be constructed to have the same location and similar spread. Skewness provides a useful way to distinguish the differences among these. Since quantiles are pivotal to fitting the expert uncertainty, we consider the quantile-based definition of skewness [33], which reflects symmetry of the encoded quantiles around the encoded median: This intuitive measure of the relative distance of each quantile from the pth quantile was documented at least a century ago by [34] (hence the B subscript). Unlike the moment-based measure of skewness, the quantile based measure is bounded by −1 and +1.
Encoding via least squares. Encoding a two-parameter distribution such as the Normal or Beta is typically addressed via estimating equations to provide a deterministic encoding of 2 parameters from the same amount of data [23]. The least squares fitting approach has been utilized previously by these authors [24][25][26]35] and those of MATCH [27]. To allow for some uncertainty in elicitation, we may slightly expand the estimation problem to encoding two parameters from three pieces of data, hence defining a statistical rather than deterministic approach. Let θ denote a K-dimensional vector of parameters associated with the probability distribution used to encode the expert estimates. For a given distribution, the aim is to find the parameter vector θ that best fits the elicited quantities. Denote the elicited 'data' by D with encoded quantities D(θ) corresponding to their expected values, so that for K = 3: where Q(p; θ) = F −1 (p; θ) is the quantile function (inverse cumulative probability distribution) evaluated at cumulative probability p, with parameter vector θ. As in [24], we assume that the elicited quantities (for a particular expert) are far enough apart that the difference between elicited and encoded values has effectively constant variance ϕ 2 after appropriate transformation. This effectively defines a regression [35]: which implicitly involves ϕ as a parameter for uncertainty in elicitation. A least squares approach minimizes the distance ρ between the elicited and encoded quantities, as measured by the mean sum of squares (MSS): Straightforwardly, we use Euclidean distance with ρ(y 1 , y 2 ) = (y 1 − y 2 ) 2 . This is appropriate when encoding a symmetric distribution such as the Normal. When encoding skewed counts, we use a log-transformation ρ(y 1 , y 2 ) = (log(y 1 ) − log(y 2 )) 2 . When encoding proportions, we use a logit-transformation ρ(y 1 , y 2 ) = (logit(y 1 ) − logit(y 2 )) 2 .
Given that the parameters of the Beta distribution jointly define the mean and variance, and hence the probability intervals, and given that, according to the elicitation protocol the outer limits are elicited before the mode, it is reasonable to place equal weight on the K = 3 summary statistics. Other arguments could be made, of course, for placing differential weights on these statistics, and a sensitivity analysis could be undertaken to assess the impact of such decisions. This was not pursued here.
The least-squares estimator provides a fast and generic approach to estimation, available in real-time, which allows instantaneous verbal or graphical feedback [3,26]. For the coral reef study, the elicitation of the number of species in a taxon was conducted in a single sitting, so it was important to provide instantaneous feedback to experts. To implement a fast and efficient search across parameter space, the function 'optim' in R [36] was used to implement the [37] method to minimise the MSS.
Normal/lognormal Distributions. We use the usual μ-σ parameterization of the Normal distribution: The lognormal provides a distribution of non-negative values that are positively skewed, and can be defined in terms of the Normal distribution: The mirror lognormal distribution is negatively skewed, and is constructed as the mirror image of the lognormal distribution, with reflection around an upper bound U: The classical definition of mirroring is defined in terms of the cumulative distribution function (cdf). In this case, log X * N(μ, σ 2 ) which has cdf F(Á), means that the classical mirrored distribution has cdf 1 − F(Á), which is the cdf of −logX. Here log(L) + log(U) − log(X) has close to the required mirrored distribution, apart from a negligible numerical discrepancy caused by presuming that Pr(log X ! log U) and Pr(log X log L) are small and are of similar magnitude so 'cancel'.
A lognormal, normal or mirror lognormal distribution can be applied if the expert's response is in the form of a count or multiplicative factor. In the software developed by [7], if the expert's data is positively skewed, i.e. if the respondent's best guess B falls below the lower quartile q(0.25), then a lognormal distribution is fit. Similarly if if the expert's data is negatively skewed, i.e. B falls above the upper quartile q(0.75) then a mirror lognormal is fit. Otherwise, if B falls within the inter-quartile range then the normal distribution is fit, together with a lognormal, if B < q(0.5), or the mirror lognormal, if B > q(0.5). Then the MSS is used to select the best distribution.
For a normal distribution (Fig 1), we may encode parameters μ, σ via the least-squares approach detailed above, using the following procedure: 1. Interpret the best estimate as the mode, which in the Normal case also encodes the mean, sô m ¼ B; 2. Use the least squares estimate of the standard deviationŝ for fixedm ¼ B, using the Euclidean distance ρ(y 1 , y 2 ) = (y 1 − y 2 ) 2 for the MSS.
For a lognormal distribution (Fig 1), we use a similar approach 1. Interpret the best estimate as the modeB ¼ e mÀs 2 and use this to constrain the μ − σ-space.
2. Use least squares to estimate the standard deviation, using the log-transformed Euclidean distance, with MSS LN is as specified in detail in S1 Text.
Similarly least squares estimates for μ, σ, U can be obtained for the mirror lognormal distribution (Fig 1).
2. If the expert declares that the best estimate B equals the lower L number (i.e. extreme positive skewness), then set β to one and calculate α. Alternatively if B = U (i.e. extreme negative skewness), then set α to one and calculate β.
3. Use least squares to estimate α and β, based on the logit-transformed Euclidean distance and MSS Beta defined in S1 Text.
Scaled Beta Distribution. All elicited numbers can be fitted to a scaled Beta distribution, irrespective of the units elicited and degree of skewness, by scaling them to the range [0, 1] and fitting the Beta distribution described above. Of course, if a proportion is elicited on the full unit interval, then no scaling is required. However a proportion with minimum and maximum falling within this interval could be fit to a scaled Beta distribution. Thus the only additional step is the scaling of units elicited as numbers of multiplicative factors. Although the expert's values of M L and M U could be used to effect this scaling, they can induce an unnecessarily severe reduction since they represent the 100% credible bound. The lower and upper bounds elicited can be overly conservative. For instance an expert may specify that there is negligible chance (by which they mean nearly 100% chance, i.e. at least 99% chance) that the number of species exceeds some upper bound. However, the level of conservatism is not clear until they are asked to bring their bounds in a little to reflect a more realistic upper bound (e.g. to define an upper bound exceeded with exactly 5% chance). For this reason, it is important that the lower and upper bounds are elicited, and feedback is provided, on the original scale of the number of interest.
An alternative is to calculate scaling factors Δ U and Δ L , respectively for the lower and upper bound: , the values of B, L and U are converted into proportions by These numbers can then be fitted to the above Beta Distribution, to encode the parameters α and β.

Aggregating values
Now we return to the main problem of obtaining an aggregate estimate of X in Eq 1. One approach would be to simply sum the elicited valuesX ¼ P j B j , i.e., the best guesses for each component. However, this approach does not incorporate the expert's uncertainty, and does not account for discrepancies between elicited modes and encoded means. If all components were encoded using Normal distributions X j * N(μ j , σ j ), then we could straightforwardly apply analytical results so that X $ Nð P j m j ; ffiffiffiffiffiffiffiffiffiffiffi ffi P j s 2 j q Þ. However here we potentially have finite mixtures of Normal, lognormal, mirror lognormal and Beta distributions. Thus we use simulation to estimate the expected value of X and the corresponding variance and desired probability intervals.
As an illustration of a simple simulation approach, in the case study and comparative study described below, 100,000 random values were drawn from each distribution. The desired aggregate number was then obtained by summing the simulated values at each iteration. The best guess and the corresponding lower and upper probability bounds were calculated from the resulting empirical distribution of summed values.
A different type of aggregation problem occurs when there are multiple experts and the goal is to estimate a single distribution which captures their combined beliefs. In such cases, often a "consensus" approach is adopted in which a panel of experts is consulted. This is a particular form of a behavioural aggregation approach [23], where group discussions lead to a consensus, similar to the moderation phase of the Delphi method e.g. [38]. The disadvantage of this consensus approach is that disagreement between experts or domination of the panel by one or more experts may lead to biased outcomes e.g. [39,40].
A second way of combining these expert opinions is to elicit the information separately from individual experts and then combine these multiple opinions into a single distribution. This mathematical approach to aggregation [23] is particularly useful if the experts live far apart and cannot easily be assembled into an expert panel, and where, as in the case study, typically only one expert is available per component (here a taxonomic group). To estimate this consensus distribution mathematically, a common and simple method uses a mixture of experts model, via a suitably transformed weighted sum of the individual experts' distributions. This is also known as linear or geometric pooling [23]. A third approach employs statistical modelling to consider the consensus distribution in a multi-level model [35]. In this study we found that the standard approach of linear pooling was sufficient, and flexibly could be used, either to pool distributions across experts for a particular taxon, or when summing contributions (and their uncertainty) to overall counts arising from different sources.
The most "democratic" method of assigning weights to each expert is to apply an equal weight to the opinions of each expert. Alternatively, more weight might be given to some experts over others based on their assessed level of expertise [23]. Other work [41] has developed a conceptual model that outlines criteria for defining taxonomic expertise, that could provide a basis for transparently defining weights for each expert [42]. Regardless of the weighting, the problem reduces to a sum of distributions and can be addressed using the approaches described above. For this study, it was sufficient to consider equal weighting of experts in the few cases where multiple experts had knowledge on the same taxonomic group.

Comparative study
The comparative study was designed to evaluate the relative ability of the Normal and lognormal distributions, and the scaled Beta distribution, to fit a wide range of skewed distributions. The minimum values M L were set at 0, 50, 90 and 100 and the maximum values M U were 200, 210, 250 and 500. The lower and upper bounds were set at L = 100 and U = 200, respectively, with sureness values of 0.7, 0.8 and 0.9. All values of the best guess between the lower and upper limits were also tested by setting B to be a sequence between 100 and 200 incremented by one. The sureness, minimum and maximum values were selected based on the case study.
Estimates obtained for the sum of distributions, based on Eq 1, were also obtained for the two distributional approaches. These were compared using graphical displays of the data, fitted distributions and encoded lower, upper and mode values. The root mean square error (RMSE), calculated as the square root of the average of the squared differences between the estimated and true values, was also calculated for each distribution. This was used as a comparative measure of goodness of fit between the distributions. We also evaluated quantile-based skewness (γ B (0.5)) to supplement the elicited quantities by a description of the level of skewness in each scenario.

Case study
The case study presented in this paper involves encoding taxonomists' estimates of species richness within the class Malacostraca, and associated uncertainty about these estimates (Fig  2). Malacostraca is the largest of six classes of crustaceans and is comprised of the orders Amphipoda and Decapoda (Fig 2). The Decapoda in turn are comprised by the Anomura and Caridea. We were able to elicit information from one expert on Anomura, and two experts for each of Caridea and Amphipoda. There was no expert available for Malacostraca and Decapoda. All elicitations took approximately 1.5 hours to complete and all taxonomists expressed confidence that the process captured well their views of the global species richness of the taxa for which we elicited their expert knowledge. Other details about the elicitation process, including feedback, are reported in [7].
For this case study, 63.3% of the elicited estimates of uncertainty were reasonably symmetric, 26.67% were positively skewed and 6.67% were negatively skewed. Of the cases with positively skewed elicited uncertainty, the majority (20% of the total) were extremely positively skewed, that is B = L. For one expert, one of the estimates was zero. From this elicited information, the aim was to calculate the total number of species in each taxon and in Malacostraca. For each taxon and expert, the elicited information for each component was fitted to both the scaled Beta and Normal/lognormal distributions. Fig 3 shows three examples of the expert data and the performance of the scaled Beta and Normal/lognormal distributions fitted to these expert data. The first, (a), is an example of extreme positively skewed expert data (first expert for Amphipoda). In the second example, (b), the expert data are symmetric (from taxon Amphipoda). The last example, (c), illustrates negatively skewed expert data (second expert for Amphipoda). The elicited numbers (M L , L, B, U, M U , P) for these three examples were (200, 300, 300, 450, 600, 0.8), (30,50, 75, 105, 120, 0.9) and (100, 240, 343, 375, 450, 0.9), respectively (S1 Table).
In    Encoding Heavily Skewed Expert Count Data symmetric, with scaled Beta distribution γ B (0.5) = 0.044 and Normal γ B (0.5) = 0.091. The RMSE for both distributions were small, with Beta RMSE = 7.3 × 10 −5 and the normal RMSE = 2.041. For example (c), the Beta distribution correctly reproduced the expert data, but the mirror lognormal incorrectly encoded all three numbers, with smaller fitted values of L and U (229.9 and 370.6 respectively) and a larger fitted mode. Both distributions had similar negative skewness values at γ B (0.5) = −0.248 for scaled Beta distribution and γ B (0.5) = −0.232 for mirror lognormal. The RMSE further demonstrated that the Beta performed better than the mirror lognormal, with RMSE close to zero (7.96 × 10 −4 ) for the Beta compared to lognormal (RMSE = 10.198). Fig 4 shows the estimated total number of species for Malacostraca and the sub-taxa, in which all components estimated by all experts were encoded using the scaled Beta distribution (Fig 4a) and Normal/lognormal distributions (Fig 4b). In the few cases where there were multiple experts, the total number of species over both experts and from each expert are also shown.
The number of species for Amphipoda assessed by the first expert was approximately double that provided by the second expert. The first expert provided numbers that were positively skewed for all components, with all but one being extremely positively skewed. The scaled Beta distribution was more positively skewed, with a skewness value of γ B (0.5) = 0.527 compared with γ B (0.5) = 0.323 for the Normal/lognormal distributions. For the second expert, three of the components were symmetric, one was negatively skewed and the other was positively skewed. Both distributions had a skewness value very close to zero. Thus the distribution for the total number of species for Amphipoda was positively skewed. The skewness value was very similar between the two distributions, with scaled Beta distribution γ B (0.5) = 0.283 and Normal/lognormal distributions γ B (0.5) = 0.264. Therefore, the total number of species for Amphipoda, over both experts, irrespective of the distribution used, encompasses both of these estimates. For Caridea, there was an overlap in the number of species between the two experts. For both experts, the uncertainty was symmetric for all components. The skewness value was very close to zero for both distributions for each expert and both experts combined. Similar to Amphipoda, the total number of species for Caridea incorporates both experts' estimates, regardless of the distribution used.
When the expert data for a particular component reflected symmetry in their uncertainty, then the encoded mode was very similar between the two distributions. However, much larger differences were evident when the expert data was positively or negatively skewed. This difference was greater for extremely positively skewed data. For example, the first expert of Amphipoda provided a best guess of 18,000 for the undiscovered species component. The encoded mode for this component was 18,000 based on the Beta distribution (since this algorithm equates the best estimate to the mode), compared to an encoded value of 19,474 for a lognormal.
The shape of uncertainty in the total number of species for Malacostraca is similar between the scaled Beta and Normal/lognormal distributions. Both distributions resulted in a positive skewed shape, with the skewness value γ B (0.5) = 0.275 and γ B (0.5) = 0.245 respectively. For the Beta, the lower 95% credible bound, mode and upper 95% credible bound were 11,564, 18,877 and 39,384, respectively. The analogous values for the Normal/lognormal distributions were 12,882, 19,925 and 35,134. Therefore using the Normal/lognormal distributions resulted in a slightly inflated total number of species, with tighter confidence.

Comparative study
First, we investigate how well the Normal/lognormal distributions and the scaled Beta distribution fitted the simulated expert assessments, particularly for extreme positive and negative skewness. Second, we examine the impact of various levels of sureness on the fitted and elicited values.
The results of these simulations for P = 0.80 are displayed in Fig 5. A similar pattern occurred for other levels of sureness. The main purpose of this figure is to show that the scaled Beta distribution accurately encodes the mode, lower and upper values (y-axis) for the full range of best guess values (x-axis). Even at the extremes when the best guess coincided with either the lower or upper values, the scaled Beta distribution encoded the three assessments Encoding Heavily Skewed Expert Count Data accurately. In contrast, the Normal/lognormal distributions only correctly encode one of these values. By specification of the algorithm, the Normal distribution also reproduced the mode (yaxis) for all best guess values (x-axis). When the best guess was 150, then the normal distribution correctly encoded the lower and upper limits. However, as the best guess diverged from this number, i.e. as B moved towards the lower and upper bounds (of the 80% plausibility Encoding Heavily Skewed Expert Count Data interval) that is 100 or 200, then the encoded values of L and U from the Normal distribution deviated from the corresponding simulated (true) assessments. For all values of B between the lower bound (100) and 50th percentile (150), the fitted values of U from the lognormal distribution more closely matched the true values than for B and L. Correspondingly, the fitted values of L from the mirror lognormal were very similar to the simulated lower assessment, when the best guess was between 50th percentile (150) and upper bound (200). When B was approximately 130, then the lognormal distribution encoded B, L and U accurately. Similar results were observed for the mirror lognormal when B was 180. When the best guess was less than 130, the encoded mode and lower numbers from the lognormal deviated substantially from the simulated expert's numbers. Analogously, when the best guess was greater than 180, the encoded mode and upper values from the mirror lognormal also substantially differed from the simulated expert assessments.
For each distribution, the skewness (quantile-based measurement) was calculated for each simulated best guess, sureness, minimum and maximum. For different levels of P, M L and M U , similar skewness values were obtained. Fig 6a shows the skewness for each simulated best guess for both the Normal/lognormal and scaled Beta distributions. When the best guess was equal to 150, the skewness value was very close to zero for all distributions. For all best guess values, the skewness value of the scaled Beta distribution was closer to zero, i.e. less skewed, compared to the other distributions. At extreme positive (and negative) skewness, when the best guess equalled lower (or upper) number, the Normal has a skewness of γ B (0.5) = 1, the lognormal has γ B (0.5) = 0.422, the mirror lognormal has γ B (0.5) = −0.422 and the Beta has γ B (0.5) = 0.356. Since the skewness value for the Beta was the closest to zero, thus the Beta was best at estimating the mode at extreme positive (and negative) skewness.
The RMSE was calculated for each simulated best guess, sureness, minimum and maximum, for each distribution. Similar results were obtained for different levels of P, M L and M U . For both the Normal/lognormal and scaled Beta distributions, the RMSE is displayed for each simulated best guess in Fig 6b. For all best guess values, the scaled Beta distribution had a RMSE very close to zero, irrespective of the degree of skewness. The RMSE for the Beta distribution was also very close to zero for different levels of sureness, minimum and maximum values. When the simulated data exhibited extreme skewness (when B was equal to either L or U), then the RMSE was slightly larger, but still less than 0.5. For the Normal distribution, when the simulated values were exactly symmetric, i.e. when B was equal to 150, then the RMSE was close to zero but increased exponentially as B decreased or increased. Similarly, the RMSE values for the lognormal and mirror lognormal were almost zero when B was 130 and 180, respectively, but increased when B deviated from these values. This trend occurred for all combinations of P, M L and M U .
Overall, the scaled Beta distribution performed better than the Normal/lognormal distributions, i.e. the encoded mode, lower and upper values more closely matched the values nominated by experts for different levels of sureness, different ranges between minimum and maximum, and in the presence of either positive and negative skewness. In addition, the Beta distribution was better able to cope with extreme skewness.

Discussion
In this paper, the primary problem addressed is the estimation of the number of classes of a (hierarchical) classification. In particular, we have presented and evaluated statistical methodology for encoding of elicited information into several distributions from single and multiple experts. In our previous work [7], Normal and lognormal distributions were encoded when the expert provided answers as counts or multiplicative factors, and Beta distributions were encoded to elicited percentages. The details of encoding these distributions for elicited data were not specified in [7]. In this paper, we have provided these details and also presented an alternative distribution, the scaled Beta distribution, which can be encoded to all elicited numbers. This paper also describes how to calculate the total number of species within a group, estimate the total number across multiple experts and sum over multiple groups and experts. The two-parameter beta has a number of appealing features with respect to representing elicited statistics as a probability distribution. It is the conjugate distribution for the Binomial distribution and is also mathematically appealing when using it as a prior for other distributions that have a probability, proportion or rate as a parameter. It is mathematically related to a range of other distributions, including the Bernoulli, triangular, arcsine, F, exponential, Dirichlet and Pearson Type 1 distributions, for specific parameter values, which again can be helpful when developing prior distributions. It is a natural descriptor for order statistics, which provides another perspective on elicited probabilities The mean, mode and variance of the twoparameter beta distribution have closed forms and, although there is no such closed form for the median, accurate approximations exist in terms of very small relative and absolute errors. The relationships between the mean, median and mode, and between the skewness and kurtosis of this distribution, are well defined. Moreover, the relative entropy, or Kullback-Leibler divergence, which measures the inefficiency of assuming incorrect parameters for the distribution, can be derived in closed form. Finally and perhaps most importantly, this distribution is very parsimonious in that has only two parameters yet can represent a very wide variety of shapes, including symmetric, U-shape, two-point, uniform, semi-elliptic, parabolic, bellshaped, spike, skewed, J-shaped and reverse J-shaped, right triangular, straight line, and so on.
The consideration of expert judgements as deviations from an overall (two-parameter) Beta distribution is consistent with the idea of random behaviour of percentages and proportions. It is also consistent with a Bayesian perspective that each expert can indeed have a unique distribution that can be modelled by a Beta distribution with an associated uncertainty of estimation, and that these individual distributions can be combined in a formal, effectively varianceweighted, manner around an overall Beta distribution.
Given the above advantages of the Beta distribution, it is difficult to consider a more appropriate alternative for the purpose of representing the three statistics elicited from experts in our study. Exceptions are the Normal, log-normal and mirror log-normal distributions, because of their familiarity, mathematical tractability and the independence of their parameters. It is for this reason that we conducted a comparative assessment of these distributional choices.
The appeal of the scaled Beta distribution compared with the Normal and lognormal distributions is that it can capture all quantities likely to be encountered when enumerating population components, namely raw or relative counts or proportions, either in the presence of high negative or high positive skewness. Moreover, the case study and comparative studies presented in this paper indicate that this distribution performed better than the Normal and lognormal distributions over all levels of skewness. Specifically, in the case study, the encoded summary statistics from the scaled Beta distribution closely matched the actual values elicited from the expert. This distribution was also able to provide accurate encoded values under extreme skewness, for example when the experts' best guess equalled either their lower or upper estimate. Such situations did arise during the elicitation of species richness estimates from our expert taxonomists [7]. For example, in the case of class Malacostraca, 20% of the expert's best guesses equalled their lower estimates. In the comparative study, for all simulated best guess values, the Beta distribution was able to accurately encode the lower, mode (a best guess) and upper numbers, resulting in RMSE values close to zero. In contrast, the encoded values from the Normal and lognormal distributions were different from the simulated experts' values and the RMSE values were very large for the majority of simulated cases. For more symmetric expert data, when the best guess was equal to or very close to the 50% quartile, accurate encoded values were obtained using all three distributions, with the RMSE values close to zero. In the case study presented here, over half (63.3%) of all data elicited were relatively symmetric.
The encoding method proposed in this paper is different to the ones provided in MATCH's Uncertainty Elicitation Tool [27]. To clarify this we compare five features of these two approaches: A) elicitation method (in terms of the protocol, as described in [13]); B) the quantities that are elicited; C) the order in which quantities are elicited; D) the extent of skewness permitted; and E) the choice of distribution. In addressing these five features, we consider five methods for encoding a distribution provided by MATCH. Quantities elicited fall into one of five categories, leading to different forms of elicitation method (item A). These are: (1) a discrete version of the probability distribution, which is a specific version of the P-method of [13]; (2) a median and 1 or 2 quantiles with fixed probability (section 3.1.2 and 3.1.3 of [27]), which is the Q-method of [13]; (3) three probabilities for user-specified intervals (section 3.1.4 of [27]), which is the P-method of [13]; or (4) a median together with two such probabilities (section 3.1.5 of [27]), a hybrid PQ-method of [13]. These five methods thus directly correspond to the P-, Q-or PQ-methods for encoding probabilities, also reviewed in [23] and [32] and compared to the Elicitator method in [25]. In contrast, our tool asked experts to specify the mode together with two pairs of quantiles and the associated probability of each pair, which a modeextended version of the PQ-elicitation method [13].
In terms of elicited quantities (B), our method asks the expert to specify the most plausible value, which is encoded as the mode rather than the median, and is not sought in any of the MATCH methods (1-4), but when working with ecological experts has proved more intuitive [3,25]. Although when encoding the normal distribution, these quantities are mathematically equivalent, the wording of questions is necessarily different. For the order of elicitation (C), in contrast to MATCH and the four-point elicitation method [43], our tool enforces an "outside-in" approach to specifying the distribution [25,26], by first asking for the lower and upper bounds, and then seeking uncertainty/sureness around these bounds, and finally asking for the mode (best guess). This helps reduce the potential to misinterpret the bounds as a confidence interval on the best estimate, rather than the desired (and hence broader) bounds on plausible range of values for the variable. In addition, regarding the extent of skewness (D), we allow for extreme skewness, where the mode is equal to the lower or upper bound. As discussed in the Introduction, there are several reasons that extreme skewness may arise in some cases, such as the motivating case study in taxonomy. For example, both the best guess and lower bound (on the number of species in a taxon) may correspond to the known published number [11]. Finally, the choices of distribution (E) both MATCH and our tool include the normal, log normal and Beta. Our tool focuses on supporting elicitation of skewed counts by also including the mirror lognormal distribution, whereas MATCH focuses on supporting elicitation of data types beyond counts and proportions, via the t, log t, and gamma distributions.
Six distributions are available in the MATCH tool for describing the elicited values, namely the normal, t, beta (scaled if the limits are not 0 and 1) gamma, log normal and log t. The software reports which distribution best fits the judgement; highly skewed judgements can lead to choice of the (scaled) Beta distribution. In the methodology underlying the MATCH tool, it was assumed that the elicited median would not equal either the elicited minimum or maximum, and hence does not allow for extreme skewness.
As detailed above the elicitation method proposed in this paper is quite different to MATCH's Uncertainty Elicitation Tool [27], in terms of elicitation method, elicited quantities, and order of elicitation. In addition, we allow for extreme skewness, i.e. the mode being equal to the lower or upper number, which also guides the selection of different distributions for encoding.
As discussed above, this paper also details a case study focused on estimating the total number of species globally for a small subset of taxa on coral reefs. Estimation of total species richness on Earth, whether by taxon, habitat, ecosystem or the entire planet is an important and difficult problem that has defied resolution for many decades [44]. Because the taxonomy of the worldâ??s biota is still substantially incomplete [45], there is currently no opportunity to validate estimates of global species richness, rendering this a post-normal science situation as described earlier. As a consequence, global estimates of species richness must rely on some form and/or combination of functional models, species discovery curves, or the use of expert knowledge of how many species have been described and named and how many more are likely to remain to be discovered. Functional models and the use of species discovery curves, however, have proved ineffective to date if convergence on agreed estimates is used as the criterion of success [44]. In our opinion, elicitation and use of expert knowledge provides the best option currently available for achieving convergence in these estimates; see [42] for further discussion. Given the possibility that many millions of species remain to be discovered, described and named and that the pool of taxonomists available to accomplish this task is limited, harnessing expert knowledge to provide better estimates of species richness presents its own challenges.
Some of these challenges of describing expert opinion were clearly evident in the case study we presented. The distributions for the various taxa varied from strongly negatively skewed, through to symmetrical and strongly positively skewed dependent largely on the opinion of the experts about the completeness of the taxonomy of those particular groups; e.g., the more complete the taxonomy or where a high degrees of synonymy was suspected the more likely was B to approximate U. Despite differences in B between taxonomists elicited for a particular taxon, multiple taxonomists consistently ranked Caridea as being less than half as rich as Amphipoda. It was also evident that of the distributions fit to these data, the scaled Beta distribution proved best on a series of criteria and is therefore recommended for future applications. In addition, when multiple experts are available to provide estimates for one class or taxon they may provide, as was the case here, quite different estimates and skewnesses. These separate estimates and skewnesses must be taken into account in assembling a composite estimate for that taxon, and these taxon-specific estimates must then be combined if an estimate higher up the taxonomic hierarchy is desired. The method described here is "democratic", in that it captures and formally combines the estimates elicited from all of the experts. Using the scaled Beta distribution makes combining estimates from multiple experts simpler.
This approach to describing expert opinion can be easily modified to suit various research questions, where the problem involves estimating (hierarchical) counts. For example, it is applicable to estimating the number of species in other terrestrial and marine environments or estimating population sizes of rare species. It is also applicable in quite different contexts, such as estimating the number of homeless people in a country. This is a well known difficult problem [14,[46][47][48] for which expert opinion could be a valuable complement, if it is elicited and represented accurately and adequately. We suggest that the methodology proposed here could achieve this.
Supporting Information S1 Text. We provide additional supporting details of the MSS for lognormal and Beta distributions. (PDF) S1 Table. This table contains the Malacostraca data. First column is the expert number. Next 12 columns relate to the taxonomic classification. Last six columns contain the expert data, which includes minimum M L , lower L, best estimate B, upper U, maximum M U and sureness P. (XLSX)