Efficient estimation of stereo thresholds: What slope should be assumed for the psychometric function?

Bayesian staircases are widely used in psychophysics to estimate detection thresholds. Simulations have revealed the importance of the parameters selected for the assumed subject’s psychometric function in enabling thresholds to be estimated with small bias and high precision. One important parameter is the slope of the psychometric function, or equivalently its spread. This is often held fixed, rather than estimated for individual subjects, because much larger numbers of trials are required to estimate the spread as well as the threshold. However, if this fixed value is wrong, the threshold estimate can be biased. Here we determine the optimal slope to minimize bias and maximize precision when measuring stereoacuity with Bayesian staircases. We performed 2- and 4AFC disparity detection stereo experiments in order to measure the spread of the disparity psychometric function in human observers assuming a Logistic function. We found a wide range, between 0.03 and 3.5 log10 arcsec, with little change with age. We then ran simulations to examine the optimal spread using the empirical data. From our simulations and for three different experiments, we recommend selecting assumed spread values between the percentiles 60–80% of the population distribution of spreads (these percentiles can be extended to other type of thresholds). For stereo thresholds, we recommend a spread around the value σ = 1.7 log10 arcsec for 2AFC (slope β = 4.3 /log10 arcsec), and around σ = 1.5 log10 arcsec for 4AFC (β = 4.9 /log10 arcsec). Finally, we compared a Bayesian procedure (ZEST using the optimal σ) with five Bayesian procedures that are versions of ZEST-2D, Psi, and Psi-marginal. In general, for the conditions tested, ZEST optimal σ showed the lowest threshold bias and highest precision.


Introduction
Bayesian and maximum-likelihood procedures are widely used in psychophysics to estimate detection thresholds. Given that these are parametric procedures, the experimenter has to provide in advance the parameters for the assumed psychometric function or model function [1]. PLOS  Several simulation studies have shown the importance of choosing the adequate parameters. In general, when there is a mismatch between the subject's psychometric function and the model function, threshold bias (i.e. the difference between the estimated and the real threshold) emerges [2][3][4][5]. Thus, before using Bayesian procedures (e.g. ZEST, [6]; QUEST, [7]), a previous experiment is ideally needed to estimate the subject's psychometric function in order to avoid biased estimates. However, this will require more trials than is often practical, especially in a clinical setting. For this reason, parameters other than threshold are often held fixed at a constant value for all participants, with only the threshold fitted for each individual.
In most simulation studies that examine the mismatch between the subject's psychometric function (ψ) and the assumed psychometric function (i.e. the model function, M), the authors use maximum-likelihood procedures, however, those results can be extrapolated to Bayesian procedures given that a Bayesian procedure using the mode of the posterior distribution as estimator and a uniform distribution as a priori distribution (instead of an informative distribution) is equivalent to the maximum likelihood estimation [1]. We therefore do not make a distinction between these approaches when discussing the literature.
One particularly important parameter of the psychometric function is the spread (σ), or equivalently, the slope of the model function. Slope and spread are inversely related; see S1 Appendix for details. Using a maximum-likelihood procedure, Green [4] (in a 2AFC task) found that the variability of the threshold estimate is lower when slope is underestimated, i.e. the model function was shallower (and its spread larger) than the observer's true psychometric function. In line with this finding, Treutwein [1] recommends underestimating the slope (i.e. overestimating the spread or the standard deviation) relative to subject's slope when using a dynamic stopping criterion. Emerson [2] found, using a maximum-likelihood procedure in a yes-no task, that threshold estimates were biased when the true slope was overestimated, whereas no bias was evident when the slope was underestimated. Madigan & Williams [3] (in a 2AFC task) compared Best PEST [8] and QUEST [7] using large lapse rates (20%). In general, they found that slope mismatches were not a serious problem for maximum-likelihood methods, however, they found that Best PEST procedure performs better (greater accuracy) when they overestimate the subject's slope. Interestingly, they found the opposite for QUEST: underestimation of the true slope yielded better accuracy. Using Bayesian staircases (in a yes-no task) Alcalá-Quintana & García-Pérez [5] found that underestimating the subject's slope (i.e. overestimating the spread) reduced the bias of the threshold estimation. This reduction was important if subject's and assumed psychometric functions were different (e.g. Weibull vs. Logistic), however, they claimed no bias when both psychometric functions were identical. In summary, almost all studies recommend underestimating the subject's slope (i.e. overestimating the spread) in order to get higher accuracy and smaller variance of the threshold estimates, but there are no recommendations about the degree of the underestimation needed to obtain thresholds with the smallest bias and standard deviation.
Our aim is to find the optimal slope (or spread) value in order to estimate stereo thresholds using a general Bayesian procedure. Our Bayesian procedure is described in the S1 Appendix (section A2). It assumes a uniform a priori distribution (non-informative prior); it uses the Logistic function for the model function; for selecting the stimulus intensity in each trial it computes the mean of the posterior distribution; and the threshold corresponds to the mean of the final distribution. This Bayesian configuration has the basic characteristics recommended from simulation studies [5,6,9,10]. In summary, it is similar to ZEST [6] but using a uniform a priori distribution. Monte Carlo simulations confirm that with this procedure, better results are obtained when the slope is underestimated than when it is overestimated. This is illustrated in Fig 1, which shows the simulation results for (A) 2AFC and (B) 4AFC tasks. Each panel shows the simulations for a different choice of the subject's spread value (σ S ).
The abscissa values correspond to the spread value of the model function (σ M ) used in the Bayesian procedure (larger spread values correspond to smaller slope values). Thus, the red dashed lines show the best case, where the assumed value is correct for the subject, σ M = σ S . For the conditions tested, the results show a stronger bias for 2AFC ( Fig 1A) than for 4AFC ( Fig 1B). In general, all panels show the classical result, overestimation of the subject's slope (smaller spread than the subject's spread) produces stronger bias than underestimation of the subject's slope (or overestimation of the spread). The vertical dashed red line signals the spread value of the subject. For both tasks and for almost all spread values of the subject, bias and SD rise more slowly to the right of this line than to the left. Thus, in agreement with previous results, underestimation of the slope (overestimation of the spread values) yields better results than overestimation [4].
However, Fig 1 also shows that thresholds become more biased and less reliable if the slope underestimation is too severe (i.e. σ M is too high). Thus, there is no unique model spread value (σ M ) that works well for all possible spreads of any observer (σ S ). However, it would be possible to find the optimal spread value if the distribution of spreads for the particular task is known. We could then use Monte Carlo simulations to estimate the optimal model spread (σ M ) that minimizes bias and maximizes precision for the given distribution of spreads in the population.
Simulations using adaptive Bayesian staircases. Each panel shows the mean threshold bias (estimated threshold-subject threshold) ± 1 SD, as a function of the spread (σ M ) of the model function. The subject threshold was fixed to be θ S = 1.5 (logarithmic value; 31.6 arcsec) that corresponds to the probability of correct responses of π = 0.75 for both tasks. Each panel shows the simulations for one particular spread of the subject (σ S = [0.5, 1.5, 2.5]), the vertical red-dashed lines signals the spread of the subject psychometric function. The horizontal red-dashed line marks the zero-bias value. A. Results for a 2AFC task with guess rate γ S = 0.5 and lapse rate λ S = 0.02. B. Results for a 4AFC task with guess rate γ S = 0.25 and lapse rate λ S = 0.03. Lapse and guess rates were the same for the subject and the model function. Each dot corresponds to 2000 simulations of a Bayesian staircase of 30 trials (see S1 Appendix). Note that a spread value of the model function lower than the real one (subject's spread) introduces bias in threshold estimation, positive bias in 2AFC and negative bias in 4AFC task (this is different if different probability of correct responses is chosen). See more details in the main text.
It is well known that there are Bayesian methods like MUEST [11], ZEST 2D [12], Psi [13], Psi-marginal [14] or QUEST+ [15] that can be used to estimate the threshold and the slope simultaneously (and some of them the lapse rate too), however this has a cost in terms of number trials needed or for example, when the slope parameter is unknown, MUEST (a multidimensional extension of QUEST to estimate threshold and the slope simultaneously) performs better than QUEST [7], on the other hand, if the slope parameter is known, QUEST performs better (i.e. lower threshold bias) than MUEST [11]. There are versions of ZEST 2D, Psi-marginal, or QUEST+ that can be modified to estimate only the threshold parameter treating the slope parameter as a nuisance parameter or marginalizing it, however, it is not known whether they will perform better than a well-known Bayesian method like ZEST with the optimal slope. Thus, in this paper, our objectives are: (i) to determine the optimal slope value for the model function in order to estimate reliable thresholds on stereoscopic disparity detection tasks when using Bayesian procedures; (ii) to quantify the threshold bias and imprecision expected for subjects with different values of spread. To this end, we have performed three different stereo experiments (two measuring global stereopsis and one measuring local stereopsis) in which we estimated the threshold, the spread and lapse rate of 260 participants using an adaptive weighted one-up one-down staircase in two tasks, 2AFC and 4AFC. We have performed Monte Carlo simulations using the distributions of the empirical spreads from those three experiments and we have estimated an optimal spread value for each experiment. (iii) Finally, in order to test the effectivity of using the optimal slope in a Bayesian procedure, we will compare a Bayesian method (ZEST with optimal σ) with other five Bayesian procedures performing Monte Carlo simulations using the empirical data (thresholds, slopes, and lapse rates) of each subject as model psychometric function and two tasks (2AFC and 4AFC).

Participants
We performed three experiments with different number of participants. In Experiment 1 (2AFC_global) we tested 79 participants (33 females) (aged from 4.6 to 61 years, mean = 19.5 years, SD = 14.05). In Experiment 2 (4AFC_global) we tested 82 participants (37 females) (aged from 5 to 21.8 years, mean = 10.02 years, SD = 3.14). In Experiment 3 (4AFC_local) we tested 99 participants (45 females) (aged from 4.75 to 36.08 years, mean = 10.09 years, SD = 4.86). All subjects completed 80 trials or more up to 120 trials. All subjects decided at the beginning of the experiment whether perform the experiment with 80 or with 120 trials. Adults and children were recruited at the Centre for Life science center in Newcastle upon Tyne, UK (http://www.life.org.uk/). Adult participants provided informed written consent. The parents or other accompanying adults of child participants provided written consent for children. The study protocol was compliant with the Declaration of Helsinki and was approved by the Ethics Committee of the Newcastle University Faculty of Medical Sciences (approval number 00625).

Equipment
Stimuli were presented on a 23-inch passive 3D monitor (D2367PH, AOC) that uses passive polarization with a spatial resolution of 1920 × 1080 pixels (52 cm × 29 cm) and a refresh rate of 60 Hz. Left and right images were separated by circular-polarization 3D glasses. Observers were seated at 90 cm from the monitor, so a pixel subtended 60.4 arcsec. To stabilize the observer's head and to control the observation distance a chin rest (UHCOTech HeadSpot, Houston, TX) was used. In the 2AFC task (Experiment 1), subjects indicated their response by pressing the left or right button of a standard computer mouse. In the 4AFC tasks Experiment nologies). The four corner buttons corresponded with the four spatial locations of the 4AFC (see Fig 2B). Data were collected on a DELL workstation (Intel(R) Core (TM) i3 CPU 540 @3.07GHz, 4GB RAM, 64-bit Operating System, Windows 7), with a GeForce GTX 460 graphics card (NVIDIA), running MATLAB R2012a, 64-bit (Mathworks). The experiments were programmed using Psychophysics Toolbox extensions [16][17][18].

Stimuli
All stimuli were generated using Matlab (R2012a, Mathworks). In Experiments 1 (2AFC_global) and 2 (4AFC_global), we designed the stimuli to measure global (or cyclopean) stereopsis. These experiments differed only in the number of alternatives. In Experiment 3 (4AFC_local), using a 4AFC task, we measured local (feature or contour-based) stereopsis. The stimulus used is a version of the Randot Circles stereotest (Stereo Optical Company, Inc., USA) commonly used in clinical screening. For Experiments 1 and 2, we used dynamic random dots of different colors on a black background (see Fig 2A and 2B). The color of the dots was generated by selecting the RGB values independently from a uniform distribution between minimum and maximum luminance. The position and color of the dots was random and updated every frame at 60 Hz. The dots were generated using the function "Screen('DrawDots')" of the Psychophysics Toolbox extensions [16][17][18]. Given that the stereoscopic presentation system uses line interleaving to dissociate left and right images, the dots appeared as ellipses, with a width of 10 pixels and a height of 20 physical pixels (10.07 × 20.14 arcmin). Subpixel disparities were achieved by using an antialiasing technique (see [19]). In Experiment 1, we used a 2AFC task and the target was a random dot stereogram of 8.4 × 8.4 deg with crossed disparities presented on top of a surround composed of random dots with uncrossed disparities (see Fig 2A). The target was presented on the left or right side of the screen at 9.3 degrees from the midline. In Experiment 2, we used a 4AFC task, and the target was also a random dot stereogram of 4.3 × 4.3 deg with crossed disparities presented in one of the four corners of the screen 6.5 deg away from the centre (see Fig  2B). In this experiment the target was located in the center a surrounding rectangle of random dots with uncrossed disparities (9.3 × 7.4 deg, width × height). Target and background had equal and opposite disparity relative to the screen. This procedure reduces monocular cues that could be present for high disparities [19]. The stimulus disparity was defined as the relative disparity between the target and background (see an anaglyph version in Fig 2D). Fig 2C shows an example of the stimulus used in Experiment 3. The stimulus consisted of four black circles on a square background filled with static white noise. Each square had a size of 5 deg and each circle a diameter of 2.65 deg. The circle was constructed using a 2D isotropic window with a Butterworth profile (see equation in Appendix A in [20]), with this luminance profile we can achieve subpixel disparities. We applied the disparity to one of the circles and the background was set to zero disparity (see anaglyph version in Fig 2D).

Procedure
In all experiments we used the same adaptive weighted one-up one-down staircase (see a simulation study about this type of staircase in [21]). The staircase started with a practice trial at a disparity of 3 log 10 arcsec (or 1000 arcsec). In the first trial of Experiments 1 and 2, in addition to the disparity, all target dots were presented in red at maximum luminance (see Fig 2A and  2B). This non-stereo color/luminance cue was added to the practice trial to ease understanding of the task. This cue was eliminated after the first trial; therefore, after the first trial the target could only be discriminated based on disparity. The staircase works in log 10 values of the disparity. After each correct response, log-disparity was reduced by subtracting 0.15 log 10 arcsec (i.e. disparity was divided by a factor of 1.41). Following each incorrect answer, log-disparity was increased by three times this value or 0.45 log 10 arcsec (i.e. disparity was increased by a factor of 2.81; see two examples of this staircase in Fig 3, left panels). In total, 94% of the participants completed 80 trials and 6% 120 trials. No feedback about correctness of responses was provided. The stimulus on each trial was presented until the participants made a response, thus the experiment proceeded at a pace determined by the participant. All experiments were carried out in a dimmed area at the Centre for Life.

Fitting procedure
First, we computed the proportion of correct responses observed for each disparity presented to the subject. Then, for each subject we fitted the Logistic psychometric function (described in S1 Appendix; red lines on the right panels of Fig 3) to the proportion of correct responses by the method of maximum likelihood, using the function "fminsearch" of Matlab. This is the same function that will be used as a model function in the Bayesian staircases. We fitted the psychometric function with three free parameters, threshold, θ; spread, σ; and lapse rate, λ, with constrains θ>0, σ>0, and 0<λ<0.06. The guess parameter γ was fixed to 0.5 for 2AFC and 0.25 for 4AFC. We established the disparity threshold as the value that corresponds to the probability of correct responses of π = 0.75 for both, 2AFC and 4AFC tasks. We assessed the goodness of fit of the psychometric function using the likelihood ratio or deviance (Dev) described in equation 5 of Wichmann & Hill [22]. We fixed a priori the ranges for valid fittings, θ �[1, 500] arcsec; σ �[0.01, 7]; and the probability p(Dev) associated to the deviance p (Dev)>0.05. Those subjects that didn't meet these criteria were discarded from the analysis. The lower value of the range for σ (0.01) was chosen taking into account that previous simulation studies, with this type of adaptive staircase, have shown spread estimation errors (spread values close to 0) for a small number of trials (see Fig 4 of [21]).  averages of the spreads (σ) for each Experiment 1 were, in log 10 arcsec: � s 2AFC_g = 1.005 ± 0.776 (mean ± SD); � s 4AFC_g = 1.186 ± 0.74; � s 4AFC_l = 1.306 ± 0.632. A spread of 1 log 10 arcsec means that 95% of the range of the psychometric function occurs over a factor of 10 in disparity. For example, if performance is 2.5% above chance when disparity is 10 arcsec, then with σ = 1 log 10 arcsec it will be 2.5% below maximum when disparity is 100 arcsec. Conversely, a spread of 1.3 log 10 arcsec corresponds to a factor of 20.

Determination of the optimal slope value
In this section we describe the procedure for obtaining the optimal slope or spread (σ M ) value of the model function in order to measure disparity thresholds with small bias and standard deviation when using Bayesian staircases. To do this, we are going to use the spreads, the thresholds, and the lapse rates from the three stereo experiments, two types of task (2AFC and 4AFC), and two types of stereopsis (global and local). The procedure is described in Fig 7. For each experiment we determined 19 different percentiles (from 5 to 95% in steps of 5%) from the distribution of spreads (see example for percentiles 25 and 75% in Fig 7). This gave us 19 different spread values from each experiment. Then, we used each spread value as the spread of the model function (σ M ,) to run Monte Carlo simulations. For these simulations we used the empirical data (using subject's thresholds θ S , spreads σ S , and lapse rates λ S ) from our experiments in order to model the subject's psychometric function. In the simulations, the subject responses were replaced by a pseudorandom binary number generator in which the probability of a correct response was read off from the modelled subject's psychometric function evaluated at the disparity presented on each trial. For each subject we run 2000 simulations using Bayesian staircases of 30 trials (see characteristics of the procedure in S1 Appendix (section A2)). We computed the threshold bias subtracting the estimated threshold and the empirical subject's threshold (ŷ À y S ) (Bayesian staircases converged into the probability 0.75 as in the psychophysical experiments) and then, we computed the mean of the 2000 bias estimations. Finally, we average the means bias for all participants (e.g. for Experiment 1, 71 subjects). Thus, one single dot represented in the bottom panel of Fig 7 (or Fig 8A) was obtained by running 142,000 staircases, made up of 1 percentile (i.e. one spread, σ M ) × 71 subjects × 2000 Bayesian staircases. condition the only difference between the model function and the subject's psychometric function was the spread value; B) grey squares, representing the more realistic condition where the individual lapse rates are unknown and so results are obtained for a fixed lapse rate of the model function. For 2AFC we used λ M = 0.02 and for 4AFC, λ M = 0.03 (both corresponding to λ � = 0.04). We used higher lapse rates than the empirical ones because this produces lower bias and standard errors when using Bayesian staircases [5]. Fig 8A shows the results for Experiment 1 (2AFC_g). In general, threshold estimations are less biased when the model function uses fixed lapse rate (gray squares), but this is only evident for small spread values (i.e. large slope values). The results show a clear pattern: threshold bias and standard deviation are reduced with increasing spread values (i.e. reducing slope values) of the model function. In order to decide what is the optimal spread, we compute the absolute value of the product of the bias and the standard deviation. The vertical dashed lines mark the spread (σ M ) where this product is minimum. In this case, for the 2AFC task, the spread that will produce smaller threshold bias and standard deviation when using a Bayesian staircase for both conditions (fixed lapse rate and subject's lapse rate) is σ M = 1.701 log 10 arcsec (percentile 80%).  Efficient estimation of stereo thresholds Fig 8B shows the results for Experiment 2 (4AFC_g). In this case, we found an opposite result to the one presented in Fig 8A: this is, threshold estimations are more biased when the model function uses fixed lapse rate (gray squares), but again, this is only evident for small spread values (i.e. large slope values). For fixed lapse rate the optimal spread was σ M = 1.496 log 10 arcsec (percentile 70%); and when we use the subject's lapse rate, it is σ M = 1.376 log 10 arcsec (percentile 65%). Note that in practical terms this condition is very difficult to happen in empirical studies, we usually don't know the value of the subject's lapse rate. However, this simulation tells us that the percentiles for optimal spread are almost independent of the subject's lapse rate. Fig 8C shows the results for Experiment 3 (4AFC_l). In this case, as in Experiment 2 (4AFC_g), threshold estimations are more biased when the model function uses fixed lapse rate (gray squares). For fixed lapse rate the optimal spread was σ M = 1.463 log 10 arcsec (percentile 60%); and for the subject's lapse rate it is σ M = 1.605 log 10 arcsec (percentile 70%). We also found very similar results (results not shown) using Bayesian staircases of 100 trials and also using a probability of correct responses of 0.7.

Results of the simulations
In summary, the simulations using the empirical data of the experiments show that threshold bias and standard deviation are reduced when spread values correspond to percentiles of the distribution of spreads between 60 and 80%. These percentiles do not depend on the lapse rate of the subjects. Selecting spreads below the percentile 60% introduce bias in all Experiments and spreads higher than 80% introduce a small bias in the Experiments using 4AFC tasks (see left part of Fig 8B and 8C).

Comparing our procedure with different Bayesian adaptive procedures
In addition to the Bayesian adaptive procedures which simultaneously estimate threshold and the slope of the psychometric function, for example, ZEST 2D [12], MUEST [11], or Psi [13], recently adaptive procedures have been developed that allow us to estimate even more parameters (threshold, slope, and lapse rate) of the psychometric function, such as Psi-marginal [14] or QUEST+ [15]. However, to estimate two or more parameters in an efficient way (small bias and standard deviation) requires more trials than to estimate one single parameter. In clinical applications this can be problematic. For example, our laboratory is designing a stereo test to be run on a 3D tablet [31,32] in order to measure the stereoacuity in small children, so it is fundamental to obtain efficient stereo thresholds with a small number of trials.
In this section, our objective is to compare our proposed procedure that we will call "ZEST optimal σ" with five Bayesian adaptive procedures using Monte Carlo simulations. We tested different versions of the ZEST 2D, Psi, and Psi-marginal procedures. We tested two versions of the ZEST 2D; in one version we estimated two parameters (threshold and slope; "ZEST 2D θ, σ"); in the other version we only estimate the threshold ("ZEST 2D θ"). We tested the original Psi method to estimate threshold and slope (Psi θ,σ) and we also tested three configurations of the Psi-marginal: one configuration where we estimated the threshold and the slope while marginalizing the lapse rate (Psi θ,σ(λ)), a second configuration where we estimated the threshold only which marginalizing the slope and the lapse rate (Psi θ,(σ,λ)), and a third configuration where we estimated the threshold, marginalize the slope, and fixed the lapse rate (Psi θ(σ)). The details of these seven procedures are described in S1 Appendix (section A3).
In order to compare all Bayesian procedures, we have performed two types of Monte Carlo simulations: a) using a single model assuming a standard observer based in our data (see Figs 9 and 10) and, b) using different models with the parameters of the empirical data (thresholds, slopes, and lapses) obtained from the experimental conditions 2AFC_g and 4AFC_g (see Figs 11 and 12).
We also have performed simulations for 2AFC and 4AFC tasks. For each task we have used two types of prior probability distribution of slopes, the uniform distribution (i.e. non-informative distribution) and a Gaussian distribution based in our data. Although there are hierarchical adaptive approaches like HADO [33] that allow us to construct informative priors based on data of observers that have previously performed the tasks, in order to simplify the comparisons and use the same priors for each method, we have assumed a Gaussian distribution with mean and standard deviation obtained from the data shown in Fig 4A (2AFC_g)    , and threshold of θ S = 1.5 log 10 arcsec. We have chosen these parameters because, according to our data, they represent rather well a standard subject in a stereoacuity task (see these two psychometric functions in Fig 1A, right panel). Note that our ZEST optimal σ makes incorrect assumptions about this model observer: it assumes a spread of 1.5 log 10 arcsec, not the true value of 1 log 10 arcsec. The left panels of Figs 9 and 10 show the mean of the threshold bias (ŷ À 1:5), and the right panels show the standard deviation of the distribution of threshold bias. Each line corresponds to 1000 simulations of each Bayesian procedure for different number of trials, thus, the mean and the SD for each trial was obtained from 1000 threshold bias. Results from Figs 9 and 10 show that the threshold bias is smaller for 4AFC than 2AFC for the same number of trials, in agreement with previous results [34]. Fig 9 shows that all procedures increase in precision (i.e. lower threshold bias) with increasing number of trials. For 2AFC task, our suggested procedure (black line, ZEST optimal σ) shows the smallest bias and smaller SD in general and for small number of trials (between 20 and 50) in particular. For 4AFC task, for trials greater than 30, all procedures give similar results where Psi θ,σ(λ) (pink line) shows the worst performance in terms of threshold bias, and Psi θ(σ) (brown line) shows the smallest threshold bias. Note that the black lines corresponding to Zest optimal σ are the same for Figs 9 and 10 given that we are not using a prior distribution for β for this method. Note also that in empirical testing, the SD of threshold's estimates obtained with Bayesian staircases longer than 70 trials may not decrease with the increasing number of trials [35]. Fig 10, when a uniform distribution is assumed for the slope, shows that for both 2AFC and 4AFC our procedure (black line) shows the lowest threshold bias and SD for trials greater than 20. Other good options would be the method ZEST 2D θ (green line) and Psi θ,(σ,λ) (cyan line). Note that, although the method ZEST 2D θ,σ (red) shows small threshold bias, the SD is the highest of all procedures.
In Figs 9 and 10, for ZEST optimal σ, we chose parameters such that the model function underestimated the slope of the simulated observer and assumed the correct lapse rate. We also assumed the correct lapse rate for ZEST 2D θ, ZEST 2D θ,σ, Psi θ,σ, and (Psi θ(σ)). These choices may favor these approaches over the configurations of Psi-marginal Psi θ,σ(λ) and Psi θ, (σ,λ). In real life we will encounter observers with unusually low slopes and with lapse rates different from those we assumed. To assess performance under these conditions, Figs 11 and 12 show the simulation results using different modelled subjects with the parameters of the empirical data (thresholds, slopes, and lapses) obtained from the experimental conditions 2AFC_g and 4AFC_g. Fig 11 shows  The left panels of Figs 11 and 12 show the average of the threshold bias of all subjects. For each subject (out of 71 for 2AFC_g and 68 for 4AFC_g) we have run 1000 simulations for different number of trials (between 10 and 100 and between 10 and 500 for Fig 12). The right panels show the standard deviation. Note that, like in Figs 9 and 10, the black lines corresponding to Zest optimal σ are the same for Figs 11 and 12 given that we are not using a prior distribution for β for this method. Fig 11 show that ZEST optimal σ (black line) has the lowest threshold bias for both tasks and is constant from trials higher than 20. The rest of the procedures show a higher threshold bias that is very pronounced at small number of trials. The procedure ZEST optimal σ also shows the smaller standard deviation for all number of trials tested. Fig 12, when assuming a uniform distribution for the slope, shows similar results although in this case, for the 4AFC task, the procedure ZEST 2D θ (green line) also shows very small bias and standard deviation. For small number of trials (< 30) ZEST optimal σ is recommended. Thus, in general, if the number of trials is small and we have knowledge about the distribution of slopes, we recommend the procedure "ZEST optimal σ", however, taking into account that the distribution of the slopes in an experiment is probably unknown, the procedure ZEST 2D θ is also a good candidate if the experimenter is only interested in estimating thresholds and assumes the uniform prior distribution for the spreads, and the procedure Psi θ(σ) is also a good candidate for 4AFC tasks if the experimenter is only interested in the threshold. Fig 13 shows the simulations results presented in Fig 12 but from trials 100 to 500 to easily compare the methods for a high number of trials. Results show that threshold bias is negligible from 200 trials being stronger for Psi (blue line) and Psi-marginal (cyan and pink line) in 2AFC and ZEST 2D θ,σ (red line) for 4AFC. In terms of standard deviation, for 2AFC, our procedure ZEST optimal σ and ZEST 2D θ show the smallest standard deviation followed by Psi-methods and ZEST 2D θ,σ. For 4AFC, again, ZEST optimal σ and ZEST 2D θ show the smallest standard deviation. Psi θ,(σ,λ) (cyan line) shows very similar standard deviation between 100 and 300 trials. In general, for 4AFC, both versions of Psi-marginal shown here, work better than the classic Psi method. It is important to note that ZEST optimal σ still shows smaller bias and standard deviation than the Psi-marginal procedures tested here because we are using the optimal spread for the population, however, these Psi-marginal procedures are very good when we have no idea about the distribution of slopes, and we can run more than 100 trials in the experiment. Finally, ZEST 2D θ,σ, showed smaller bias than Psi procedures although higher standard deviation.

Discussion
Parametric procedures like Bayesian staircases require a priori decisions about different parameters of the assumed psychometric function or model function (e.g. shape, slope, convergence probability and lapse rate). Simulation studies have shown that mismatches between Efficient estimation of stereo thresholds the subject psychometric function and the model function produce threshold bias [3][4][5]. In this study, our aim was to find the optimal spread value (inversely related to the slope) of the model function in order to estimate disparity thresholds with the smallest bias and standard deviation. In particular, we wanted to find this value for measuring disparity thresholds for global stereopsis (or cyclopean) and local stereopsis (feature or contour-based). We also wanted to know if this value changes with the psychophysical task (i.e. 2AFC vs. 4AFC). To achieve this, we needed to know the distribution of real slopes (i.e. spreads) for the general population for these tasks and types of stereopsis. Therefore, we ran three experiments, 2AFC_g with global stereopsis, 4AFC_g with global stereopsis, and 4AFC_l with local stereopsis where we tested in total 260 participants. For each participant we ran an adaptive weighted one-up one-down staircase procedure with 80-120 trials and we fitted a logistic psychometric function (described in the S1 Appendix (section A1)) to the probabilities of correct detection in order to estimate stereoacuity thresholds, spread, and lapse rates. Results from these experiments are shown in Figs 4, 5 & 6. We found that disparity thresholds were around 1.5 arcsec for all experiments. With regard to the spread values (inversely related with the slope), we found differences between 2AFC_g and 4AFC_l, with spread values smaller for 2AFC_g. No significant differences were found between the two 4AFC experiments. We also found that for the three experiments, the spread values did not correlate with age. Finally, an analysis of the lapse rates showed a similar probability of making a lapse in all experiments (between 0.02 and 0.03, a percentage of 2-3%).
We have developed a new method in order to estimate the optimal slope or spread (see Fig  7), where we simulate millions of Bayesian staircases using the empirical spreads, thresholds and lapse rates of our participants as model subject's psychometric functions. The spreads used in the assumed psychometric function or model function corresponded to percentiles (between 5 and 95% in steps of 5%) of the empirical distributions of spreads. Fig 8 shows the results of the simulations. These results show a clear pattern: bias and errors (i.e. standard deviation) are reduced when the spread value is located between the percentile 60-80% of the distributions of spreads. These percentiles work for three different distributions of spreads. We simulated two conditions, one where the model lapse rate was the same as the subject, and the other (the more realistic one) where the lapse rate was fixed to a particular value. We found that the two conditions differ only when the spread value is very small; however, they are very similar when the spreads are higher than the 60% percentile of the distribution of spreads. Thus, for the first time, we have been able to estimate the optimal spread values for different types of stereopsis (e.g global and local) and tasks (e.g. 2AFC and 4AFC). These values can be used with Bayesian staircases to estimate thresholds with the smallest bias and standard deviation independently of the age of the participant.
It is important to note that there are Bayesian methods that can be used to estimate the threshold and the slope simultaneously, or just the threshold but leaving the slope as a nuisance parameter [11][12][13][14][15]. Here, running Monte Carlo simulations with the empirical data as modelled subjects, we have compared our suggested procedure (ZEST using the optimal slope, "ZEST optimal σ") with six Bayesian methods, two based on the ZEST 2D [12], the Psi method [13], and three configurations of Psi-marginal [14]. Our simulation results (see Figs 11 and 12) show that, in general, ZEST optimal σ has lower threshold bias and standard deviation for 2AFC and 4AFC tasks than the rest of the Bayesian procedures tested. This difference in threshold bias is very pronounced at small number of trials. For example, Figs 10 (4AFC) and 12 (4AFC) shows that Psi methods have a strong bias for small number of trials (between 20 and 30), meaning that they systematically tend to overestimate thresholds by a factor of 1.38-1.45 (e.g. a threshold of 100 arcsec will be incorrectly estimated as 145 arcsec). In addition, the standard deviation for 20-30 trials is 0.04-0.02, corresponding to multiplying or dividing by a factor of 1.05-1.1. For Zest optimal σ in the range 20-30 trials, the bias factor is 0.998 and the standard deviation range is 0.017-0.013, that corresponds to a factor of 1.03-1.04. We have to acknowledge that the simulations were performed using the same distributions that were used to estimate the optimal slope, so it is expected to behave with the smallest threshold bias, however, it is also expected that is the distribution of spreads is approximately known, the use of ZEST with the optimal σ will produce unbiased thresholds.
Thus, if the number of trials used for estimating the threshold is small (< 30), we recommend using ZEST optimal σ. This procedure is currently being used in a 3D software dedicated to estimate the stereoacuity (ASTEROID, [31,32]). However, if the optimal slope is unknown, and the experimenter assumes a uniform distribution for the slope, our results shows that the procedure ZEST 2D θ is a very good candidate followed by Psi-marginal with the configuration Psi θ(σ) or Psi θ,(σ,λ).
Our recommendation for choosing the optimal spread is to overestimate the subject's spread (i.e. underestimate the subject's slope). If the distribution of spreads is known, choose a spread value between the percentile 60 and 80% approximately. We have shown that these percentiles work for three different experiments with different distributions of spreads, thresholds, and lapse rates, thus, our results could be extended to other type of thresholds if the distributions of spreads are known. However, we recommend using the procedure described in our Fig 7 to get more precision. For our particular stereoacuity experiments, we recommend using ZEST (as described in S1 Appendix (section A2)), assume a cumulative logistic distribution for the model function, use fixed lapse rates, and the spread value of the model function around the value σ M = 1.7 log 10 arcsec (corresponding to a factor of 50 in arcsec, or a slope of β M = 4.3 /log 10 arcsec) for 2AFC experiments, and around σ M = 1.5 log 10 arcsec (a factor of 32 in arcsec, β M = 4.9 /log 10 arcsec) for 4AFC.

Limitations of the study
These recommendations are of course not perfect; for example, they neglect the possible effects of changes in an observer's psychometric function during the course of the experiment, or spatial biases towards a particular location. Further work would be needed to assess these. Another limitation of our study is that for the simulations we are only using three distributions of spreads (also three distributions of thresholds and lapse rates), obtained with a psychophysical method that is far from perfect (i.e. more trials would be needed to obtain a better estimation of the parameters, increasing the range of age of the participants would be recommended). Therefore, these distributions do not reflect the exact distributions of spreads, thresholds, and lapse rates. However, given that simulations for the three experiments shows similar results our recommendations would be useful in this field in order to estimate stereo thresholds. Future directions should include simulations assuming different shapes for the distribution of spreads in combination with different distributions of thresholds and lapse rates, will give a more precise answer and more generalizable recommendations to other visual modalities.
Supporting information S1 Appendix. Description of the psychometric function used in this research and the Bayesian procedures used in the simulations. (PDF) S1 Dataset. The data presented in this paper. (XLSX)