Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling

The fate of scientific hypotheses often relies on the ability of a computational model to explain the data, quantified in modern statistical approaches by the likelihood function. The log-likelihood is the key element for parameter estimation and model evaluation. However, the log-likelihood of complex models in fields such as computational biology and neuroscience is often intractable to compute analytically or numerically. In those cases, researchers can often only estimate the log-likelihood by comparing observed data with synthetic observations generated by model simulations. Standard techniques to approximate the likelihood via simulation either use summary statistics of the data or are at risk of producing severe biases in the estimate. Here, we explore another method, inverse binomial sampling (IBS), which can estimate the log-likelihood of an entire data set efficiently and without bias. For each observation, IBS draws samples from the simulator model until one matches the observation. The log-likelihood estimate is then a function of the number of samples drawn. The variance of this estimator is uniformly bounded, achieves the minimum variance for an unbiased estimator, and we can compute calibrated estimates of the variance. We provide theoretical arguments in favor of IBS and an empirical assessment of the method for maximum-likelihood estimation with simulation-based models. As case studies, we take three model-fitting problems of increasing complexity from computational and cognitive neuroscience. In all problems, IBS generally produces lower error in the estimated parameters and maximum log-likelihood values than alternative sampling methods with the same average number of samples. Our results demonstrate the potential of IBS as a practical, robust, and easy to implement method for log-likelihood evaluation when exact techniques are not available.


Introduction
The likelihood function is one of the most important mathematical objects for modern statistical inference. Briefly, the likelihood function measures how well a model with a given set of parameters can explain an observed data set. For a data set of discrete observations, the likelihood has the intuitive interpretation of the probability that a random sample generated from the model matches the data, for a given setting of the model parameters.
In many scientific disciplines, such as computational neuroscience and cognitive science, computational models are used to give a precise quantitative form to scientific hypotheses and theories. Statistical inference then plays at least two fundamental roles for scientific discovery. First, our goal may be parameter estimation for a model of interest. Parameter values may have a significance in themselves, for example we may be looking for differences in parameters between distinct experimental conditions in a clinical or behavioral study. Second, we may be considering a number of competing scientific hypotheses, instantiated by different models, and we want to evaluate which model 'best' captures the data according to some criteria, such as explanation (what evidence the data provide in favor of each model?) and prediction (which model best predicts new observations?).
Crucially, the likelihood function is a key element for both parameter estimation and model evaluation. A principled method to find best-fitting model parameters for a given data set is maximum-likelihood estimation (MLE), which entails optimizing the likelihood function over the parameter space (Myung, 2003). Other common parameter estimation methods, such as maximum-a-posteriori (MAP) estimation or full or approximate Bayesian inference of posterior distributions, still involve the likelihood function (Gelman et al., 2013). Moreover, almost all model comparison metrics commonly used for scientific model evaluation are based on likelihood computations, from predictive metrics such as Akaike's information criterion (AIC; Akaike, 1974), the deviance information criterion (DIC; Spiegelhalter et al., 2002), the widely applicable information criterion (WAIC; Watanabe, 2010), leave-one-out cross-validation (Vehtari et al., 2017); to evidence-based metrics such at the marginal likelihood (MacKay, 2003) and (loose) approximations thereof, such as the Bayesian information criterion (BIC; Schwarz et al., 1978) or the Laplace approximation (MacKay, 2003).
However, many complex computational models, such as those developed in computational biology (Pritchard et al., 1999;Ratmann et al., 2007;Wilkinson, 2011), neuroscience (Pospischil et al., 2008;Sterratt et al., 2011) and cognitive science (van Opheusden et al., 2016), take the form of a generative model or simulator, that is an algorithm which given some context information and parameter settings returns one or more simulated observations (a synthetic data set). In those cases, the likelihood is often impossible to calculate analytically, and even when the likelihood might be available in theory, the numerical calculations needed to obtain it might be overwhelmingly expensive and intractable in practice. In such situations, the only thing one can do is to run the model to simulate observations ('samples'). In the absence of a likelihood function, common approaches to 'likelihood-free inference' generally try and match summary statistics of the data with summary statistics of simulated observations (Beaumont et al., 2002;Wood, 2010).
In this paper, we ask instead the question of whether we can use samples from a simulator model to directly estimate the likelihood of the full data set, without recurring to summary statistics, in a 'correct' and 'efficient' manner, for some specific definition of these terms. The answer is yes, as long as we use the right sampling method.
In brief, a sampling method consists of a 'sampling policy' (a rule that determines how long to keep drawing samples for) and an 'estimator' which converts the samples to a real-valued number. To estimate the likelihood of a single observation (e.g., the response of a participant on a single trial of a behavioral experiment), the most obvious sampling policy is to draw a fixed amount of samples from the simulator model, and the simplest estimator is the fraction of samples that match the observation (or is 'close enough' to it, for continuous observations). However, most basic applications, such as computing the likelihood of multiple observations, require one to estimate the logarithm of the likelihood, or log-likelihood (see Section 2.3 for the underlying technical reasons). The 'fixed sampling' method described above cannot provide unbiased estimates for the log-likelihood (see Section 3). Such bias vanishes in the asymptotic limit of infinite samples, but drawing samples from the model can be computationally expensive, especially if the simulator model is complex. In practice, the bias introduced by any fixed sampling method can translate to considerable biases in estimates of model parameters, or even reverse the outcome of model comparison analyses. In other words, using poor sampling methods can cause researchers to draw conclusions about scientific hypotheses which are not supported by their data.
In this work, we introduce inverse binomial sampling (IBS) as a powerful and simple technique for correctly and efficiently estimating log-likelihoods of simulator-based models. Crucially, IBS is a sampling method that provides uniformly unbiased esti-mates of the log-likelihood (Haldane, 1945b;de Groot, 1959) and calibrated estimates of their variance, which is also uniformly bounded.
We note that the problem of estimating functions f (p) from observations of a Bernoulli distribution with parameter p has been studied for mostly theoretical reasons in the mid-20th century, with major contributions from Haldane (1945a,b), Girshick et al. (1946), Dawson (1953) and de Groot (1959). These works have largely focused on deriving the set of functions f (p) for which an unbiased estimate exists, and demonstrating that for those functions, the inverse sampling policy (see Section 2.4) is in a precise sense 'efficient'. Our main contribution here is to demonstrate that inverse binomial sampling provides a practically and theoretically efficient solution for a common problem in computational modeling; namely likelihood-free inference of complex models. To back up our claims, we provide theoretical arguments for the efficiency of IBS and a practical demonstration of its value for log-likelihood estimation and fitting of simulation-based models, in particular those used in computational cognitive science. 1 The paper is structured as follows. After setting the stage with useful definitions and notation (Section 2), we describe more in detail the issues with the fixed sampling method and why they cannot be fixed (Section 3). We then present a series of arguments for why IBS solves these issues, and in particular why being unbiased here is of particular relevance (Section 4). Then, we present an empirical comparison of IBS and fixed sampling in the setting of maximum-likelihood estimation (Section 5). As case 1 At time of submission, we became aware of a workshop paper with a similar goal of proposing inverse binomial sampling as a method for likelihood-free inference in certain econometric models (Duncan, 2004). However, the paper does not present an empirical assessment of the quality of the estimation and to our knowledge has not led to further adoption of IBS. 6 studies, we take three model-fitting problems of increasing complexity from computational cognitive science: an 'orientation discrimination' task, a 'change localization' task, and a complex sequential decision making task. In all problems, IBS generally produces lower error in the estimated parameters than fixed sampling with the same average number of samples. IBS also returns solutions that are very close in value to the true maximum log-likelihood. We conclude by discussing further applications and extensions of IBS (Section 6). Our theoretical analyses and empirical results demonstrate the potential of IBS as a practical, robust, and easy-to-implement method for log-likelihood evaluation when exact or numerical solutions are unavailable.
Implementations of IBS with tutorials and examples are available at the following link: https://github.com/lacerbi/ibs.

Definitions and notation
The two fundamental ingredients to run IBS are: consisting of N 'trials' characterized by 'stimuli' s i and discrete 'responses' r i .
2. A generative model g for the data (also known as a 'simulator'): a stochastic function that takes as input a stimulus s and a parameter vector θ (and possibly other information) and outputs a response r.
In this section we expand on and provide motivations for the above assumptions, and introduce related definitions and notation used in the rest of the paper.
Here and in the following, for ease of reference, we use the language of behavioral and cognitive modeling (e.g., 'trial' for data points, 'stimulus' for independent or contextual variables, 'response' for observations or outcomes), but the statistical techniques that we discuss in the paper apply to any model and data set from any domain as long as they satisfy the fundamental assumption of IBS delineated above.

The likelihood function
We assume that we want to model a data set D = {(s i , r i )} N i=1 consisting of N 'trials' (data points), where • s i is the stimulus (i.e., the experimental context, or independent variable) presented on the i-th trial; typically, s i is a scalar or vector of discrete or continuous variables (more generally, there are no restrictions on what s i can be as long as the simulator can accept it as input); • r i is the response (i.e., the experimental observations, outcomes, or dependent variables) measured on the i-th trial; r i can be a scalar of vector, but crucially we assume it takes discrete values.
The requirement that r i be discrete will be discussed below, in Section 2.2.
Given a data set D, and a model parametrized by parameter vector θ, we can write the likelihood function as Pr (r i |r 1 , . . . , r i−1 , s 1 , . . . , s N , θ) Pr (r i |r 1 , . . . , r i−1 , s 1 , . . . , s i , θ) 8 where the first line follows from the chain rule of probability, and holds in general, whereas in the second step we applied the reasonable 'causal ' (or 'no-time-travel') assumption that the response at the i-th trial is not influenced by future stimuli. 2 Equation 1 describes the most general class of models, in which the response in the current trial might be influenced by the history of both previous stimuli and previous responses. Many models commonly make a stronger conditional independence assumption between trials, such that the response on the current trial only depends on the current stimulus. Under this stronger assumption, the likelihood takes a simpler form, While Equation 2 is simpler, it still includes a wide variety of models. For example, note that time-dependence can be easily included in the model by incorporating time into the 'stimulus' s, and including time-dependent parameters explicitly in the model specification. In the rest of this work, for simplicity we consider models that make conditional independence assumptions as in Equation 2, but our techniques apply in general also for likelihoods as per Equation 1. Given that the likelihood of the i-th trial can be directly interpreted as the probability of observing response r i in the i-th trial (conditioned on everything else), we denote such quantity with p i ∈ [0, 1]. The value p i is a function of θ, depends on the current stimulus and response, and may or may not depend on previous stimuli or responses.
2 We also used the causality assumption that current responses are not influenced by future responses to choose a specific order to apply the chain rule in the first line.
With this notation, we can simply write the likelihood as Finally, we note that for numerical convenience, given that N can be large (in the hundreds if not thousands or more), it is common practice to work with the logarithm of the likelihood, or log-likelihood, that is Crucially, we assume that the likelihood function is unavailable in a tractable form -for example, because the model is too complex to derive an analytical expression for the likelihood. Instead, IBS provides a technique for estimating Equation 4 via simulation.

The generative model or simulator
While we assume no availability of an explicit representation of the likelihood function, we assume that the model of interest is represented implicitly by a stochastic generative model (or 'simulator'). In the most general case, the simulator is a stochastic function g that takes as input the current stimulus s i , arrays of past stimuli and responses, and a parameter vector θ, and outputs a discrete response r i , conditional on all past events, As mentioned in the previous section, a common assumption for a model is that the response in the current trial only depends on the current stimulus and parameter vector, in which case r ∼ g(s; θ).
For example, the model g(·) could be simulating the responses of a human participant in a complex cognitive task; the (discrete) choices taken by a rodent in a perceptual decision-making experiment; or the spike count of a neuron in sensory cortex for a specific time bin after a stimulus presentation.
We list now the requirements that the simulator model needs to satisfy to be used in conjuction with IBS.

Discrete response space
Lacking an expression for the likelihood function, the only way to estimate the likelihood or any function thereof is by drawing samples r ∼ g (s i , . . . ; θ) on each trial, and matching them to the response r i . This approach requires that there is a nonzero probability for a random sample r to match r i , hence the assumption that the space of responses is discrete. We will discuss in Section 6.3 a possible method to extend IBS to larger or continuous response spaces.

Conditional simulation
An important requirement of the generative model, stated implicitly by Equations 5 and 6, is that the simulator should afford conditional simulation, in that we can simulate the response r i for any trial i, given the current stimulus s i , and possibly previous stimuli and responses. Note that this class of models, while large, does not include all possible simulators, in that some simulators might not afford conditional generation of responses. For example, models with latent dynamics might be able to generate a full sequence of responses given the stimuli, but it might not be easy (or even possible) to generate the response in a given trial, conditional on a specific sequence of previous responses.

Computational cost
Finally, for the purpose of some of our analyses we assume that drawing a sample from the generative model is at least moderately computationally expensive, which limits the approximate budget of samples one is willing to use for each likelihood evaluation (in our analyses, up to about a hundred, on average, per likelihood evaluation). Number of samples is a reasonable proxy for any realistic resource expenditure since most costs (e.g., time, energy, number of processors) would be approximately proportional to it.
Therefore, we also require that every response value in the data has a non-negligible probability of being sampled from the model -given the available budget of samples one can reasonably draw. In this paper, we will focus on the low-sample regime, since that is where IBS considerably outperforms other approaches. For our analyses of performance of the algorithm, we also assume that the computational cost is independent of the stimulus, response or model parameters, but this is not a requirement of the method.

Reduction to Bernoulli sampling
Given the conditional independence structure codified by Equation 3, to estimate the log-likelihood of the entire data set, we cannot do better than estimating p i on each trial independently, and combining the results. However, combining estimatesp i into an estimate of N i=1 p i is non-trivial. Instead, it is easier to estimate L i ≡ log p i for each trial and calculate the log-likelihood We can estimate this log-likelihood by simply summing estimatesL i across trials, in which case the central limit theorem guarantees that the distribution ofL (θ) is normally distributed for large values of N , which is true for typical values of N of the order of a hundred or more (see also Section 4.6).
We can make one additional simplification, without loss of generality. The generative model specifies an implicit probability distribution r i ∼ g(s i , . . . ; θ) for each trial. However, to estimate the log-likelihood, we do not need to know the full distribution, only the probability for a random sample r from the model to match the observed response r i . Therefore, we can convert each sample r to and lose no information relevant for estimating the log-likelihood. By construction, x follows a Bernoulli distribution with probability p i . Note that this holds regardless of the type of data, the structure of the generative model or the model parameters. The only difference between different models and data sets is the distribution of the likelihood p i across trials. Moreover, since p i is interpreted as the parameter of a Bernoulli distribution, we can apply standard frequentist or Bayesian statistical reasoning to it.
In conclusion, we can reduce the problem of estimating the log-likelihood of a given model by sampling to a smaller problem: given a method to draw samples (x 1 , x 2 , . . . ) from a Bernoulli distribution with unknown parameter p, estimate log p as precisely and accurately as possible using on average as few samples as possible.

Sampling policies and estimators
A sampling policy is a function that, given a sequence of samples x ≡ (x 1 , x 2 , . . . , x k ), decides whether to draw an additional sample or not (Girshick et al., 1946). In this work, we compare two sampling policies: 1. The commonly used fixed policy: Draw a fixed number of samples M , then stop.
2. The inverse binomial sampling policy: Keep drawing samples until x k = 1, then stop.
In our case, an estimator (of log p) is a functionL(x) that takes as input a sequence of samples x = (x 1 , x 2 , . . . , x k ) and returns an estimate of log p. We recall that the bias of an estimatorL of log p, for a given true value of the Bernoulli parameter p, is defined as where the expectation is taken over all possible sequences x generated by the chosen sampling policy under the Bernoulli probability p. Such estimator is (uniformly) unbiased if Bias L |p = 0 for all 0 < p ≤ 1 (that is, the estimator is centered around the true value).

Fixed sampling
For the fixed sampling policy, since all samples are independent and identically distributed, a sufficient statistic for estimating p from the samples (x 1 , x 2 , . . . , x M ) is the x n . The most obvious estimator for an obtained sequence 14 of samples x is thenL but this estimator has infinite bias; since as long as p = 1, there is always a nonzero chance that m(x) = 0, in which caseL naive (x) = −∞ (and thus E L naive = −∞).
This divergence can be fixed in multiple ways; in the main text we usê Note that any estimator based on the fixed sampling policy will always produce biased estimates of log p, as guaranteed by the reasoning in Section 3 below. As an empirical validation, we show in Appendix B.1 that our results do not depend on the specific choice of estimator for fixed sampling (Equation 11).

Inverse binomial sampling
For inverse binomial sampling we note that, since x is a binary variable, the policy will always result in a sequence of samples of the form where the length of the sequence is a stochastic variable, which we label K (a positive integer). Moreover, since each sample is independent and a 'hit' with probability p, the length K follows a geometric distribution with parameter 1 − p, We convert a value of K into an estimate for log p using the IBS estimator, Crucially, Equation 14 combined with the IBS policy provides a uniformly unbiased estimator of log p (de Groot, 1959). Moreover, we can show thatL IBS is the uniformly minimum-variance unbiased estimator of log p under the IBS policy. For a full derivation of the properties of the IBS estimator, we refer to Appendix A.1. Equation 14 can be written compactly asL IBS (K) = ψ(1) − ψ(K), where ψ(z) is the digamma function (Abramowitz and Stegun, 1948).
We now provide an understanding of why fixed sampling is not a good policy, despite its intuitive appeal, and then show why IBS solves many of the problems with fixed sampling.

Why fixed sampling fails
We summarize in Figure 1 the properties of the IBS estimator and of fixed sampling, for different number of samples M , as a function of the trial likelihood p. In particular, we plot the expected number of samples, the bias, and the standard deviation of the estimators.
The critical disadvantage of the fixed sampling policy with M samples is that its estimates of the log-likelihood are inevitably biased (see Figure 1B). Fixed sampling is 'inevitably' biased because the bias decreases as one takes more samples, but for p → 0, the estimator remains biased. More precisely, in a joint limit where M → ∞, p → 0 and pM → λ for some constant λ, the bias collapses onto a single 'master curve' (see Figure 2; and Appendix A.2 for the derivation). In particular, we observe that the bias is close to zero for λ 1 and that it diverges when λ 1, or equivalently, for M To convey the intuition for why the bias diverges for small probabilities, we provide a gambling analogy. Imagine playing a slot machine and losing the first 100 bets you make. You can now deduce that this slot machine likely has a win rate less than 1%, but there is no way of knowing whether it is 1%, 0.1%, 0.01% or even 0% apart from any prior beliefs you may have (for example, you expect that the house has stacked the odds in their favor but not overwhelmingly so). In practice, this uncertainty is unlikely to affect your decision whether to continue playing the slot machine, since the expected value of the slot machine depends linearly on its win rate. However, if your goal is to estimate the logarithm of the win rate, the difference between these percentages becomes infinitely large as the true win rate tends to 0. We provide a more formal treatment of the bias of fixed sampling in Appendix A.2.

Why fixed sampling cannot be fixed
The asymptotic analyses above suggest an obvious solution to prevent fixed sampling estimators from becoming strongly biased: make sure to draw enough samples so that Although this solution will succeed in theory, it has practical issues. First of all, choosing M requires knowledge of p i on each trial, which is equivalent to the problem we set out the solve in the first place. Moreover, even if one can derive or estimate an upper bound on 1 p i (for example, in behavioral models that include a lapse rate, that is a nonzero probability of giving a uniformly random response), fixed sampling will be inefficient. As shown in Figure 2, the bias inL fixed is small when λ ≈ 1 or M ≈ 1 p and increasing M even further has diminishing returns, at least for the purpose of reducing bias. If we choose M inversely proportional to the probability p i on the trial where the model is least likely to match the observed response, we will draw many more samples than necessary for all other trials.
One might hope that in practice the likelihood p i is approximately the same across trials, but the opposite is true. As an example, take a typical 'orientation discrimination' psychophysical task in which a participant has to detect whether a presented oriented grating is tilted clockwise or anti-clockwise from vertical, and consider a generative model for the observer's responses that includes sensory measurement noise and lapses (see Section 5.2 for details). Moreover, imagine that the experiment contains ≈ 500 trials, and the participant's true lapse rate is 1%. The model will always assign more probability to correct responses than errors, so, for all correct trials, p i will be at least 0.5. However, there will likely be a handful of trials where the participant lapses and makes a grave error (responding incorrectly to a stimulus very far from the decision boundary), in which case p i will be 0.5. This hypothetical scenario is not exceptional, in fact it is almost inevitable in any experiment where participants occasionally make unpredictable responses, and perform hundreds or more trials.
A more sophisticated solution would relax the assumption that M needs to be constant for all trials, and instead choose M as a function of p i on each trial. However, since p i is unknown, one would need to first estimate p i by sampling, choose M i for each trial, then re-estimate L i . Such an iterative procedure would create a non-fixed sampling scheme, in which M i adapts to p i on each trial. This approach is promising, and it is, in fact, how we originally arrived at the idea of using inverse binomial sampling for log-likelihood estimation, while working on the complex cognitive model described in Section 5.4.
Finally, a heuristic solution would be to disregard any statistical concerns, pick M based on some intuition or from similar studies in the literature, and hope that the bias turns out to be negligible. We do not intend to dissuade researchers from using such pragmatic approaches if they work in practice. Unfortunately, this one does not. As Figure 2 shows, estimating log-likelihoods with fixed sampling can cause biases of 1 or more points of model evidence if the data set contains even a single trial on which p i ≤ 1 2M . Since differences in log-likelihoods larger than 5 to 10 points are often regarded as strong evidence for one model over another (Kass and Raftery, 1995;Jeffreys, 1998;Anderson and Burnham, 2002), it is well possible for such biases to reverse the outcome of a model comparison analysis. This point bears repeating; if one uses fixed sampling to estimate log-likelihoods and the number of samples is too low, one risks of drawing conclusions about scientific hypotheses that are not supported by the experimental data one has collected.

Is inverse binomial sampling really better?
While one could expect that the unbiasedness of the IBS estimator would come at a cost, such as more samples, a much higher variance, or perhaps a particularly complex implementation, we show here that IBS is not only unbiased, but it is sample-efficient, its estimates are low-variance, and can be implemented in a few lines of code.

Implementation
We present in Algorithm 1 a description in pseudo-code of the basic IBS procedure to estimate the log-likelihood of a given parameter vector θ for a given data set and generative model. The procedure is based on the inverse binomial sampling scheme 20 introduced in Section 2.4, generalized sequentially to multiple trials.
For each trial, we draw sampled responses from the generative model, given the stimulus s i in that trial, using the subroutine sample from model, until one matches the observed response r i . This yields a value of K i on each trial i, which IBS converts to an estimateL i (where we use the convention that a sum with zero terms equals 0). We make our way sequentially across all trials, returning then the summed log-likelihood estimateL IBS for the entire data set.
Algorithm 1 Inverse Binomial Sampling (sequential implementation) while sample from model(M,θ,s i ) = r i do 4: Return total log-likelihood estimate In practice, depending on the programming language of choice, it might be useful to take advantage of numerical features such as vectorization to speed up computations.
An alternative 'parallel' implementation of IBS is described in Appendix C.1. Implementations of IBS in different programming languages can be found at the following web page: https://github.com/lacerbi/ibs.

Computational time
The number of samples that IBS takes on a trial with probability p i is geometrically distributed with mean 1 p i . We saw earlier that for fixed-sampling estimators to be approximately unbiased, one needs at least 1 p i samples, and IBS does exactly that. Moreover, since IBS adapts the number of samples it takes on different trials, it will be considerably more sample-efficient than fixed sampling with constant M across trials. For example, in the aforementioned example of the orientation discrimination task, when most trials have a likelihood p i ≥ 0.5, IBS will often take just 1 or 2 samples on those trials. Therefore, it will allocate most of its samples and computational time on trials where p i is low and those samples are needed.

Variance
The derivation of the variance of the IBS estimator is similar to the calculation of its expected value (see Equation 25 in Appendix A.1), which yields where we introduced the dilogarithm or Spence's function Li 2 (z) (Maximon, 2003).
The variance (plotted in Figure 1C as standard deviation) increases when p → 0, but it does not diverge; instead, it converges to π 2 6 . Therefore, IBS is not only uniformly unbiased, but its variance is uniformly bounded.
We already mentioned thatL IBS is the minimum-variance unbiased estimator of log p given the inverse binomial sampling policy, but it also comes close (less than ∼ 30% distance) to saturating the information inequality, which specifies the minimum variance that can be theoretically achieved by any estimator under a non-fixed sampling policy (an analogue of the Cramer-Ráo bound; de Groot, 1959). We note that fixed sampling eventually saturates the information inequality in the limit M → ∞, but as mentioned in the previous section, the fixed-sampling approach can be highly wasteful or substantially biased (or both), not knowing a priori how large M has to be across trials. See Appendix A.3 for a full discussion of the information inequality and comparison between estimators.
Equation 15 has theoretical relevance, but requires us to know the true value of the likelihood p, which is unknown in practice. Instead, we define the estimator of the variance of a specific IBS estimate, having sampled for K times until a 'hit', as where ψ 1 (z) is the trigamma function, the derivative of the digamma function (Abramowitz and Stegun, 1948). We derived Equation 16 from a Bayesian interpretation of the IBS estimator, which can be found in Appendix A.4.

Iterative multi-fidelity
We define a multi-fidelity estimator as an estimator with a tunable parameter that affords different degrees of precision at different computational costs (i.e., from a cheaper, inaccurate estimate to a very accurate but expensive one), borrowing the term from the literature on computer simulations and surrogate models (Kennedy and O'Hagan, 2000;Forrester et al., 2007). IBS provides a particularly convenient way to construct an iterative multi-fidelity estimator in that we can perform R independent 'repeats' of the IBS estimate at θ, and combine them by averaging, Importantly, we do not need to perform all R repeats at the same time, but we can iteratively refine our estimates whenever needed, and only need to store the current estimate, its variance and the number of repeats performed so far: Crucially, while a similar procedure could be performed with any estimator (including fixed sampling), the fact that IBS is unbiased and its variance is bounded ensures that the combined iterative estimator is also unbiased and eventually converges to the true value for R → ∞, with variance bounded above by π 2 6R .
Finally, we note that the iterative multi-fidelity approach described in this section can be extended such that, instead of having the same number of repeats R for all trials, one could adaptively allocate a different number of repeats R i to each trial so as to minimize the overall variance of the estimated log-likelihood (see Appendix C.2).

Bias or variance?
In the previous sections, we have seen that IBS is always unbiased, whereas fixed sampling can be severely biased when using too few samples. However, with the right choice of M , fixed sampling can have slightly lower variance. We now list several practical and theoretical arguments for why bias matters more than variance, and being unbiased is a highly desirable property for an estimator of the log-likelihood.
1. To use IBS or fixed sampling to estimate the log-likelihood of a given data set, we sum estimates of L i across trials. Basic rules of probability imply that, as N → ∞, the standard deviation ofL (θ) will grow proportional to √ N , whereas the bias grows linearly with N .
2. When using the log-likelihood (or a derived metric) for model selection, it is common to collect evidence for a model, possibly hierarchically, across multiple datasets (e.g., different participants in a behavioral experiment), which provides a second level of averaging that can reduce variance but not bias.
3. Besides model selection, the other key reason to estimate log-likelihoods is to infer parameters of a model, for example via maximum-likelihood estimation. For this purpose, one would use an optimization algorithm that calls the routine that estimatesL (θ) many times with different candidate values of θ, and uses this information to estimate the value that maximizes L (θ). Powerful, sample-efficient optimization algorithms, such as those based on Bayesian optimization, work by building a statistical approximation (a surrogate) of the objective function (Jones et al., 1998;Snoek et al., 2012;Shahriari et al., 2015;Acerbi and Ma, 2017), most commonly via Gaussian processes (Rasmussen and Williams, 2006). These methods can operate successfully with noisy objectives by effectively averaging function values from nearby parameter vectors. By contrast, no optimization algorithm can handle bias. This argument is not limited to maximum-likelihood estimation, as recent methods have been proposed to use Gaussian process surrogates to perform (approximate) Bayesian inference and infer posterior distributions (Kandasamy et al., 2015;Acerbi, 2018;Järvenpää et al., 2019); also these techniques can handle variance in the estimates but not bias.
4. The ability to combine unbiased estimates of known variance iteratively (as described in Section 4.4) is particularly useful with adaptive fitting methods based on Gaussian processes, whose algorithmic cost grows super-linearly in the number of distinct training points (Rasmussen and Williams, 2006). Thanks to iterative multi-fidelity estimation, these methods would have the opportunity to refine their estimates of the log-likelihood at a previously evaluated point, whenever deemed useful, without incurring an increased algorithmic cost. 5. On a conceptual level, bias is much more dangerous than variance. Bias can cause researchers to confidently draw false conclusions, variance causes decreased statistical power and lack of confidence. Appropriate statistical tools can account for variance and explain seemingly conflicting findings resulting from underpowered studies (Maxwell et al., 2015), whereas bias is impossible to recognize or correct no matter what statistical techniques one uses.

Higher-order moments
So far, we have considered the mean (or bias) and variance ofL fixed andL IBS in detail, but ignored any higher-order moments. This is justified since to estimate the loglikelihood of a model with a given parameter vector, we will sum these estimates across many trials. Therefore, the central limit theorem guarantees that the distribution of L (θ) is Gaussian with mean and variance determined by the mean and variance of L fixed orL IBS on each trial, at least as long as the distribution of p i across trials satisfies a regularity condition. 3 A sufficient but far from necessary condition is that there exists a lower bound on p i , which is the case for example for a behavioral model with a lapse rate. Using the same argument, the total number of samples K tot that IBS uses to estimate L (θ) is also approximately Gaussian.
In the following, we demonstrate empirically that the distributions of the number of samples taken by IBS and of the estimatesL IBS are Gaussian. Importantly, we also show that the estimate of the variance from Equation 16 is calibrated. As a realistic scenario, we consider the psychometric function model described in Section 5.2. For each simulated data set, we estimated the log-likelihood under the true data-generating parameters θ true (see Appendix B.1 for details). Specifically, for each data set we ran IBS and recorded the estimated log-likelihoodL IBS , the total number of samples K tot taken, and a Bayesian estimate for the variance ofL IBS from Equation 16. For the total number of samples K tot and theL IBS estimate, we can compute the theoretical mean 3 Specifically, the Lindeberg (1922) or Lyapunov conditions (Ash et al., 2000, Chapter 7. is calibrated. and variance by knowing the trial likelihoods p i , which we can evaluate exactly in this example. For each obtained K tot , we computed a z-score by subtracting the exact mean and dividing by the exact standard deviation, obtained by knowing the mean and variance of geometric random variables underlying the samples taken in each trial. If K tot is normally distributed, we expect that the variable z across data sets should appear to be distributed as a standard normal, z ∼ N (0, 1). If K tot is not normally distributed, we should see deviations from normality in the distribution of z, especially in the tails. By comparing the histogram of z-scores with a standard normal in Figure 3A, we see that the total number of samples is approximately normal.
We did the same analysis for the estimateL IBS , using the z-scored variable where here Var[L IBS ] is the exact variance of the estimator computed via Equation 15.
The histogram of z-scores in Figure 3B is again very close to a standard normal.
Finally, in practical scenarios we do not know the true likelihoods, so the key ques-  Figure 3C shows an excellent match with a standard normal, confirming that our proposed estimator of the variance is well calibrated.

Numerical experiments
In this section, we examine the performance of IBS and fixed sampling on several realistic model-fitting problems of increasing complexity. The example problems we consider here model tasks drawn from psychophysics and cognitive science: an orientation discrimination experiment (Section 5.2); a change localization task (Section 5.3); and playing a four-in-a-row game that involves complex sequential decision making (Section 5.4). For the first problem, we can derive the exact analytical expression for the log-likelihood; for the second problem, we have an integral expression for the loglikelihood that we can approximate numerically; and finally, for the third problem, we are in the true scenario in which the log-likelihood is intractable.
First, we describe the procedure used to perform our numerical experiments. Code 29 to run all our numerical experiments and analyses is available at the following link: https://github.com/basvanopheusden/ibs-development.

Procedure
For each problem, we simulate data from the generative model given different known settings θ true of model parameters, and we compare the accuracy (and other statistics) of both IBS and fixed sampling in recovering the true data-generating parameters through maximum-likelihood estimation. Since these methods provide noisy and possibly biased estimates of L (θ), and due to variability in the simulated datasets, the estimates θ MLE that result from optimizing the log-likelihood will also be noisy and possibly biased. To explore performance in a variety of settings, and to account for variability in the data-generation process, for each problem we consider 40 · D different parameter settings, where D is the number of model parameters (that is, the dimension of θ), and for each parameter setting we generate 100 distinct synthetic datasets.
For each dataset, we compare fixed sampling with different numbers of samples  (Audet and Dennis Jr, 2006), which affords a fast, robust exploration of the function landscape via Gaussian process surrogates. BADS has been shown to be much more effective than alternative optimization methods particularly when dealing with stochastic objective functions, and with a relatively limited budget of a few hundreds to a few thousands function evaluations (Acerbi and Ma, 2017).

Orientation discrimination
The first task we simulate is an orientation discrimination task, in which a participant observes an oriented patch on a screen, and indicates whether they believe it was rotated leftwards or rightwards with respect to a reference line (see Figure 4A). Here, on each trial the stimulus s is the orientation of the patch with respect to the reference (in degrees), and the response r is 'rightwards' or 'leftwards'.
For each dataset, we simulated N = 600 trials, drawing on each trial the stimulus s from a Gaussian distribution with mean 0 • (the reference) and standard deviation 3 • .
The generative model assumes that the observer makes a noisy measurement x of the stimulus, which is normally distributed with mean s and standard deviation σ, as per standard signal detection theory (Green and Swets, 1966). They then respond 'rightwards' if x is larger than µ (a parameter which captures response bias, or an incorrect memory of the reference line) and 'leftward' otherwise. However, a fraction of the time, given by the lapse rate γ ∈ (0, 1], the observer guesses randomly. We visually illustrate the model in Figure 4B. For both theoretical reasons and numerical convenience, we parametrize the slope σ as η ≡ log σ. Thus, the model has parameter vector θ = (η, µ, γ).
We can derive the likelihood of each trial analytically: where Φ(x) is the cumulative normal distribution function. Equation 20 takes the form of a typical psychometric function (Wichmann and Hill, 2001). Note that in this section we use Gaussian distributions for circularly distributed variables, which is justified under the assumption that both the stimulus distribution and the measurement noise are small. For more details about the numerical experiments, see Appendix B.1. In Figure 5, we show the parameter recovery using fixed sampling, IBS and the exact log-likelihood function from Equation 20. First, we show that IBS can estimate the sensory noise parameter η and lapse rate γ more accurately than fixed sampling while using on average the same or fewer samples ( Figure 5A,D). For visualization purposes, we show here a representative example with R = 1 or R = 3 repeats of IBS and M = 10 or M = 20 fixed samples (see Figure 15 in the Appendix for the plots with all tested values of R and M ). As baseline, we also plot the mean and standard deviation of exact maximum-likelihood estimation, which is imperfect due to the finite data size (600 trials), and stochasticity and heuristics used in the optimization algorithm. We omit results for estimates of the response bias µ, since even fixed sampling can match the performance of exact MLE with only 1 sample per trial.
Next, we fix η true ≡ log σ true = log 2 • , µ true = 0.1 • , γ true = 0.1 and plot the mean and standard deviation of the estimatedη andγ across 100 simulated data sets as a function of the (average) number of samples per trial used by IBS or fixed sampling ( Figure 5B,E). Fixed sampling is highly sensitive to the number of samples, and with less than 20 samples per trial, its estimate of η is severely biased. Estimating γ remains impossible even with 100 samples per trial. By contrast, IBS estimates η and γ reasonably accurately regardless of of the number of samples per trial. IBS has a slight tendency to underestimate γ, which is a result of an interaction of the uncertainty handling in BADS with our choice of model parametrization and parameter bounds. In general, estimating lapse rates is notoriously prone to biases (Prins, 2012). simulated data sets. We also plot the RMSE of exact maximum-likelihood estimation, which is nonzero since we simulated data sets with only 600 trials. D-F Same, for γ.
These results demonstrate that IBS estimates parameters of the model for orientation discrimination more accurately than fixed sampling using equally many or even fewer samples.
34 as many as 100 samples per trial to become approximately unbiased. IBS outperforms fixed sampling for both parameters and any number of samples, and even with as few as 2 or 3 repeats comes close to matching the RMSE of exact maximum-likelihood inference.

Change localization
The second problem we consider is a typical 'change localization' task (see Figure 6A), in which participants observe a display of 6 oriented patches, and after a short inter- The generative model assumes that participants independently measure the orientation of each patch in both displays. For each patch, the measurement distribution is a von Mises centered on the true orientation with concentration parameter κ, representing sensory precision. The participant then selects the patch for which the absolute circular difference of the measurements between the first and second display is largest. This model too includes a lapse rate γ ∈ (0, 1], the probability with which the participant guesses uniformly randomly across responses. Since thinking in terms of concentration parameter is not particularly intuitive, we reparametrize participants' sensory noise as η ≡ log σ ≡ − 1 2 log κ, since in the limit κ 1, the von Mises distribution with concentration parameter κ tends to a Gaussian distribution with standard deviation σ = 1 √ κ . The model has then two parameters, θ = (η, γ).
We can express the trial likelihood for the change localization model in an integral form that does not have a known analytical solution (see Appendix B.2 for a derivation).
We can, however, evaluate the integral numerically, which can take a few seconds for a high-precision likelihood evaluation across all trials in a dataset. The key quantity in the computation of the trial likelihood is ∆  Figure 6B.
As expected, the probability of a correct response increases monotonically with the amount of change, with the slope being modulated by sensory noise and the asymptote by the lapse rate (but also by the sensory noise, for large noise, as we will discuss later). Interestingly, maximum-likelihood estimation via the 'exact' method provides bi-ased estimates of η when the noise is high. This is because sensory noise and lapse become empirically non-identifiable for large η, as large noise produces a nearly-flat response distribution, which is indistinguishable from lapse. This produces a ridge in the likelihood landscape, which may induce biases and spurious trade-offs between the two parameters. This issue can be ameliorated by using Bayesian inference instead of maximum-likelihood estimation (Acerbi et al., 2014). IBS and fixed sampling perform seemingly better here because the interaction between a ridge (or flat region) in the true likelihood landscape and noisy estimates thereof depend on specifics of the problem and of the optimization procedure. For these particular settings of θ true , IBS and fixed sampling perform better at recovering η than the 'exact' method, but it does not necessarily hold true in general.
In Figure 7B,E, we show the estimates of fixed sampling and IBS for simulated data with η true ≡ log σ true = log 17.2 • and γ true = 0.1, and find that fixed sampling severely underestimates η when using less then 50 samples, and underestimates γ even with 100 samples per trial. By contrast, IBS produces parameter estimates with relatively little bias and standard deviation close to that of exact maximum-likelihood estimation.
Finally, in Figure 7C,F we show that IBS has lower RMSE than fixed sampling for both parameters when compared on equal terms of number of samples.

Four-in-a-row game
The third problem we examine is a complex sequential decision-making task, a variant of tic-tac-toe in which two players compete to place 4 pieces in a row, column or diagonal on a 4-by-9 board (see Figure 8A) showed that people's decision-making process in this game can be modeled accurately as heuristic search. A heuristic search algorithm makes a move in a given board state by searching through a decision tree of move sequences starting at that board state for a number of moves into the future. To decide which candidate future moves to include in the tree, the algorithm uses a value function defined as in which f i denotes a set of n f features (i.e., configurations of pieces on the board, such as 'three pieces on a row of the same color'; see Figure 8B), w i ∈ R the corresponding feature weights, and σ > 0 is a model parameter which controls value noise. As before, we parameterize the model with η ≡ log σ. The interpretation of this value function is that moves which lead to a high value V (board, move) are given priority in the search algorithm, and the model is more likely to make those moves. As a heuristic to reduce the search space, any moves for which the value V (board, move) is less than that of the highest-value move minus a threshold parameter ξ > 0 are pruned from the tree and never considered as viable options.

A B
Finally, when evaluating V (board, move), the model stochastically omits features from the sum Note that even though the 4-in-a-row task is a sequential game, the heuristic search model makes an independent choice on each move, with the 'stimulus' s on each trial being the current board state. Hence, the model satisfies the conditional independence assumptions of Equations 2 and 6. Note also that, even though the heuristic search algorithm can be specified as a generative 'simulator' which we can query to make moves in any board position, we have no way of calculating the distribution over its moves, since this would require integrating over all possible trees it could build, features which may be dropped, and realizations of the value noise. Therefore, we are in the scenario in which log-likelihood estimation is only possible by simulation, and we cannot compare the performance of fixed sampling or IBS to any 'exact' method.
To generate synthetic data sets for the 4-in-a-row task, we first compiled a set of  Figures 5 and 7, for the 4-in-a-row experiment and estimates of the value noise η ≡ log σ, pruning threshold ξ and feature drop rate δ.
In Figure 9, we perform the same tests as before, comparing fixed sampling and IBS, but lacking any 'exact' estimation method. Due to the high computational complexity of the model, we only consider IBS with up to R = 3 repeats, corresponding to ∼ 80 samples. The full results with all tested values of R and M are reported in Figure 17 in the Appendix. As a specific example for Figure 9B,E,H we show the estimates of fixed sampling and IBS for simulated data with η true ≡ log σ true = log 1, pruning threshold ξ true = 5 and δ true = 0.2.
Fixed sampling underestimates the value noise η, even when using M = 100 samples, whereas IBS estimates it accurately with 4 times fewer samples ( Figure 9A). This bias of fixed sampling gets worse with fewer samples ( Figure 9B), and overall, IBS outperforms fixed sampling when compared on equal terms ( Figure 9C). The same holds true for the pruning threshold ξ. IBS estimates ξ about equally well as fixed sampling, but with about half as many samples ( Figure 9D), fixed sampling is severely biased when using too few samples ( Figure 9E) and overall, IBS outperforms fixed sampling.
The results are slightly more complicated for the feature drop rate δ. As before, fixed sampling produces severely biased estimates of δ with up to 35 samples (Figure 9G), and the bias increases when using fewer samples ( Figure 9H). However, for this parameter IBS is also biased, but towards 0.25 ( Figure 9G & H), which is the midpoint of the 'plausible' upper and lower bounds used as reference by the optimization algorithm (see Appendix B.3 for details). This bias can be interpreted as a form of regression towards the mean; likely a by-product of the optimization algorithm struggling with a low signal-to-noise ratio for this parameter and these settings (i.e., a nearly flat likelihood landscape for the amount of estimation noise on the log-likelihood). The negative bias of fixed sampling helps to reduce its variance in the low-δ regime, and therefore in terms of RMSE, fixed sampling compares favorably with IBS regardless of its bias ( Figure 9I).

Log-likelihood loss
In the previous sections, we have analyzed the bias and error of different estimation methods when recovering the generating model parameters in various scenarios. Another important question, crucial for model selection, is how well different methods are able to recover the true maximum log-likelihood. The ability to recover the true parameters and the true maximum log-likelihood are related but distinct properties because, for example, a relatively flat likelihood landscape could yield parameter estimates very far from ground truth, but still afford recovery of a value of the log-likelihood close to the true maximum. We recall that differences in log-likelihood much greater than one point are worrisome as they might significantly affect the outcomes of a model comparison (Kass and Raftery, 1995;Jeffreys, 1998;Anderson and Burnham, 2002).
To compute the log-likelihood loss of a method for a given data set, we estimate the difference between the 'exact' log-likelihood evaluated at the 'true' maximumlikelihood solution (as found after multiple optimization runs) and the 'exact' loglikelihood of the solution returned by the multi-start optimization procedure for a given method, as described in Section 5.1. In terms of methods, we consider IBS and fixedsampling with different amounts of samples. We perform the analysis for the two scenarios, orientation discrimination (Section 5.2) and change localization (Section 5.3), for which we have access to the exact likelihood, either analytically or numerically.
The results in Figure 10 show that IBS, even with only a few repeats, is able to return solutions which are very close to the true maximum-likelihood solution in terms

Summary
The results in this section demonstrate that in realistic scenarios, fixed sampling with too few samples causes severe biases in parameter and maximum log-likelihood estimates, whereas inverse binomial sampling is much more accurate and robust to the number of samples used. Across all 3 models and all parameters, IBS yields parameter estimates with little bias and variance close to that of 'exact' maximum-likelihood estimation, even when using only a handful of repeats (R between 1 and 5). Conversely, fixed sampling yields severely biased parameter estimates when using too few samples per trial, especially for parameters which control decision noise, such as measurement noise and lapse rates in the two perceptual decision-making tasks, and value noise in the 4-in-a-row task. Moreover, for the two models for which we have access to 'exact' log-likelihood estimates, we found that IBS is able to recover maximum-likelihood solutions close to the true maximum log-likelihood, whereas fixed sampling remains severely biased even for many samples.
It is true that, given a large enough number of samples, fixed sampling is eventually able to recover most parameters and maximum log-likelihood values with reasonable accuracy. However, we have seen empirically that the number of samples required for reliable estimation varies between tasks, models and parameters of interests. For tasks and models where an exact likelihood or a numerical approximation thereof is unavailable, such as the problem we examined in Section 5.4, this limitation renders fixed sampling close to useless. By contrast, IBS automatically chooses the number of samples to allocate to the problem.
Finally, for complex models with a large response space, accurate parameter estimation with fixed sampling will require many more samples per trial than are feasible given the computational time needed to generate them. Therefore, in such scenarios accurate and efficient parameter estimation is only possible with IBS.
46 log-likelihood of simulation-based models given an experimental data set. We demonstrated that estimates from IBS are uniformly unbiased, their variance is uniformly bounded, and we introduced a calibrated estimator of the variance. IBS is sampleefficient and, for the purpose of maximum-likelihood estimation, combines naturally with gradient-free optimization algorithms that handle stochastic objective functions, such as Bayesian Adaptive Direct Search (BADS; Acerbi and Ma, 2017). We compared IBS to fixed sampling and showed that the bias inherent in fixed sampling can cause researchers to draw false conclusions when performing model selection. Moreover, we showed in three realistic scenarios of increasing complexity that maximum-likelihood estimation of model parameters is more accurate with IBS than with fixed sampling with the same average number of samples.
In the rest of this section, we discuss additional applications of IBS, possible extensions, and give some practical usage recommendations.

Additional applications
We developed inverse binomial sampling for log-likelihood estimation of likelihoodfree models, for the purpose of model comparison or fitting model parameters with maximum-likelihood estimation, but IBS has other practical uses.

Checking analytical or numerical log-likelihood calculations
We presented IBS as a solution for when the log-likelihood is intractable to compute analytically or numerically. However, even for models where the log-likelihood could be specified, deriving it can be quite involved and time-consuming, and mistakes in the calculation or implementation of the resulting equations are not uncommon. In this scenario, IBS can be useful for: • quickly prototyping (testing) of new models, as writing the generative model and fitting it to the data is usually much quicker than deriving and implementing the exact log-likelihood; • checking for derivation or implementation mistakes, as one can compare the supposedly 'exact' log-likelihood against estimates from IBS (on real or simulated data); • assessing the quality of numerical approximations used to calculate the log-likelihood, for example when using methods such as adaptive quadrature for numerical integration (Press et al., 1992).

Estimating entropy and other information-theoretic quantities
We can also use inverse binomial sampling to estimate the entropy of an arbitrary discrete probability distribution Pr(x), with x ∈ Ω, a discrete set (see, e.g., Cover and Thomas, 2012, for an introduction to information theory). To do this, we first draw a sample x from the distribution, then use IBS to estimate log Pr(x). The first sample and the samples in IBS are independent, and therefore we can calculate the expected value of the outcome of IBS, which is the definition of the negative entropy of Pr(x).
We can use this technique to estimate the entropy of the predicted response distribution of a generative model with a given parameter vector on any trial. For example, such quantity could be used in a behavioral model to test for the generalized Hick-Hyman law, that states that reaction time is proportional to the entropy of the available choices (Hyman, 1953). Moreover, we can generalize the method to estimate the crossentropy between two distributions (sample from one, estimate log-likelihood with the other), or the Kullback-Leibler divergence between distributions. We note that all the estimates of these quantities are also uniformly unbiased. 4

Bayesian inference
In this paper we focused on maximum-likelihood estimation, but another common approach to parameter estimation is Bayesian inference (Gelman et al., 2013). Bayesian inference has the goal of computing the posterior distribution of the parameters given the observations, computed as 4 The lack of bias in entropy estimates by IBS may be surprising in light of a theorem stating that uniformly unbiased estimators of the entropy given a finite set of samples cannot exist (Paninski, 2003).
This theorem does not apply to IBS since its sample size is a stochastic variable. It does, however, prove that one cannot estimate entropy (or similar information-theoretic quantities) with fixed sampling.
where Pr(D|θ) is the likelihood, p(θ) the prior density of the parameters (typically assumed continuous), and Z the normalization constant, known as the evidence or marginal likelihood, a quantity used for Bayesian model selection due to a number of desirable properties (MacKay, 2003). Since Z is often hard to compute, many (approximate) Bayesian inference techniques are able to calculate the posterior distribution by having access only to the unnormalized posterior, or joint distribution Pr(D|θ)p(θ); or equivalently to the log joint L(θ) + log p(θ). We see then that IBS could be used to perform Bayesian inference of likelihood-free models by providing a means to compute the log-likelihood in the log joint distribution (the prior is assumed to be a simple distribution which we can express in closed form).
In Appendix C.3, we describe how several approaches to approximate Bayesian inference could be used in conjunction with the unbiased log-likelihood estimates provided by IBS: Markov Chain Monte Carlo (Hastings, 1970;Brooks et al., 2011); variational inference (Jordan et al., 1999;Ranganath et al., 2014); and Gaussian process surrogate methods (Kandasamy et al., 2015;Järvenpää et al., 2019), including Variational Bayesian Monte Carlo (Acerbi, 2018(Acerbi, , 2019. Finally, note that the techniques in this paper can be easily applied to maximuma-posteriori (MAP) estimation -which is not quite Bayesian inference, but more like a regularized form of maximum-likelihood, that still yields a point estimate instead of a full posterior distribution. MAP estimation is attained by simply adding the logprior to the log-likelihood in the optimization objective, where the log-prior acts as a regularization term. 50

Approximate IBS for continuous responses
So far, we have assumed that the space of possible responses is discrete. This assumption is necessary since, for continuous responses, the probability that a sample from the generative model exactly matches an observed response is zero (technically, near-zero since any computer implementation of a real number is finite). For this reason, IBS will never terminate, or at least not within a physically sensible time scale.
A simple approach to make continuous responses discrete is via binning the response space. Alternatively, we recommend an approach inspired by Approximate Bayesian Computation (ABC; Beaumont et al., 2002), which we call Approximate IBS (AIBS). Given a metric D(·, ·) to measure distance in response space, and a tolerance threshold ε > 0, we can use IBS to estimate where ther i are responses drawn from the generative model, and |B ε (r i )| denotes the volume of the set of responses whose distance from r i is no more than ε.
The ε-approximate log-likelihood in Equation 24 can then be used as normal for maximum-likelihood estimation or Bayesian inference. As ε → 0, the approximate likelihood tends to the true likelihood, under some regularity conditions which we leave to explore for future work (see Prangle 2017 for a similar proof for ABC). However, the expected number of samples used by IBS diverges in that limit, so in practice there is a lower bound for ε that is feasible and one needs to extrapolate to the ε = 0 limit, or be satisfied to perform inference with an ε-approximate likelihood.
The common idea between AIBS and ABC is that they both use a distance metric to 51 judge similarity between simulated samples and data. However, ABC commonly bases the comparison on summary statistics of the data (which may not be sufficient statistics, and thus not capture all aspects of the data); whereas AIBS uses the full responses. Secondly, ABC in practice requires dedicated algorithms to perform parameter estimation and inference (basic techniques, such as rejection sampling, can be extremely inefficient); whereas AIBS simply provides a (noisy) log-likelihood, which can then be used in combination with a wider class of likelihood-based inference methods, as long as they support noisy estimates (see Appendix C.3 for some examples). We leave a further analysis of AIBS, and a comparison with other likelihood-free inference approaches, as a promising direction for future work.

Usage recommendations
We conclude with a number of recommendations for researchers who want to fit a model to a data set, having access only to a simulator or generative model.
• First, try to derive a closed-form analytic expression for the log-likelihood of the model. If this is tractable, validate that the log-likelihood is free of implementation mistakes by comparing its output against log-likelihood estimates obtained by IBS with well-chosen test trials and model parameters.
• If exact analytics are intractable, find an analytical or numerical approximation, for example using variational inference or Riemannian integration, and once again validate the quality of the approximation using IBS.
• If the model is too complex for analytical or numerical approximations, estimate the log-likelihood using inverse binomial sampling.
• Finally, perform inference using the analytical, numerical, or IBS-based loglikelihood function with a sample-efficient inference algorithm, such as those based on Gaussian process surrogate modeling. For maximum-likelihood (or maximum-a-posteriori) estimation, hybrid Bayesian optimization methods have proved to be quite effective (Acerbi and Ma, 2017).

Avoiding infinite loops
One issue of IBS is that it can 'hang', in the sense that the implementation of the estimator can run indefinitely, without returning an answer, if the simulator is unable to match a particularly unlikely observation. This is a natural behavior of IBS that stems from its efficiency in allocating samples, as we examined in Section 4.2. We recommend two easy solutions to avoid infinite loops: • Implement a 'lapse rate' γ ∈ (0, 1) in the simulator model, which represents the probability of a completely random response (typically uniform across all possible responses). The lapse rate could be fixed to a small, non-zero value (e.g., γ = 0.01), or let as a free model parameter; in which case, ensure that the minimum lapse rate is a small, non-zero value (e.g., γ min = 0.005).
• Introduce an early-stopping threshold, such that IBS stops sampling when the estimated log-likelihood of the entire data set goes below a threshold L lower (see Appendix C.1).
We implemented both of these solutions in our analyses in Section 5.

A Further theoretical analyses A.1 Why inverse binomial sampling works
We start by showing that the inverse binomial sampling policy described in Section 2.4, combined with the estimatorL ibs (Equation 14), yields a uniformly unbiased estimate of log p. This derivation follows from de Groot (1959, Theorem 4.1), adapted to our special case of estimating log p instead of an arbitrary function f (p): The first equality is the definition ofL ibs (Equation 14), using the notational convention that 0 k=1 = 0. In the second equality we introduce the indicator function 1 k<K which is 1 when k < K and 0 otherwise. The third equality follows by linearity of the expectation and the fourth directly from the definition of the indicator function. The fifth and second-to last equality uses the formula for the cumulative distribution function of a geometric variable, that is Pr(K ≤ k) = 1 − (1 − p) k , and thus Pr( The final equality is the definition of the Taylor series of log p expanded around p = 1.
Note that this series converges for all p ∈ (0, 1].
In the derivation above, we can replace 1 k by an arbitrary set of coefficients a k and show that for all p for which the resulting Taylor series converges. Equation 26 immediately proves two corollaries. First, we can use the inverse binomial sampling policy to estimate any analytic function of p. Second, since we can rewrite any estimatorL(K) as K−1 k=1 a k , and since Taylor series are unique, a k = 1 k is the only choice for which E L (K) equals log p. In other words,L ibs is the only uniformly unbiased estimator of log p with the inverse sampling policy. Therefore, it trivially is the uniformly minimumvariance unbiased estimator under this policy, since no other unbiased estimator exist.

A.2 Analysis of bias of fixed sampling
We provide here a more formal analysis of the bias of fixed sampling. We initially consider the estimatorL fixed defined by Equation 11 in the main text, but we will see that our arguments hold generally for any estimator based on a fixed sampling policy.
We showed in Figure 2 that in the regime of p → 0, M → ∞, while keeping pM → λ, the bias ofL fixed tends to a master curve. This follows since, in this limit, the binomial distribution Binom λ M , M converges to a Poisson distribution Pois(λ) and therefore the bias converges to which is the master curve in Figure 2. In particular, the bias is close to zero for λ 1 and it diverges when λ 1, or equivalently, for M This asymptotic behavior is not a coincidence. In fact, it is mathematically guaranteed since the Fisher information of Pois(λ) equals 1 λ and the reparametrization identity for the Fisher information yields I f (log λ) = λ (Lehmann and Casella, 2006). In the limit of p 1 M , which corresponds to λ 1, this Fisher information vanishes and the outcome of fixed sampling simply provides zero information about log λ or log p.
Therefore, any estimates of log p are not informed by the data and instead are a function of the regularization chosen in the estimatorL fixed (Equation 11). Note that the argument above does not invoke the specific form of the estimator, and therefore holds for any choice of regularization.
We can express the problem with fixed sampling more clearly using Bayesian statistics, in a formal treatment of the 'gambling' analogy we presented in the main text. The 'correct' belief about log λ given the outcome of fixed sampling (m) is quantified by the posterior distribution p (log λ|m), which is a product of the likelihood Pr (m|log λ) and a prior p(log λ). In the limit λ 1, the Poisson distribution converges to a Kronecker delta distribution concentrated on m = 0. In other words, almost surely none of the samples taken by the behavioral model will match the participant's response.
When m = 0, the likelihood equals exp(−λ), which is mostly flat (when considered as a function of log λ, see Figure 11) for log λ ∈ [−∞, −2] and therefore our posterior belief ought to be dominated by the prior p(log λ) and become independent of the data.
Therefore, we once again conclude that in the limit p approximately flat for all log λ ≤ −2. Since λ is defined as p M , this implies that the posterior distribution over p will be dominated by a prior rather than evidence, as quantified by the likelihood.

A.3 Estimator variance and information inequality
We proved in Section A.1 thatL IBS is the minimum-variance unbiased estimator of log p given the inverse binomial sampling policy. Here we show that the estimator also comes close to saturating the information inequality, the analogue of a Cramer-Ráo bound for an arbitrary function f (p) and a non-fixed sampling policy (de Groot, 1959), In our case, where f (p) = log p, the information inequality reduces to Std(L IBS ) ≥ √ 1 − p. In Figure 12, we plot the standard deviation of IBS compared to this lower bound.
It may be disappointing that IBS does not match the information inequality. Kolmogorov (1950)  and those estimators can saturate the information equality. Dawson (1953) and later de Groot (1959) showed that if an unbiased estimator of a non-polynomial function f (p) exists and it matches the information inequality, it must use the inverse binomial sampling policy. Moreover, de Groot derived necessary and sufficient conditions for f (p) to allow such estimators (de Groot, 1959). Therefore, the standard deviation in IBS is close (within 30%) to its theoretical minimum.
To compare the variance of IBS and fixed sampling on equal terms, we use the scaling behavior ofL fixed as M → ∞. Specifically, for fixed sampling, we plot √ M × Std(L fixed ) and for IBS we plot 1 √ p × Std(L IBS ) (see Figure 13). With this scaling, the curves for fixed sampling again collapse onto a master curve 5 . Note that repeatedsampling IBS estimatorsL IBS-R (see Section 4.4), obtained by averaging multiple IBS estimates, overlap with the curve for regular IBS for any R.
5 These curves converge pointwise on (0, 1] and uniformly on any interval (ε, 1], but not uniformly on All these curves increase and diverge as p → 0, reflecting the fact that estimating log-likelihoods for small p is hard. The standard deviation of fixed sampling is always lower than that of IBS, especially when p → 0 (specifically when p 1 M ). In other words, fixed sampling produces low-variance estimators exactly in the range in which its estimates are biased, as guaranteed by the Cramer-Ráo bound. However, in the large-M limit, fixed sampling does saturate the information inequality, so its master curve lies slightly below IBS. In other words, if one is able to draw so many samples that bias is no issue, then fixed sampling provides a slightly better trade-off between variance and computational time. However, in Section C.2, we discuss an improvement to IBS which decreases its variance by a factor 2-20, in which case IBS is clearly superior.

A.4 A Bayesian derivation of the IBS estimator
In Sections 3 and A.2 we hinted at a Bayesian interpretation of the problem of estimating log p. We show here that indeed we can see the IBS estimator as a Bayesian point estimate of log p with a specific choice of prior for p. For the rest of this section, we use q to denote the likelihood of a trial (instead of p); that is q is the parameter of the Bernoulli distribution and log q the quantity we are seeking to estimate. We changed notation to avoid confusion with expressions such as the prior probability of q, which is p(q).
Let K be the number of samples until a 'hit', as per the IBS sampling policy. Following Bayes' rule, we can write the posterior over q given K as where we used the fact that Pr(K|q) follows a geometric distribution, and we assumed a Beta(α, β) prior over q.
In particular, let us compute the posterior mean of log q under the Haldane prior, Beta(0, 0) (Haldane, 1932). Thanks to the 'law of the unconscious statistician', we can compute the posterior mean of p(log q|K) directly from Equation 29, where the first row follows from setting α = 0 and β = 0; it can be shown that the third row is the expectation of log q for a Beta distribution, with ψ(z) the digamma function (Abramowitz and Stegun, 1948); and the last equality follows from the relationship between the digamma function and harmonic numbers, that is ψ(n) = −γ + where γ is Euler-Mascheroni constant. We also used the notational convention that 0 k=1 a k = 0 for any a k . Note that the last row is equal to the IBS estimator,L IBS (K), as defined in Equation 14 in the main text.
Crucially, Equation 29 shows that we can recover the IBS estimator as the posterior mean of log q given K, under the Haldane prior for q. This interpretation allows us to also define naturally the variance of our estimate for a given K, as the variance of the posterior over log q, where ψ 1 (z) is the trigamma function, the derivative of the digamma function; the equality follows from a known expression for the variance of log q under a Beta distribution for q.

B Experimental details
In this section, we report details for the three numerical experiments described in the main text and supplementary results.

B.1 Orientation discrimination
The parameters of the orientation discrimination model are the (inverse) slope, or sensory noise, represented as η ≡ log σ, the bias µ, and the lapse rate γ. The logarithmic representation for σ is a natural choice for scale parameters (and more in general, for positive parameters that can span several orders of magnitude). to the algorithm to where the global optimum is likely to be, and are used by BADS, for example, to draw a set of initial points to start building the surrogate Gaussian process, and to set priors over the Gaussian process hyperparameters (Acerbi and Ma, 2017). Here we also use the plausible bounds to select ranges for the parameters used to generate simulated datasets, and to initialize the optimization, as described below. To generate synthetic data sets, we select 120 'true' parameter settings for the orientation discrimination task as follows. We set the baseline parameter θ 0 as η = log 2 • , µ = 0.1 • , and γ = 0.1. Then, for each parameter θ j ∈ {η, µ, γ}, we linearly vary the value of θ j in 40 increments from PLB j to PUB j as defined in Table 1 (e.g., from −1 • to 1 • for µ), while keeping the other two parameters fixed to their baseline value. For each one of the 120 parameter settings θ true defined in this way, we randomly generated stimuli and responses for 100 datasets from the generative model, resulting in 12000 distinct data sets for which we know the true generating parameters.
We evaluated the log-likelihood with the following methods: fixed sampling with When estimating parameters using IBS or fixed sampling, we enabled the 'uncertainty handling' option in BADS, informing it to incorporate measurement noise into its model of the objective function. Note that during the optimization, the algorithm iteratively infers a single common value for the observation noise σ obs associated with the function values in a neighborhood of the current point (Acerbi and Ma, 2017). A future extension of BADS may allow the user to explicitly provide the noise associated with each data point, which is easily computed for the IBS estimates (Equation 16 in the main text), affording the construction of a better surrogate model of the log-likelihood.

Alternative fixed sampling estimator
In the main text, we considered the fixed-sampling estimator defined by Equation 11.
We performed an additional analysis to empirically validate that our results do not depend on the specific choice of estimator for fixed sampling (as expected given the theoretical arguments in Section 3).
An alternative way of avoiding the divergence of fixed sampling is to correct samples that happen to be all zeros, for example witĥ for some 0 < m min < 1, which sets a lower bound for the log-likelihood equal to log (m min /M ). We then performed our analyses of the orientation discrimination task using theL fixed-bound estimator with m min = 1 2 . As shown in Figure 14, the results are remarkably similar to what we found using the fixed-sampling estimatorL fixed defined by Equation 11.

Complete parameter recovery results
For completeness, we report in Figure 15 the parameter recovery results for fixed sampling, inverse binomial sampling and 'exact' analytical methods for the orientation discrimination task, for all tested number of samples M and IBS repeats R. All estimates were obtained via maximum-likelihood estimation using the Bayesian Adaptive Direct Search (Acerbi and Ma, 2017), as described previously in this section.

B.2 Change localization
First, we derive the trial likelihood of the change localization model. Assuming that the change happens at location c ∈ {1, . . . , 6}, by symmetry we can write c ) is the absolute circular distance between the true orientations of patch c in the first and second display. We can derive an expression for s ; θ by marginalizing over the circular distance between the respective measurements, where we have defined ∆ (i) i ) and we suppressed the dependence on κ to simplify the notation. The first term in this equation is the probability density function (pdf) of the circular distance between two von Mises random variables whose centers are ∆ (j) s apart. The second term simplifies, since ∆ (i) x for all i = j are all independent and identically distributed. Therefore, we can rewrite this equation as where ∆ is an auxiliary variable generated by taking the absolute circular difference between two von Mises random variables that are centered at 0 with concentration parameter κ. The second term of the integrand, therefore, is the fifth power of the cumulative distribution function (cdf) of ∆. We can compute the distribution of the circular distance between two von Mises random variables analytically, but the cdf is non-analytic. Moreover, the integral in Equation 35 is analytically intractable as well.
We can, however, evaluate it numerically via trapezoidal integration (see Figure 6B).
We now describe the settings used for maximum-likelihood estimation. The model parameters are the sensory noise, represented as η ≡ log σ (with σ = 1 √ κ ), and the lapse rate γ, with bounds defined in Table 2. We use the same procedure and settings for BADS as for the orientation discrimination task (see Section B.1). For IBS, we use an early-stopping threshold of L lower = −N log 6, and we use repeats R ∈

Complete parameter recovery results
We report in Figure 16 the parameter recovery results for fixed sampling, inverse binomial sampling and 'exact' methods for the change localization task, for all tested 69 number of samples M and IBS repeats R. For this task, the exact method relies on numerical integration. severely biased for both the measurement noise η and the lapse rate γ, whereas IBS is accurate for η and biased for γ, but still much less biased than fixed sampling.

B.3 Four-in-a-row game
The four-in-a-row game model parameters are the value noise η ≡ log σ, the pruning threshold ξ, and the feature dropping rate δ, with bounds defined in Table 3. We use the same procedure and settings for BADS as for the orientation discrimination task (see   for the specific patterns). C act is a parameter which scales the value of features belonging to the active or passive player. The parameter γ tree is inversely proportional to the size of the tree built by the algorithm, and λ is the lapse rate, that is the probability of a uniformly random move among the available squares (note that for the other models we denoted lapse rate as γ; here we use the variable naming from van Opheusden et al., 2016). See van Opheusden et al. (2016) for more details about the model and its parameters.

Complete parameter recovery results
We report in Figure 17 the parameter recovery results for fixed sampling and inverse binomial sampling for the 4-in-a-row task, for all tested number of samples M and IBS repeats R. For this task, there is no 'exact' method to evaluate the log-likelihood.
C Improvements of IBS and further applications C.1 Early stopping threshold One downside of inverse binomial sampling is that the computational time it uses to estimate the log-likelihood is of the order of 1 p , which is equal to exp (− log p) = exp (−L).
In other words, IBS spends exponentially more time on estimating log-likelihoods of poorly-fitting models or bad parameters. This implies that an optimization algorithm that uses IBS allocates more computational resources to estimating the objective function L (θ) for parameter vectors θ where the objective is low. However, the value of the objective at such poor parameter vectors are unlikely to affect its estimate of the location or value of the maximum, so the optimizer (BADS in our case) is wasting time.
It may be possible to develop optimization algorithms that take into account the expo- nentially large cost of probing points where the objective function is low, but we can circumvent the problem by amending IBS with a criterion that stops sampling when it realizes thatL (θ) will be low.
In Section 4.1, we introduced a basic implementation of IBS for estimating the loglikelihood of multiple trials, by sequentially computing the log-likelihood of each trial.
However, another way to implement multi-trial IBS (a 'parallel' implementation) is to draw one sample from the simulator model for each trial, then set K i = 1 for each trial where the sample matches the participant's response. For all other trials, draw a second sample from the model, and if that matches the response, set K i = 2. Finally, repeat this process until no more trials remain. We illustrate this sampling scheme graphically in Figure 18.
After each iteration, we then computê where K is the iteration number, I match is the set of trials where we found a matching sample and N remaining is the number of remaining trials. This valueL K is decreasing and by construction converges to N i=1L i,IBS as K → ∞. Therefore, wheneverL K exceeds a lower bound L lower , we are guaranteed thatL IBS will exceed that bound too. When it does, we stop sampling and return L lower as estimate of L (θ). This does introduce bias into the estimate, but since we bound the total log-likelihood, the bias will be exponentially small in N as long as the true value L (θ) − L lower is nonzero.
In practice, we recommend using as lower bound the log-probability of the data implementation of multi-trial IBS is 'columns-first', to sample model responses for each trial until a hit, and only then move to the next trial. However, a more convenient sampling method is 'rows-first', and sample one response for each trial with k = 1, then one response for each trial with k = 2, excluding trials 2 and 4 since the first sample was a hit, and continue increasing k until all trials reach a hit. This method allows for early stopping and a parallel processing.

75
Sections 5.2 and 5.3, the log-likelihood of a chance model is −N log 2 and −N log 6, respectively. For the 4-in-a-row game presented in Section 5.4 the log-likelihood of chance depends on the number of pieces on each board position; we chose an average value such that L lower = −N log 20. This new sampling scheme has an additional advantage: since on each iteration we independently sample from the generative model on multiple trials, we can potentially run these computations in parallel.

C.2 Reducing variance by trial-dependent repeated sampling
As we saw in Section 4.4, a simple method to improve the estimate of IBS is to run the estimator multiple times and average the results. Repeated sampling will preserve the zero bias but reduce variance inversely proportional to the number of repeats R.
We can further improve the estimator by varying the number of repeats R i between trials, for 1 ≤ i ≤ N , and definê where R is a vector of positive integers with elements R i , andL We can then ask what is the best allocation of repeats R i that minimizes the variance of the estimator in Equation 37 such that the expected total number of samples does not exceed a fixed budget S. This defines the following constrained optimization problem, where we used Equation 15 for the variance of the IBS estimator.
Assuming that the R i take continuous values, we can solve the optimization problem in Equation 38 exactly using a Lagrange multiplier, and find the following closed-form expression for the optimal number of repeats per trial, According to Equation 39, the optimal choice of repeats entails dividing the budget S across trials, where trial i is allocated repeats proportional to p i Li 2 (1 − p i ). We plot this function in Figure 19 and see that, to minimize variance, we should allocate resources primarily to trials where p i is close to 1 2 and avoid trials where p i ≈ 1 (since the variance of IBS is already small for those trials) or where p i ≈ 0 (since those utilize a larger share of the budget).
We can also calculate exactly the fractional increase in precision (inverse variance) when using the optimal choice of repeats vector R * , compared to a constant R which divides the budget equally across trials, This equation implies that the gain in precision from this method depends on the distribution of p i across trials. If p i ∼ Uniform[0,1] and N = 500, the median precision gain is 1.584 and the inter-quartile range (IQR) is 1.375 to 2.090. Note that the gain is always greater than 1, unless p i is constant across trials.  Figure 19: Graph of pLi 2 (1 − p), which is proportional to the optimal number of repeats for a trial with likelihood p (see equation 39). We observe that the optimal allocation of computational resources entails repeated sampling for trials with p ≈ 1 2 and to avoid p ≈ 0 or p ≈ 1.

Practical implementation of trial-dependent repeated sampling
In practice, Equation 39 cannot be applied directly, as we have treated R i as continuous variables, but the number of times to repeat IBS on a given trial has to be an integer.
Additionally, the method is only unbiased if R i is at least 1 for each trial i. Therefore, we can convert R * i to integers by rounding up to the nearest integer. This method will make our solution approximate, and reduce the gain in precision, but it is still better than uniform repeats for uniformly distributed p i (median: 1.567, IQR: 1.374-2.002).
The derivation above has another, more fundamental problem. Computing R * i requires knowledge of p i on each trial, which we do not have. While we could try and learn the allocation of R * as a function of θ in some adaptive way, in practice we recommend the following simple scheme: 1. Choose a default parameter vector θ 0 , and run IBS with a large number of repeats (e.g, R = 100) to estimate the (log)-likelihood of the model on each trial.
2. Compute the optimal repeats R * i given the estimated likelihoodsp i and a total budget of expected samples S per likelihood evaluation, and round up.
3. Run IBS with those fixed repeats per trial on each iteration of the optimization algorithm.
This approach implicitly assumes that the log-likelihood will be correlated across trials between the generative model with parameter vector θ 0 and any other vector θ probed by the optimization algorithm. This is usually the case, since low-probability trials are often those where something unexpected occurred (e.g., the participant of a behavioral experiment lapsed or otherwise made an error). In our experience, this scheme considerably reduces the variance of IBS for a given computational time budget.

C.3 Bayesian inference with IBS
While the main text focused on maximum-likelihood estimation, the unbiased loglikelihood estimates provided by IBS can also be used to perform Bayesian inference of posterior distributions over model parameters. We describe here a few possible approaches to approximate Bayesian inference with IBS.

Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC; see e.g. Brooks et al., 2011) is a powerful class of algorithms that allows one to sequentially sample from a target probability density which is known up to a normalization constant (e.g., the joint distribution). A popular form of MCMC is known as Metropolis-Hastings (MH;Hastings, 1970), which explores the target distribution by drawing a sample from a 'proposal distribution' centered on the last sample (e.g., a multi-variate Gaussian). MH 'accepts' or 'rejects' the new sample with an acceptance probability that depends on the value of the target density at the proposed and at the last point. In case of acceptance, the new point is added the sequence of samples; otherwise, the last sample is repeated in the sequence. Under some conditions, the MH algorithm produces a (correlated) sequence of samples that are equivalent to draws from the target density. Crucially, and somewhat surprisingly, the MH algorithm is still valid (that is, produces a valid sequence) if one performs the comparison with a noisy but unbiased estimate of the target density as opposed to using the exact density (Andrieu et al., 2009).
One problem here is that IBS provides an unbiased estimate of the log-likelihood (and thus of the log target density); not of the likelihood. However, since the IBS estimates of the log-likelihood are nearly-exactly normally distributed (see Section 4.3), the distribution of the likelihood is log-normal. Thus, we can apply what is known as a 'convexity correction' and compute a (nearly) unbiased estimate of the likelihoodˆ (θ) by calculating the expected value of a log-normal variable, that iŝ (θ) = exp L (θ) + 1 2 Var L (θ) .
Equation 41 can be easily evaluated with IBS, using the expression for the variance of the IBS estimator (Equation 16).

Variational inference
An alternative class of approximate inference methods is based on variational inference (VI; Jordan et al., 1999). The goal of VI is to approximate the intractable posterior distribution with a simpler distribution q(θ) belonging to a chosen parametric family.
A common choice is a multivariate normal with diagonal covariance (known as mean field approximation); but other choices are possible too. VI selects the 'best' approximation q(·) that minimizes the Kullback-Leibler divergence with the true posterior, or equivalently maximizes the following variational objective, where H [q] is the entropy of q(·), which we assume can be computed analytically or numerically. Crucially, we can obtain an unbiased estimate of the first term in Equation 42 (the expected log joint) with IBS, as we have seen in Section 6.1. The optimization of the variational objective can then be performed directly with derivative-free optimization methods (such as BADS), or via a technique that produces unbiased estimates of the gradient combined with variance-reduction tricks, called black-box variational inference (Ranganath et al., 2014).

Gaussian process surrogate methods
One issue with the approximate inference methods described above is that they require a large (possibly, very large) number of likelihood evaluations to converge. Thus, these approaches are unfeasible if the generative model is somewhat computationally expensive, as it is often the case. An alternative family of methods designed to deal with expensive likelihoods builds a Gaussian process approximation (a surrogate) of the log joint distribution, and uses it to actively acquire further points in a smart way, similarly to the approach of Bayesian optimization (Kandasamy et al., 2015;Acerbi, 2018;Järvenpää et al., 2019). However, unlike Bayesian optimization, the goal here is not 81 to optimize the target function, but instead to build an accurate approximation of the posterior distribution, with as few likelihood evaluations as possible.
IBS is particularly suited to be used in combination with Gaussian process surrogate methods as it provides both an unbiased estimate of the log-likelihood, and a calibrated estimate of the uncertainty in each measurement, which can be used to inform the Gaussian process observation model.