Prepaid parameter estimation without likelihoods

In various fields, statistical models of interest are analytically intractable and inference is usually performed using a simulation-based method. However elegant these methods are, they are often painstakingly slow and convergence is difficult to assess. As a result, statistical inference is greatly hampered by computational constraints. However, for a given statistical model, different users, even with different data, are likely to perform similar computations. Computations done by one user are potentially useful for other users with different data sets. We propose a pooling of resources across researchers to capitalize on this. More specifically, we preemptively chart out the entire space of possible model outcomes in a prepaid database. Using advanced interpolation techniques, any individual estimation problem can now be solved on the spot. The prepaid method can easily accommodate different priors as well as constraints on the parameters. We created prepaid databases for three challenging models and demonstrate how they can be distributed through an online parameter estimation service. Our method outperforms state-of-the-art estimation techniques in both speed (with a 23,000 to 100,000-fold speed up) and accuracy, and is able to handle previously quasi inestimable models.


Introduction
Models without an analytical likelihood are increasingly used in various disciplines, such as genetics [2], ecology [31,6], economics [19,7] and neuroscience [28]. For models without an analytical or easily computable likelihood, parameter estimation is a major challenge for which a variety of solutions have been proposed [31,2,12]. All these methods have in common that they rely on extensive Monte Carlo simulations and that their convergence can be painstakingly slow. As a result, the current methods can be very time consuming.
To date, the practice is to analyze each data set separately. However, considering all the calculations that have ever been performed during parameter estimation of a particular type of model, for each different data set, one cannot help but notice an incredible waste of resources. Indeed, simulations performed while estimating one data set may also be relevant for the estimation of another. Currently, each researcher estimating the same model with different data will start from scratch, and not benefit from all the possibly relevant calculations that have already been performed in earlier analyses by other researchers, in other locations, on different hardware, and for other data sets, but concerning the same model. Therefore, our proposal is to combine resources and create a giant online database of parameter values coupled to simulated data. Consequently, (cloud based) interpolation techniques and global optimization methods can be used on the previously created (hence, prepaid) database for accurate and fast parameter estimation on any device. Statistical analyses that currently take up hours to days of computation time on dedicated hardware are now available to everyone within seconds.
In Figure 1 we present a graphical illustration of the prepaid parameter estimation method. First (panel A), for a representative number of parameter vectors θ, large data sets are simulated, compressed into summary statistics (i.e., s sim ) and saved -creating the prepaid grid. This prepaid grid is computed beforehand and the results are stored at a central location. Second (panel B1), the observed (data) summary statistics (s obs ) are compared to the simulated (data) summary statistics (i.e., s sim ) using an appropriate objective loss function d s sim , s obs and a number of nearest neighbor simulated summary statistics are selected. The loss function is related to the loss function used in the generalized method of moments [10] and method of simulated moments [8].
Third (panel B2), interpolation methods are used to find the relation s = f (θ) between the parameter values and the summary statistics for the selected points of the previous step [9,20]. In this paper, we use tuned least squares support vector machines, LS-SVM [25]. Finally (panel B3), the objective loss function d s pred , s obs , now using predicted summary statistics s pred , is minimized as a function of the unknown parameter values using an optimizer.
Three important aspects of the prepaid method deserve special mention. First, the parameter space is required to be bounded. If this is unnatural for a given parametrization, then the parameters have to be appropriately transformed to a bounded space. Second, we typically start from a uniform distribution of parameter vectors in the final parameter space. This choice reflects on the uniformity of the grid's resolution, but has no further implications provided the grid is sufficiently dense. Bayesian priors can be implemented without recreating the prepaid grid, since the prior can be taken into account in the loss function. Third, often a user is not interested in a single instance of a model, but rather has data from several experimental conditions that share some common parameters but assume other ones to be different. Also in these cases the prepaid grid does not need to be recreated, as the parameter constraints can be included through priors with tuning parameters (i.e., penalties).
The performance of the prepaid method can be studied theoretically in simple situations (see Methods 5.1). In what follows, the prepaid method will be applied to three more complicated, realistic scenarios.  Figure 1: Graphical illustration of the prepaid parameter estimation method. 4 3 Results

Example 1: The Ricker model
In a first example, we apply our prepaid method to the Ricker model [27,31] which describes the dynamics of the number of individuals y t in a species over time (with t = 1 to T obs ): where e t ∼ N 0, σ 2 . The variables N t (i.e., the expected number of individuals at time t) and e t are hidden states. Given an observed time series {y t } T obs t=1 , we want to estimate the parameters θ = {r, σ, φ}, where r is the growth rate, σ the process noise and φ a scaling parameter. The Ricker model can demonstrate near-chaotic or chaotic behavior and no explicit likelihood formula is available.
Wood [31] used the synthetic likelihood to estimate the model's parameters. In the original synthetic likelihood approach (denoted as SL Orig ), the assumed multivariate normal distribution of the summary statistics is used to create a synthetic likelihood. The mean and covariance matrix of this normal distribution are functions of the unknown parameters and are calculated using a large number of model simulations. The synthetic likelihood is proportional to the posterior distribution from which is sampled using MCMC and a posterior mean is computed.
Wood's synthetic likelihood SL Orig approach is compared to the prepaid method, where we create a prepaid grid of the mean and the covariance matrix of a similar set of summary statistics. Prepaid estimation comes in multiple variants, depending on the use of an interpolation method. The first, which uses only the prepaid grid points and chooses the nearest neighbor (maximum synthetic likelihood) as final estimate, will be called SL Grid ML . The second, SL SVM ML , uses LS-SVM to interpolate between the parameters in the prepaid grid to increase accuracy. The differential evolution algorithm (a global optimizer; [24]) is used to maximize this interpolated synthetic (log)likelihood. Additional details on the implementation of the synthetic likelihood can also be found in Methods 5.2. (1) SL Orig as in [31], the prepaid method (2) with (SL SVM ML ), and (3) without (SL Grid ML ) interpolation. As can be seen in Figure 2, the prepaid estimation techniques lead to better results than the synthetic likelihood for T obs = 1, 000, both in accuracy and speed. The SL Orig method leads to some clear outliers (see Methods 5.2 ) which testifies to possible convergence problems (probably due to local minima). The prepaid method suffers much less from this problem. Most striking is the speed up of the prepaid method: The SL Grid ML version of the prepaid estimation is finished before a single iteration of the 30,000 iterations in the synthetic likelihood method has been completed -100,000 times faster. In addition, it is demonstrated that the coverages of the prepaid method confidence intervals are very close or exactly equal to the nominal value (we look at 95% bootstrap-based confidence intervals).
SVM interpolation is mainly helpful for large T obs , where one expects a higher accuracy of the estimates and the grid is too coarse. The analyses with large T obs could only be completed in a reasonable time using the prepaid method (See Methods 5.2 for more detailed information).
In the application above, the tacitly assumed prior on the parameter space is uniform. In addition, there is only one data set for which a single triplet of parameters (r, σ, φ) needs to be estimated. In Methods 5.2, we show how both limitations can be relaxed. First, it is explained how different priors for the Ricker model can be implemented. Second, it is discussed what can be done if there are two data sets (i.e., conditions) for which it holds that r 1 = r 2 and σ 1 = σ 2 but φ 1 and φ 2 are not related.
Finally, we also tested our estimation process on the population dynamics of the Chilo partellus, extracted from Figure 1 in Taneja and Leuschner [26,32]. Here we found that r = 1.10 (95% confidence interval 1.06-1.34), σ = 0.43 (95% confidence interval 0.30 -0.54) and φ = 140.60 (95% confidence interval = 43.94 -208.19). We found similar results using the synthetic likelihood method (see Methods 5.2), but our estimation was 4000 times faster.  Equation 1). The RMSE and time are based on 100 test data sets with T obs = 1000. The three colors represent the three parameters (blue for r, red for σ and yellow for φ). Solid lines represent the SL Orig approach, dashed lines the SL Grid ML approach (using only nearest neighbors), and dotted lines the SL SVM ML approach (using interpolation). The stars and the dots represent the time needed for the SL Grid ML and the SL SVM ML estimation, respectively. The estimates for SL Orig are posterior means, based on the second half of the finished MCMC iterations.

Example 2: A stochastic model of community dynamics
A second example we use to illustrate the prepaid inference method is a trait model of community dynamics [14] used to model the dispersion of species. For this model (see also Methods section), there are four parameters to be estimated: I, A, h, and σ. As with the first application, there is no analytical expression for the likelihood [14].
As an established benchmark procedure for this trait model, we apply the widely used Approximate Bayesian Computation (ABC) method [4,30,23,1] as implemented in the Easy ABC package and denoted here as ABC Orig PM (PM stands for posterior means, which will be used as point estimates) [16]. As priors, we use uniform distributions on bounded intervals for log(I), log(A), h and log(σ) (see Methods 5.3 for the exact specifications), but this can be easily changed as explained for the first example.
To allow for a direct comparison with the ABC method, and to illustrate the versatility of the prepaid method, we have also implemented three Bayesian versions of the prepaid method. The first, SL Grid PM , creates a posterior proportional to the prepaid synthetic likelihood. The second, ABC Grid PM , saves not only, the mean and covariance matrix of the summary statistics for every parameter in the prepaid grid, but also a large set of uncompressed summary statistics. Using these statistics we are able to approximate an ABC approach.
The third, ABC SVM PM , again interpolates between the grid points to achieve a higher accuracy. In a third example, we apply our method to stochastic accumulation models for elementary decision making.
In this paradigm, a person has to choose, as quickly and accurately as possible, the correct response given a stimulus (e.g., is a collection of points moving to the left or to the right). Task difficulty is manipulated by applying different levels of stimulus ambiguity.
A popular neurally inspired model of decision making is the Leaky Competing Accumulator (LCA [29]).
For two response options, two noisy evidence accumulators (stochastic differential equations, see Methods section) race each other until one of them reaches the required amount of evidence for the corresponding option to be chosen. The time that is required to reach that option's threshold is interpreted as the associated 8 choice response time. For different levels of stimulus difficulty, the model produces different levels of accuracy and choice response time distributions. The evidence accumulation process leading up to these choices and response times is assumed to be indicative of the activation levels of neural populations involved in the decision making.
As in the first two examples, there is no analytical likelihood available that can be used to estimate the parameters of the LCA. Moreover, the LCA is an extremely difficult model to estimate. To the best of our knowledge, only [21] systematically investigated the recovery of the LCA parameters, but for a slightly different model (with three choice options) and with a method that is impractically slow for very large sample sizes, making it difficult to show near-asymptotic recovery properties with. For an experiment with four stimulus difficulty levels, the LCA model has nine parameters. However, after a reparametrization of the model (but without a reduction in complexity), it is possible to reduce the prepaid space to four dimensions (see Methods 5.4) and conditionally estimate the remaining subset of the parameters with a less computationally intensive method. Three variants of the prepaid method have been implemented: taking the nearest neighboring parameter set (based on a symmetrized χ 2 distance between distributions) on the prepaid grid (CHISQ Grid NN ), averaging over the grids nearest neighboring parameter sets of 100 non-parametric bootstrap samples (CHISQ Grid BS ), using SVM interpolation for every bootstrap estimate (CHISQ SVM BS ). A nearest neighbor or bootstrap averaged estimate completes in about a second on a Dell Precision T3600 (4 cores at 3.60GHz), an SVM interpolated estimate requires a couple of minutes extra.  Figure 4 shows detailed recovery scatter plots for a subset of the parameters for 1,200 observed trials, which is the typical size of decision experiments. To get better recovery, larger sample sizes have to be considered (see Methods section). In general, recovery is much better than what has been reported in [21]. The coverage of the method, based on non-parametric bootstrapping, is satisfactory for all sample sizes, provided SVM interpolated estimates are used for T obs > 100000. In addition, we do not find evidence for a fundamental identification issue with the two option LCA, as has been stated in [21].

Discussion
In three examples, we have demonstrated the efficacy and versatility of the prepaid method. The prepaid method is at least as accurate as current methods, but many times faster (23,000 to 100,000-fold speed up).
Besides the improvements at the level of speed and accuracy, the prepaid method has a number of other distinct advantages. First, the prepaid method can be used for a very large number of observations, contrary to the synthetic likelihood or ABC methods. The use of very large simulated data sets allows investigation of large-sample properties of the estimator, which is a problem for the synthetic likelihood and ABC. Second, because of the enormous speed improvement and having data sets available across the whole parameter space, the prepaid method allows for fast yet extensive testing of recovery of simulated data across this space -the recovery of every single parameter set can be evaluated. Such a practice leads to detailed internal quality control of the used estimation algorithm.
Although the idea behind the prepaid method is fairly simple, we want to anticipate a few misconceptions that might arise. First, as has been demonstrated in the context of the Ricker model, the prepaid method can easily deal with different priors and with equality constraints on parameters, without the need to recreate the underlying prepaid grid. Second, the observed data based on which the model parameters have to be estimated can be of any size, again without the need to recreate the prepaid grid for each and every sample size.
Ideally, the prepaid databases and the corresponding estimation algorithms will be constructed and made available by a team of experts for the model at hand. Subsequently, a cloud based service can then be set up to offer high quality model estimations to a broad public of researchers. As an example, we created such a service for the Ricker model in Equation 1: www.prepaidestimation.org, where we allow the user to estimate the parameters of the Ricker model for personal data as well as 4 example data sets including one real life data set [26,32]. By using such a cloud based service, researchers that need their data analyzed with computationally challenging models, can avoid many of the pitfalls they would otherwise encounter venturing out on their own. This practice will also lead to increased reproducibility of computational results.
A first possible objection against the prepaid method is the considerable initial simulation cost (for the examples discussed, prepaid simulations took up to a couple of days on a 20-core processor). However, this overhead cost will dissipate entirely as increasingly more estimates are sourced from the same prepaid database. Moreover, the initial prepaid cost can be easily distributed across multiple interested parties. Further, because the database can be used for internal quality control, additional simulation studies investigating the recovery of parameters are made redundant.
A second possible objection is that the prepaid grid, unsurprisingly, does not escape the curse of dimensionality: The grid size grows exponentially with the number of parameters. The prepaid method is most effective for highly nonlinear models with substantively meaningful parameters, as they appear in various computational modeling fields. Thus, the number of parameters cannot be very large. However, this limitation can be alleviated in a number of ways. First, the use of interpolation techniques allows for a substantial reduction of the number of prepaid points (by a factor of five for the same accuracy in the trait model example; see Methods section). Second, as is shown in the LCA example, it is possible to only partially apply the prepaid method, estimating with normal techniques, the less challenging parameters conditionally on a prepaid grid of the more intricately connected ones. Third, as shown by tackling three challenging examples, current storage and throughput possibilities can accommodate realistically sized prepaid databases.
It is our strong belief that this method will massively democratize the use of many computationally expensive models, which are now reserved for people with access to specific high-end hardware (e.g., GPUs, HPC). Apart from such democratization, this approach could significantly impact the current work flow of scientific modeling, in which every part of the estimation is carried out locally by an individual researcher.

A toy example: Estimating the mean of a normal
For a very simple setting, we want to study the performance of the prepaid methods analytically.
Assume y i ∼ N (µ, s 2 ) (i = 1, . . . , T obs ) with the mean µ unknown (and to be estimated and the standard deviation s known (so number of parameters K = 1). The observed mean is denoted asȳ. We will explore two situations. In the first situation,ȳ will be our summary statistic s obs (hence number of summary statistics R = 1) to estimate µ (ȳ is also a sufficient statistic for µ). In the second situation, we will study what happens if s obs =ȳ 2 is chosen to be the summary statistic.
Situation 1: s obs =ȳ As a prepaid grid, we take N r evenly spaced µ-values with spacing or gap size For each value µ j , T sim values of y are simulated and the sample average is computed (i.e., y sim j ). Typically, T sim = 1000 or larger. Hence, every value of µ j is paired with a particularȳ sim j : (µ j ,ȳ sim j ).
Because of the linearity of the problem, we can safely assume that if T sim is large enough, the N selected µ values are all consecutive or nearly consecutive (because of noise in the prepaid simulation ofȳ sim , it can happen that the N selected µ values are not consecutive). We denote the average of these N µ-values as M µ . If all values are exactly consecutive, M µ can be expressed as In addition (assuming that all values are exactly consecutive), their variance V µ is given by Hence, their standard deviation is S µ ≈ ∆N 2 √ 3 . Using the N pairs, we assume as a linear interpolator in this example a linear regression model that links the simulated statistics to the true underlying µ:ȳ sim Tsim . Obviously, β 0 = 0 and β 1 = 1.
Givenȳ, N selected prepaid points and the fitted linear regression model, we know from linear regression theory that: where 0 and 1 are the true β 0 and β 1 and The distribution is assumed to hold for repeated simulations of the replicated statistics in the prepaid grid.
Because we work with linear regression, the optimization problem is simple. In this case, the optimal value of µ for a givenȳ can be found by inverting the regression line: Next, we can study the properties ofμ. We begin by calculating the conditional mean E(μ|ȳ) and conditional variance Var(μ|ȳ). Hence, we treat the observed data (or sample average) as given and fixed.
These expectations are taken over different simulations ofȳ sim j 's in the prepaid grid. Before giving the expressions, it is useful to note that 14 Now, using the approximations given in [22] for ratios of random variables, we find that: Invoking the double expectation theorem to arrive at the unconditional expectations, we have: where α = M µ − µ, that is, the difference between the true value and the mean of the selected nearest neighbors µ's. Likewise, we can derive: From Equation 2, we learn that if there is no systematic deviation in the selection of µ-grid points, the prepaid estimator is unbiased. In the other case, the is bias decreases with T sim but is proportional to s 2 . In Equation 3, the leading term of the variance is s 2 T obs , which is the same as in classical estimation theory. For the other terms, they all have T sim (or a power of it) in the denominator. Because T sim is usually quite large, these terms tend to be in general of lesser importance. However, some terms also have both N (the number of selected nearest neighbor grid points) and ∆ (the gap size). It is worthwhile to note that increasing the resolution (i.e., decreasing ∆), while keeping N constant, will increase the additional terms and thus add to the error. The reason for this is that the interpolation is defined on a too small grid, leading to uncertainty in the estimated regression. This effect is illustrated in the left panel of Figure 5 in which the root mean square error (RMSE) is shown for the estimation of µ for different values of N and ∆. The plot is constructed by means of a simulation study, but confirms our analytical results.
Situation 2: s obs =ȳ 2 In the second situation, we will again estimate µ (the unknown mean of a unit variance normal), but in this case s obs =ȳ 2 is used as a statistic. Thus, the relation between the simulated statisticsȳ sim 2 and µ is quadratic (and thus nonlinear). Again we use a local linear approximation. Clearly, this approximation will only be approximately valid if we do not choose the area of approximation too large.
However, unlike in the first situation, we do expect an additional effect of the approximation error.
No analytical derivations were made for this case, but we conducted a similar simulation study as in situation 1. The results (in terms of RMSE) are shown in the right panel of Figure 5. As can be seen, there is a clear optimality trade-off visible between ∆ and N . This can be explained as follows: Fix N and then consider the gap size ∆. If ∆ is too small, we get a similar phenomenon as in the left panel, that is a large RMSE. However, if we take ∆ too large, then the approximation error will dominate (because the linear interpolation misfits the quadratic relation). The optimal point will be different for different N .
This toy example demonstrates the sound theoretical foundations of the prepaid method in well-behaved situations. However, the question is how well the method performs for real life examples. Three hard problems will be studied next.

Application 1: The Ricker model
The basic model equations of the Ricker model is given in Equation 1.

Synthetic likelihood estimation
For the synthetic likelihood estimation (SL Orig ), we made use of the synlik package [5]. The synthetic likelihood l s for a data set with summary statistics s obs and a certain parameter vector θ = (r, σ, φ) is given by whereμ θ andΣ θ are the estimated mean and covariance of the summary statistics when Equation 1 is simulated multiple times with parameter θ.
The statistics used by the synthetic likelihood function were the average population size, the number of zeros, the autocovariances up to lag 5, the coefficients of the quadratic linear autoregression of y 0.3 t and the coefficients of the cubic regression of the ordered differences y t − y t−1 on the observed values.
For each data set we used the synthetic likelihood Markov chain Monte Carlo (MCMC) method with 30000 iterations, a burn in of 3 time steps and 500 simulations to compute eachμ θ andΣ θ [5]. We used the following prior: The synlik package generates the MCMC chain on a logarithmic scale, we estimated the parameters as the exponential of the posterior mean. To ensure convergence, only the last half of the chain is used (the last 15000 iterations).

Creation of the prepaid grid
For the prepaid estimation, we used the same summary statistics as for the traditional synthetic likelihood, except for two differences. First, the coefficients of the cubic regression of the ordered differences y t − y t−1 on the observed values could not be used, because the observed values are not available when creating the prepaid grid. Second, we changed the number of zeros to the percentage of zeros to make this statistic independent of T obs (as this may change depending on the observation).
We filled the prepaid grid with 100000 parameter sets using the priors of Equation 5. To cover this grid as evenly as possible (and avoiding too large gaps), the uniform distribution was approximated using Halton sequences [18,17]. For each parameter set in the prepaid grid, we simulated a time series of length 10 7 and used the summary statistics of this long time series asμ θ .
Each time series was then split into series of length T prepaid = 100,1000 and 10000 which were used to compute the covarianceΣ θ,Tprepraid for the statistics computed on data of these lengths. This means, for example, that we had 100000 series of length 100 to compute the covariance matrix for a certain parameter set for time series of length 100. If we need to estimate parameters of a time series with T obs not equal to one of the T prepaid lengths, we use the covariance matrix created with time series of length T prepaid which is closest to T obs in logarithmic scale and adapt the covariance matrix intô The creation of the prepaid grid took approximately one day on a 3.4GHz 20-core processor.
To allow the estimation for a bigger range of parameters for the online estimation at www.prepaidestimation.org we created a new and bigger prepaid grid using the following priors: We filled to prepaid grid with 100000 parameter sets and used this prior for the real life data set.

Prepaid estimation
Four variants of prepaid estimation were implemented for this example. All use the negative synthetic likelihood as distance (d s sim , s obs as defined in the main text and Figure 1). First, we do a nearest neighbor estimation SL GRID ML , without using any interpolation between the grid points of the prepaid data set. We compute the synthetic likelihood of all the prepaid parameters for the summary statistics of the test data set. The parameter vector with the highest likelihood, the so-called nearest neighbor may already be a good estimation. For a low number of time points T obs , it is to be expected that the error on the parameter estimate is much larger than the gaps in the prepaid grid, and in such a case, the SL GRID ML estimation approach suffices.
Second, a more accurate estimation can be acquired by interpolating between the parameter values in the prepaid grid (SL SVM ML ). Therefore, we learn the relation between the parameters and the summary statistics: f svm : θ → s. However, we only learn this relation in the region of interest, that is the 100 nearest neighbors according to the synthetic likelihood. For each summary statistic, we create, on the fly, a separate least squares support vector machine (LS-SVM) [25] using the 100 nearest neighbors. This machine learning technique is chosen as it is a fast non-linear method which generalizes well. We limit the predictions to the possible range of the summary statistics (e.g., to prevent a percentage of zeros, one of the statistics, larger than 1).
We then use the differential evolution global optimizer [24] to find the maximum of: whereΣ θ,T obs is the covariance matrix of the statistics of the nearest neighbor as defined in Equation 6.
The superscript "PP" is used to denote that we use the prepaid version of synthetic likelihood, and not the traditional version as used by [31] (see Equation 4). The optimization process is constrained and we use the minima and maxima for each parameter of the 100 nearest neighbors as effective bounds.
The SL SVM ML approach makes use of a non-linear black box interpolator. However, we may also consider using a much faster linear regression (see also the toy example in Section 5.1). Therefore, we will also compare the SL SVM ML (and SL Grid ML ) approach to a third option where we predict the summary statistics using a linear regression (called the SL Lin ML approach).
Third, we can easily implement a prior for the likelihood in Equation 4. This leads to a posterior given by p(θ|s obs ) ∝ p(θ)l s (θ).
The parameters will be estimated as the maximum a posteriori (MAP), as comparison to maximum likelihood estimation which is a maximum a posteriori with a uniform prior. Here we will apply this extension to the nearest neighbor estimation: SL GRID MAP .
Lastly we will show that our prepaid method can also be used to cover multiple experimental set ups.
Each experimental set up involves multiple conditions and may have varying constraints on the parameters of the model over these conditions. If, for one experimental set up, the conditions c are independent, the likelihood of the whole experiment is where l s,c (θ c ) is the synthetic likelihood for condition c. This is equivalent to estimating each parameter 20 set θ c individually for each condition c. If the conditions are not independent and we assume that some parameters are the same over conditions we adapt the prior to mimic these assumptions. For example, if we assume that the first parameter θ 1 is the same over all conditions C we formulate this as where N is the standard normal distribution andθ 1 is the average of all θ c 1 . The smaller the tuning parameter σ prior , the more all θ c 1 will be forced to be equal. If σ prior is too large the estimation will not take into account the interdependence between the conditions. However, if it is too small we run into trouble with the sparsity of the prepaid grid. In the limit, where σ prior goes to zero, one point is chosen from the prepaid grid leading to equal parameters not only for first parameter but also for the others. σ prior can be easily tuned by simulating the experimental set up for a certain prepaid grid.
To further illustrate, we will apply this method to the Ricker model, assuming two conditions across which r and σ stay the same such that this prior is given by We first use the nearest neighbor approach SL GRID ML to find the 1000 nearest neighbors for condition one and two separately and then we refine the parameters using prior 12. As we assume r and σ to be constant over the conditions, we taker andσ as final estimates for these parameters in each condition.

Test set
As a test set we first used 100 random parameters created with the prior of Equation 5. To avoid problems with the borders we deleted parameters that where within 1% range of the bounds. We simulated data sets for T obs = {10 2 , 5 · 10 2 , 10 3 , 10 4 , 10 5 }. For each data set we estimated parameters using the nearest neighbor (SL Grid ML ) and the SL SVM ML approach. For T obs = 10 5 , we also estimated the parameters using the SL Lin ML approach. Due to time constraints, we only estimated parameters for the data with T obs ≤ 10 3 using the traditional synthetic likelihood approach.
Next we also created test data sets from different priors for T obs = 10 2 . Prior P 1 from Equation 5 can also be written as where Beta is a beta distribution with parameters α = 1 and β = 1. Similarly, we created a test set from and prior P 3 We will test if SL GRID MAP performs best when the correct prior is used in the estimation process. Last we also created a test set for T obs = 10 2 for an experimental set up with two conditions where r and σ are equal over the conditions.

Results
For the results, we will evaluate the methods on the following criteria: accuracy, speed, and coverage.
Accuracy To start off, we look at the recoveries for T obs = 10 3 for all 100 simulated data sets and the three methods (SL Orig ,SL Grid ML and SL SVM ML ). Scatter plots are shown in Figure 6. It can seen that the synthetic likelihood estimation leads to some clear outliers. One possible reason for the absence of outliers in the prepaid estimation is the fact that prepaid estimation from the start examines the whole grid and therefore has less problems with getting stuck in local optima. More generally, we plotted the accuracy of each of the methods as a function of time series length T obs in Figure 7. The left panel shows the root mean square error (RMSE), while the right panel shows the median absolute error (MAE). We decided to look at the MAE because the few outliers for SL Orig (which were shown Figure 6)   The largest attainable accuracy for the SL Grid ML prepaid approach is limited by the spacing of the prepaid grid. If we had created an equally spaced grid of T obs = 10 5 points using the prior in Equation 5, we would have the following gaps in each of the three parameter dimensions: We do not have an equally spaced grid, but it is expected that the quasi Monte Carlo distribution of points creates expected gaps close to the ones in Equation 16. Therefore, it is no coincidence that the best possible RMSE using the SL Grid ML prepaid approach has the same order of magnitude as the gap size ∆, as can be seen in Table 2 for the case of T obs = 10 5 . However, Table 2 also show that the SL SVM ML prepaid approach leads to a much lower RMSE. The difference between the SL Grid ML and the SL SVM ML prepaid approach for T obs = 10 5 is further visualized in Figure 8.
The results in Table 2 also show the need for a non-linear interpolator for the prepaid method. The RMSE of a linear regression interpolator (SL Lin ML ) is much larger than that of the SVM prepaid. In sum, we can conclude that the prepaid estimation methods lead to better, or at least similar, results as the traditional synthetic likelihood.
Speed The largest improvement of the prepaid method over synthetic likelihood is in computational speed: The prepaid method is many times faster than synthetic likelihood. Consider Figure 2 in the main text where it is shown that the SL Grid ML prepaid method is finished before a single iteration of the 30000 iterations are done by the SL Orig method. While the SL Grid ML and the SL SVM ML prepaid methods are finished in respectively 0.044 and 3.7 seconds, independent of the time series length T obs , the SL Orig method grows slower with an order of magnitude of T obs . In each SL Orig iteration one needs to simulate multiple time series with length T obs . The larger T obs , the slower the estimation. While the synthetic likelihood needs approximately one and a half hour to estimate the parameters for a time series with length T obs = 10 3 . The SL Grid ML prepaid estimation still finishes in 0.044 s, which is more than 10 5 times faster. The speed up factors are presented in Table 3 and as can be seen from Figure 7, there is not loss of accuracy. The speed up would reach millions, if we had the time to run the synthetic likelihood method for longer time series.
Coverage Next, we look at the coverage rates of the 95% confidence intervals as obtained with the bootstrap in combination with the prepaid method. To estimate a 95% confidence interval of the estimates for the prepaid method, a parametric bootstrap with B = 1000 bootstrap samples was used.
For the prepaid version the estimate for the observed data set was obtained using the SL SVM ML approach and the bootstrap estimates were commonly obtained using the SL Grid ML prepaid method applied to the bootstrap data sets. However, if in the first 100 bootstraps only half of the nearest neighbors where unique points, the bootstrap distribution could be considered questionable. This behavior is to be expected for larger sample sizes T obs , because the true bootstrap distribution is very peaked so that every bootstrap sample will have the same nearest neighbor grid point. When this occurs, we would estimate the parameters of each bootstrap using differential evolution, using the SVM created by the original 100 nearest neighbors.
Alternatively, for the synthetic likelihood approach (using MCMC) we computed the 95% confidence interval by calculating the 0.025 and 0.975 quantiles of the last half of the posterior samples.
The coverage results for the test set of 100 parameters are shown for three different values of T obs in Table 4. It can be seen that for both methods, the coverage is close to the nominal level of 95%, but the coverage of the prepaid method is slightly better.
Prior In this paragraph we show how we can benefit from using the correct prior. We estimate the parameters of the three testsets for T obs = 100, created with uniform prior P 1 from Equation 9 and beta distribution priors P 2 and P 3 from Equations 14 and 15. We estimated all three data sets using maximum a posteriori estimation SL GRID MAP using all three priors. The results are shown in Table 6. Using the correct prior leads, as expected, to the best results.
Parameter constraints across conditions We estimated the parameters for a two condition experimental set up with equal r and σ, with and without the prior from Equation 12 (parameter σ prior was tuned on 100 similar simulated data sets). The results are shown in Table 7. Using the prior from Equation 12, which implements the parameter constraints of the experimental set up, leads, as expected, to better results for each parameter. Even for φ, which is absent in the prior, we find better results. Table 3: Average time in seconds needed for the SL Orig estimation for multiple T obs and the speed up for the SL Grid ML and SL SVM ML methods. The time for T obs = 10 4 and T obs = 10 5 was not measured, so these values are estimated and between brackets. (Figure 7 shows the corresponding accuracies.)   Real life data set The results for the estimation of the population dynamics of the Chilo partellus [32,26], using the prior from Equation 7 can be found in Table 5. For the prepaid, we estimated the parameters using the methods online at www.prepaidestimation.org. All estimations are similar and have overlapping confidence intervals. The prepaid estimation is however significantly faster.

Application 2: A stochastic model of community dynamics
A second model we will apply our prepaid modeling technique to, is a stochastic dispersal-limited trait-based model of community dynamics [14]. The data that will be modeled, are the abundances of species (hence a vector of frequencies, in which each component is a different species). Each species in the local environment is assumed to have a competitive value dependent on its trait u, given by the filtering function Here A is the maximal competitive advantage, h is the optimal trait value in the local environment and σ describes the width of the filtering function. At each time step, one individual from the local community dies. It is then replaced with a probability 1 − I I+J+1 by a random descendant from the local pool. Here, J is the size of the local community and I is the fourth parameter to estimate, related to the amount of immigration from the regional pool into the local community. The probability that this descendant comes from a certain individual in the local community, is proportional to the competitiveness of this individual.
With a probability of I I+J+1 , the dead individual is replaced by an immigrant from the regional pool. The distribution of traits u of the individuals in the regional pool is assumed to be uniform over u. It is noteworthy that Jabot saw the necessity of reusing ABC simulations to reduce computation time in his recovery study [14].
The model was simulated using the C++ code from the Easy ABC package [16] where a regional pool of S = 1000 species was defined evenly spaced on the trait axis (i.e., the resolution) and J = 500 was the size of the local community.

ABC estimation
We compare our prepaid method estimation with the Easy ABC package (ABC Orig ) [15,16]. Because we work in a Bayesian framework, we first have to specify priors. As in Jabot et al. we use the following priors [16]: (25)) . (18) In this application, the parameter vector θ is defined as follows: θ = (log(I), log(A), h, log(σ)). To get the ABC algorithm to work, we compute four summary statistics: the richness of the community (number of living species), Shannon's index which measures the entropy of the community, and the mean and the skewness of the trait distribution of the community.
The ABC algorithm we use applies a sequential parameter sampling scheme [3]. The sequence of tolerance bounds is given by ρ = {8, 5, 3, 1, 0.5, 0.2, 0.1} and the algorithm proceeds to the next tolerance after 200 simulations which lead to summary statistics within the bounds. The last 200 simulations within the bounds represent the posterior, and the estimate of the parameter is given by the posterior mean.

Creation of the prepaid grid
For the prepaid estimation, we used exactly the same summary statistics as the Easy ABC package. We filled the prepaid grid with 500, 000 parameter vectors using the priors of Equation 18, but for most examples we will use a grid with only 100, 000 parameter vectors. To cover this grid as evenly as possible, the uniform distribution was approximated using Halton sequences [18,17] (in order to avoid gaps that may appear when Monte Carlo samples are used). The creation of the prepaid grid with 100, 000 parameter vectors took approximately 3 days on a 3.4GHz 20-core processor.
For the community dynamics models from Equations 17 and 18, there are several ways to simulate an almost infinitely large data set to achieve stable summary statistics. The first way is to increase the number of species S and the size of the local pool J. Unfortunately some summary statistics (the richness and the entropy) are in some unknown way dependent on these parameters. As a result, the summary statistics of a simulation with J = 5000 cannot be used to estimate the parameters for a setting where J = 500. Therefore, we chose to fix the size of the local pool J and the number of species S. It is very well possible that there are summary statistics which do not have this problem, making the prepaid grid much more universal. We chose however, for the sake of comparison with the easy ABC package to keep using these parameters.
A second way to simulate data with a very large sample size is by increasing the number of time steps.
By estimating the summary statistics after each time step, when one individual from the local community dies and is replaced by another individual, we create a time series of summary statistics. Averaging the summary statistics over a sufficient large number of time points will lead to stable average values of these summary statistics. In our simulations, we applied some tinning by calculating the summary statistics every time after 500 species have died (the size of the community). The reasons is that there is not enough of variation in the summary statistics computed after the death of a single species. Next, we created time series of length T = 100, 000 (5 · 10 7 species will have been replaced) for the prepaid grid and used the average of these summary statistics asμ θ . Using this time series we also computedΣ θ,T prepaid for T prepaid = {1, 10, 1000, 10000}. T prepaid = 1 is of course the setting for which the original trait model is described and for which the Easy ABC algorithm is tested. Additionally we also saved 1000 samples of time series of length T prepaid = {1, 10, 1000, 10000}.

Prepaid estimation
Contrary to the first application (the Ricker model), where we used a frequentist approach, for this community dynamics model we will follow a Bayesian approach. In Bayesian statistics, the focus is on the posterior distribution of the parameters p(θ|data), which is defined as follows: where p(data|θ) is the likelihood and p(θ) the prior. As the likelihood, we will use the synthetic likelihood We have studied three variants of a Bayesian version of the prepaid method. These three versions will be discussed here in increasing order of complexity. We will denote the three variants as follows: SL Grid PM , ABC Grid PM , and ABC SVM PM .

PM
Because the priors are all uniform (and our prepaid grid is distributed following this prior), the posterior for a data set with summary statistic s at parameter θ p of the prepaid grid is proportional to where L PP s (θ) is the prepaid synthetic likelihood (i.e., with the mean statistics computed for a very large sample and a approximate covariance matrix given by Equation 6). The posterior mean (PM) using prepaid synthetic likelihood can be estimated as:θ ABC Grid

PM
The prepaid synthetic likelihood approach works best if the assumption of normally distributed summary statistics is not too far off. However, as can be seen in Figure 9, this is not always the case for the trait model defined in Equation 17. Therefore, as an alternative procedure, we propose an Approximate Bayesian Computation (ABC) approach. First, we select a subset of nearest neighbors S from the prepaid set, such that for every θ q ∈ S, the synthetic likelihood value L s (θ q ) is highest and so that where the sum in the denominator runs across all grid points. In a sense, these are all the prepaid points in the 99.9% expected coverage according to the posterior of Equation 20. We denote the cardinality of S as Q.
In a next step, we basically perform ABC with all the grid points belonging to the selected subset S.
However, there is an important issue we cannot overlook. When doing ABC, for a given parameter vector new data are simulated of the same size as the observed data. Unfortunately, our prepaid grid has correspondingly only very large data sets. To rectify this problem, so that ABC can applied without problems, we simulated during the construction of the prepaid grid, a set of M = 1000 prepaid samples for several designated sample sizes (i.e., T prepaid = {1, 10, 1000, 10000}). Let us denote with s q,i,T prepaid the vector of statistics for prepaid grid point q, the ith simulation (with i = 1, . . . , M ) and sample size T prepaid . Now, we can apply ABC to arrive at the posterior for θ; the method will be denoted as ABC Grid PM . For now we will assume that T obs is equal to one of the T prepaid lenghts. We select the 1000 samples from this 31 Q × 1000 samples set that have the smallest Mahalonobis distance to the observed set of statistics s obs : here W Q is given by the covariance over all grid points in S and over all 1000 replications (thus, Q × 1000).
The finally selected 1000 samples are then considered as a sample from the posterior. Note that the ABC Grid PM method does not require us to progressively strengthen the tolerances, as in traditional ABC Orig (governed by the tolerance parameter ρ). If the observed sample size T obs is not equal to one of the T prepaid lengths, then one can use the samples for length T prepaid which is closest to T obs in logaritmic scale and later adjust the posterior samples such that the posterior mean stays the same, but the posterior covariance matrix changes toΣ We advise to save samples for enough different T prepaid such that this correction is only marginal.

ABC SVM PM
The ABC Grid PM is only based on the raw prepaid grid points. But again, a more accurate estimation can be found by interpolating between the parameters in the prepaid grid. Therefore, we learn the relation between the parameters and the summary statistics using LS-SVM:f svm : θ → s. We only learn this relation in the region of interest, that is, only the 100 nearest neighbors according to the ABC Grid PM approach or more specifically, the 100 prepaid points for which the most samples lead to a small enough ǫ q,i,TABC .
Before we use machine learning to infer the relationf svm : θ → s we cluster these 100 nearest neighbors using hierarchical clustering such that no cluster has more than 50 prepaid points. This is necessary as these 100 nearest neighbors may come from totally different areas in the prepaid grid. This is illustrated in Figure   10.
For each cluster, we first make sure that at least 20 points are included (if not, we add points from the prepaid grid which are closest). Then we estimate thef svm : θ → s using LS-SVM for each cluster c separately, giving rise tof svm,c . Next, we find the minimum volume ellipse encompassing all the points in each cluster. These ellipses inform us about the areas for which the relation holds. Subsequently we resample parameters in each ellipse to zoom in more and more to the regions of interests. In detail, we do the following in every cluster c: 1. Uniformly sample 1000 points θ j,c in the minimum volume ellipse for cluster c. We create a finer grid for each elipse. 7. Go back to step 1, until the worst ǫ j,i does not decrease any more.
Broadly speaking, in step 1, we sample parameters θ j,c , in step 2 to 4 we approximate the summary statistics distribution for each θ j,c using LS-SVM and in step 5 to 7 we trim this set of parameters to only keep the parameters with a high posterior probability.
In the end we combine all the samples, we build the posterior with the parameters from the 1000 best samples over all clusters according to Equation 23. Note that some parameters may show up several times in this posterior sample. To compute the posterior mean, we use a weighted version of these samples. The weights are given by the volume of the ellipse from the cluster where they were created. This is necessary to ensure the correct use of the equal prior for all clusters.

Test set
To generate the test set, we follow the same logic as in [14]. We use the prior in Equation 18 to generate 1000 random parameter sets, except for h, where we changed the prior with the following generating distribution: such that 0 and 100 are the true minimum and maximum optimal trait values for communities. By taking the prior for h as in Equation 18, we avoid boundary effects. To exclude other problems at the borders of the parameter space, we deleted parameters which where within 1% range of the bounds. We simulated data sets for both T obs = 1 and T obs = 1000.

Results
Accuracy Let us first look at the results for T obs = 1. We have used traditional ABC (ABC Orig ), prepaid Bayes approach based on the synthetic likelihood (SL Grid PM ) and prepaid ABC based on separately generated samples at the grid points (ABC Grid PM and ABC SVM PM ). We have used 10 5 and 5 · 10 5 prepaid grid points. The RMSE and MAE can be found in Tables 1 and 8. All methods result in accuracies that are equally large. For 3 out of 4 parameters (except for h), the prepaid method outperforms ABC Orig with respect to RMSE. For MAE, the prepaid method uniformly outperforms the Easy ABC package (ABC Orig ). Overal, the difference between Ω = 10 5 and Ω = 5 · 10 5 prepaid grid point is very small for the prepaid methods.
We have refrained from interpolating with the LS-SVM because the 99.9% coverage includes on average more than 1000 points. This is perfectly logical because T obs = 1 does not provide a lot of information, and, as a consequence, there is a lot of uncertainty (which translates itself into a large number of parameter points that have a reasonable large synthetic likelihood value). As a result, creating a posterior based on only 100 nearest neighbors (even after interpolation) would not suffice because too many parameter points with high posterior density would be missed.
For T obs = 1000 (see again Tables 1 and 8), the accuracy increases, as is expected (this can be seen both in the RMSE as in the MAE). In this case, both increasing the number of grid points Ω and using LS-SVM interpolation increases accuracy. No results are given for ABC Orig , because it is impossible to fit the model with this sample size in acceptable time.
Speed For T obs = 1, the estimation time of ABC Orig is 3865 s. In contrast, the estimation time of ABC Grid PM is 0.167 s. This means that the prepaid ABC method is approximately 23000 times faster than traditional ABC.
Coverage For both the ABC Orig as well as the prepaid versions we end up with a posterior sample. We computed the coverage by calculating the 0.025 and 0.975 quantiles of the posterior samples. Next, we checked whether the true parameter was in this interval or not. Note that when we use clustering during ABC SVM PM , we weigh each point proportional to the volume of its originating cluster. For the SL Grid PM approach we use the whole prepaid set as posterior and us weights according to Equation 20.
For T obs = 1 and T obs = 1000, coverage results can be found in Table 9. For T obs = 1, ABC Orig leads to better coverages than SL Grid PM . Also the ABC Grid PM method gives good coverages (around the nominal level of 0.95) for T obs = 1, but these coverages deteriorate for T obs = 1000 if no interpolation is used (coverage is a bit better for 5 · 10 5 grid points). When the LS-SVM interpolation is applied (i.e., ABC SVM PM ), coverages become very good again, certainly for the largest number of grid points.

Application 3: The Leaky Competing Accumulator
Elementary decision making has been studied intensively in humans and animals [13]. A common example of an experimental paradigm is the random-motion dot task: the participant has to decide whether a collection of dots (of which only a fraction moves coherently; the others move randomly) is moving to the left or to the right. The stimuli typically have varying levels of difficulty, determined by the fraction of dots moving coherently.
Assuming there are two response options (e.g., left and right), the Leaky Competing Accumulator consists of two evidence accumulators, x 1 (t) and x 2 (t) (where t denotes the time), each associated with one response option. The evolution of evidence across time for a single trial is then described by the following system of two stochastic differential equations: where dW 1 and dW 2 are uncorrelated white noise processes. To avoid negative values, the evidence is set to 0 whenever it becomes negative: Equation 26 describes the evolution of information accumulation for a two-option choice RT task, given the presentation of a single stimulus. For all stimuli, the total evidence is equal to v, but the differential evidence for option 1 compared to 2 is 2∆v i , which is stimulus dependent and reflects the stimulus difficulty.
In this example, we assume the stimuli can be categorized into four levels of difficulty, hence i = 1, . . . , 4.
The model gives rise to two separate choice response time probability densities, p 1i (t) and p 2i (t), each representing the response time conditional on the choice that was made. Integrating the densities over time will result in the probability of choosing the response options:  This parametrization is known to have one redundant parameter [21], so we choose to fix c = 0.1.

The re-parametrization
The prepaid method will not be applied to the model as presented in Equation 26, but rather on a reparametrized formulation: again with the additional restriction that x 1it = max(x 1it , 0) and x 2it = max(x 2it , 0). The new parameters are defined as follows in terms of the original ones: This new parametrization has the advantage that D can be interpreted as an inverse time scalar because doubling D makes all choice response times twice as fast. This property will allow us to reduce the dimensionality of the prepaid grid (see below). The parameter v ′ > 0 denotes general stimulus strength scaled according to D, while parameter C i (for coherence) denotes the amount of relative evidence encoded in the stimulus i: −1 < C i < 1. It is commonly assumed for these evidence accumulator models that different stimuli should lead to different coherences C i , but without affecting the other parameters. The nondecision time T er is not transformed.

Creation of the prepaid grid
For the delineation of the parameter space, we will follow the specifications of [21]. Because this parameter space is rather restrictive (a consequence of the recommendation of [21] to improve parameter recovery), we will extend it through the use of a time scale parameter. This extension will be further discussed when introducing the test set.
First, we create a prepaid grid on a four-dimensional space in the original parametrization by drawing from the following distribution: We select 10000 grid points from this distribution using Halton sequences [18,17]. When working in the reparametrized version, as defined in Equation 27, this space can be transformed to a four dimensional space of v ′ , γ ′ , κ ′ and D.
However, because D acts an inverse time scalar on the response time distributions, we may also consider the three dimensional space formed by v ′ , γ ′ , and κ ′ and for each grid point, choose the parameter D in such a way that the RT distributions for options 1 and 2 are scaled to fit nicely between 0 and 3 seconds (with a resolution of 1ms and 3000 time points so that about 0.0001 of the tail mass is allowed to be clipped at 3 seconds when C = 0). Effectively, this brings all RT distributions to the same scale (denoted as s = 1).
This process of scaling is illustrated in Figure 11. It reduces both the number of simulations and the storage load (without it we would have to simulate and store a separate set of distributions for each value of D).
Note that the scaling is done jointly for all RT distributions associated with a particular g. The resulting diffusion constant corresponding to the rescaled distribution is denoted as D g 0 . In addition, the construction effectively removes one parameter from the prepaid grid, which is illustrated in Figure 12.
To include the coherence parameter, we extend each grid point with a set of predefined coherences. For each point g = (v ′ , γ ′ , κ ′ ) in the grid, we take 50 equally spaced coherences C g k (with k = 1, . . . , 50) from 0 to the maximum coherence that still has some non-zero chance of choice option 2 to be selected (we take 0.001). Finally, we simulate for each combination of g = (v ′ , γ ′ , κ ′ ) and C g k a large number of choice response time data (choices and response times). This is illustrated in Figure 11.
In a last step, grid points are eliminated from the prepaid grid, if the simulations result in too many simultaneous arrivals (i.e., trajectories that end at or very close to the intersection point of the two absorbing boundaries at the upper right corner, located at (a, a)). More specifically, we drop grid points with more than 0.1 percent simultaneous arrivals. Creating the prepaid database took less then a day on a NVIDIA GeForce GTX 780 GPU.

Prepaid estimation
To explain how the prepaid estimation of the LCA works, let us start with a prototypical experimental design. Assume a choice RT experiment with four stimulus difficulty levels (e.g., four coherences in the random dot motion task). Each difficulty level is administered N times to a single participant. A particular trial in this experiment results in (c ij , t ij ), where i is the stimulus difficulty level (i = 1, . . . , 4) and j is the sequence number within its difficulty level (j = 1, . . . , N ). The data resulting from this experiment are responses c ij (referring to choice 1 or choice 2) and response times t ij . Each pair (c ij , t ij ) is considered to originate from an unknown parameter set (v ′ , γ ′ , κ ′ , D, T er ) and coherences C i (i = 1, . . . , 4).
Our first aim is to is to establish a local net of prepaid points that lead to data that are close to the observed dataset. If necessary, we can further zoom in with the help of support vector machines. Conditional on each prepaid parameter set g in the basic grid, a number of the remaining parameters can be integrated out beforehand. First, conditional on grid point g, we have for 50 predetermined coherences C g k simulated accuracies and response time distributions (see Figure 11). The coherences of the observed data can be estimated solely using the observed accuracies using simple linear interpolation. The estimated coherence for stimulus (or condition) i is denoted asĈ i . Corresponding to each of the 50 coherences C g k for grid point g, there is a pair of corresponding simulated RT densities p g ic (t) (with c = 1, 2). As before, p g ic (t) is scaled to the [0, 3] seconds window, and we can use a combination of translating (estimatingT er ), scaling (estimatinĝ D) and interpolating. Specifically, we first calculateŝ as the optimal time scalar to match data with the model on grid point g:ŝ This formula capitalizes on the fact that the variance of a distribution does not change when it is simply shifted to the right by a constant. Hence, the ratio of the model's decision time variance (without T e r) and the observed total response time variance (presumably shifted with some T e r) is still an estimator of the squared scale factor between them. Using this information, we can estimate the optimalD andT er for grid point g as follows:D with D g 0 being the optimal scaling diffusion constant used for optimal storage in the database. This gives us a final effective parameter vector of v ′ , γ ′ , κ ′ ,Ĉ 1 ,Ĉ 2 ,Ĉ 3 ,Ĉ 4 ,D, T er . Note that the last 6 elements of this vector are estimates conditional on the grid point g = (v ′ , γ ′ , κ ′ ).
Next, we have to determine the single optimal parameter set (and thus also the optimal v ′ , γ ′ , and κ ′ ).
For this we need an objective function that compares the model based PDFs with those of the data. For this purpose, we use a (symmetrized) chi-square distance based on a set of bin statistics. For each stimulus' observed set of choice RTs, t i = (t i1 , t i2 ) (with t i1 the RTs for option 1 and t i2 for option 2), we calculate 20 data quantiles q u (with u = 1, . . . , 20) at probability masses m i = 0.05 · i. The set of quantiles is appended with one extra quantile q 0 at m 0 = 0.01 to have a more detailed representation of the leading edge of the distribution. Based on binning edges (0, q 0 , q 1 , . . . , q 20 , +∞), we create 4 × 2 × 22 bin frequenciesb icw with w = 1, . . . , 22. The corresponding probability masses m g icw can be easily extracted from the prepaid PDFs p g ic (t) as well. Observed and theoretical quantities can then be combined in the a symmetrized chi-square distance:

40
This defines a distance between all grid points g in the database and any data set.
In the following paragraphs we will present three ways of using this distance to calculate LCA estimates, each a bit more complicated than the previous one (but also more accurate): CHISQ Grid NN , CHISQ Grid BS , CHISQ SVM BS .
CHISQ Grid

NN
The grid point closest to the data set (as measured by the symmetrized chi-square distance function) can be used as a first nearest neighbor estimate.
CHISQ Grid BS Not all parameters are treated equally in the estimation procedure. The parameters C i , D and T er are estimated conditionally on all grid points g and then the other parameters are estimated conditionally onĈ i ,D andT er . Moreover, these parameters are chosen in such a way that a specific aspect of the data (e.g., proportion of choices for option 1) is fitted perfectly (i.e., the coherence is chosen to result in probabilities perfectly equal to the proportions observed in the data). This would be no problem for an infinite amount of data. However, for finite data, the major disadvantage of this way of working is that any errors induced in the precursor step are propagated through the estimation process for v ′ , γ ′ and κ ′ . This is because for finite data, the observed accuracies will typically not exactly coincide with the accuracies provided by the best model estimates. As the estimatesĈ i are (on each grid point) exactly fit to the observed accuracy and consequently, the effective grid points will all have this exact accuracy. We tackle this estimator bias by non-parametrically bootstrapping the data and repeating the nearest neighbor estimate for every bootstrapped dataset. Taking the mean of this set of estimates (a method known as bagging; [11]), gives us a more accurate estimate. Additionally, we now have a standard error of the estimate (and confidence interval).

CHISQ SVM BS
If we apply the bootstrap procedure, it may turn out that the selected grid points as nearest neighbor are not very diverse (this may happen with large sample sizes). In such a situation, it can be worthwhile to use an interpolator. So we may learn a support vector machine based on the bin statistics of the few unique bootstraps grid points available, together with the best overall unique grid points. We propose to use a training set of 100 grid points in total. The SVM can then be used as an approximation for the bin statistics in the space between the grid points and hence for the objective function. We subsequently minimize the approximative SVM based objective function for every bootstrap, using differential evolution (as has been outlined above for the other applications).
Obviously, the quality of the SVM based estimate is limited by the quality of the SVMs that are trained to 41 learn the relation between parameters and statistics. In addition, the same SVMs are used for all bootstrap samples, which may introduce an unwanted distortion in the uncertainty assessment. To account for the systemic bias that might have been introduced by the SVMs, we will add some random noise to each bootstrap estimate. The amount of random deviation that is added equals the size of the prediction error of the SVM.
In this way, low quality SVMs are prohibited of biasing all bootstraps in the same way. The uncertainty of the SVMs is now incorporated in the final bootstrapped results.

Test set
The test set is created by uniformly sampling parameters according to Equation 28. Input differences v1−v2 2 are chosen to produce typical accuracies of 0.6, 0.7, 0.8, and 0.9. As is done in [21], excessively long PDFs (with a maximum RT larger than 5000ms) and excessively short PDFs (with a range below 400ms) are removed from the test set. Apart from the fact that these PDFs are deemed unrealistic [21] for typical choice RT data, this part of the parameter space suffers from inherent poor parameter identifiability, with very large confidence intervals and less meaningful estimates as a consequence. Because the new parametrization analytically integrates out scale (i.e., D) (and also shift T er ), and is positively unbounded in these dimensions, we can expand the test set to cover a broader range of distributions than the ones covered in [21]. To broaden the range of the test, the distributions are scaled with a random factor ranging from 0.2 to 5. We will use this broadened test set to determine the method's accuracy and coverage.

Results
Accuracy The recoveries of the original LCA parameters are displayed in Figures 4, 13, and 14. It can be concluded that for all sample sizes, recovery is acceptable, but it improves a lot for larger sample sizes.
In all cases, the recovery is dramatically better than that reported in [21]. Figures 16 and 15 shows RMSE and MAE, respectively, as a function of sample size for three methods (for all parameters). It can be seen that accuracy improves for all parameters for the single best nearest neighbor and for the bootstrap method, until some point, after which it stabilizes or deteriorates. However, for the SVM based estimation, there is still considerable improvement for higher sample sizes.
Coverage Figure 17 shows the coverages for different numbers of observations. Nearest neighbor bootstrap coverage seems to be adequate for sample sizes up to 10000; for higher sample sizes SVMs are needed to ensure good coverage.