A New Calibrated Bayesian Internal Goodness-of-Fit Method: Sampled Posterior p-Values as Simple and General p-Values That Allow Double Use of the Data

Background: Recent approaches mixing frequentist principles with Bayesian inference propose internal goodness-of-fit (GOF) p-values that might be valuable for the critical analysis of Bayesian statistical models. However, the GOF p-values developed to date have known probability distributions only under restrictive conditions; as a result, no known GOF p-value has a known probability distribution for arbitrary discrepancy functions.

Methodology/Principal Findings: We show mathematically that a new GOF p-value, called the sampled posterior p-value (SPP), asymptotically has a uniform probability distribution whatever the discrepancy function. In a moderate finite-sample context, simulations also showed that the SPP appears stable to relatively uninformative misspecifications of the prior distribution.

Conclusions/Significance: These properties, together with its numerical simplicity, make the SPP a better canonical GOF p-value than existing GOF p-values.


Introduction
Statistical model criticism, which tests a fitted parametric statistical model against observed data, is valuable for gaining more confidence in statistical results [1][2][3][4][5]. Box [6] identified model criticism as one of the two main steps in statistical model development. Although many other terms have been used - model adequacy, model checking, model validation, model evaluation [3,5] - we will use the term goodness-of-fit to refer to this confrontation between statistical model and observed data. To date, the generally preferred method has been external goodness-of-fit, where the data used to assess the model are not those used to fit the model. The evaluation is performed either through data splitting or by comparing the model predictions against a completely different dataset [5]. External goodness-of-fit avoids using the data twice, and should result in more interpretable and less circular goodness-of-fit [7,8]. However, many researchers have proposed internal goodness-of-fit methods (see below), where predictions from the fitted model are compared with the very observations that were used to estimate the parameters of the model. One obvious advantage of internal goodness-of-fit (GOF) is that it allows fuller use of the data in model checking. We will therefore focus our attention on these methods, and more precisely on GOF p-values. The GOF p-values we use are Fisherian p-values, i.e. probabilities of ''seeing something [with the statistical model] as weird or weirder than you actually saw'' [9]. Fisherian p-values compare the model to the data, and therefore differ from Neyman-Pearson tests, which compare two models or hypotheses [9]. ''Weirdness'' is quantified using specific discrepancy functions, which are real-valued functions of the data and of the statistical model parameters.
Fisherian p-values are calculated as the quantile of the discrepancy function evaluated on the observed data, within the probability distribution of discrepancy functions of data and parameters randomly generated according to some given probabilistic scheme associated with the fitted statistical model. Let us assume that, when replicating over hypothetical datasets sampled from a probabilistic model, we know these p-values have a uniform distribution on [0,1] under assumption (A1): (A1) the likelihood in the statistical model - or inference model, used to analyze the data - is the same as the likelihood in the probabilistic model - or sampling model, used to generate the data. Then an extreme Fisherian p-value - i.e. a p-value very close to 0, or a p-value either very close to 0 or to 1, depending on the discrepancy function - is interpreted as contradicting (A1). The reader will find the mathematical formulation of these statements at the beginning of the Material & Methods section. When the statistical model is fitted with Bayesian methods, these GOF p-values clearly rely on both Bayesian and frequentist ideas: they are Bayesian because the statistical parameters come either from the prior or the posterior distribution, or modifications thereof, and they are frequentist because they embed the observed data within a set of unobserved datasets sampled from a probabilistic model. This is why such methods are called calibrated Bayesian [10]. Calibrated Bayesian GOF has progressively gained popularity over the last few decades, resulting in a number of more or less sophisticated techniques [6,[11][12][13][14][15][16][17][18]. Calibrated Bayesian GOF differs from classical purely Bayesian methods, which specify a family of alternative, more complex models and use Bayes Factors to indicate which family of models - the original or the alternative models - is the most likely [6,19]. Even though this purely Bayesian method does have some interesting features (e.g.
discussion in [13]), it cannot deal with the Fisherian view of model checking, i.e. testing whether the data are consistent with a given model, without the need for an alternative hypothesis [9,10,20]. What if both the original and the alternative models were inconsistent with the data? Huber [19] qualifies these purely Bayesian procedures as 'tentative overfitting', commenting that these Bayesian methods ''are based on the unwarranted presumption that by throwing in a few additional parameters one can obtain a perfectly fitting model. But how and where to insert those additional parameters often is far from obvious (...). Remember that Kepler rejected the epicyclic Ptolemaic/Copernican models because he could not obtain an adequate fit within that class.'' In turn, we note that emerging Bayesian GOF methods involve nonparametric alternatives [21][22][23], thus enriching the Bayesian GOF toolbox.
Given that frequentist statistics are believed to be more powerful than Bayesian statistics for model criticism [6,12], Little [10] viewed calibrated Bayesian p-values as an improvement over purely Bayesian p-values - and in this article we will indeed focus on calibrated Bayesian techniques. The Material and Methods section begins with a brief overview of what is known about frequentist and calibrated Bayesian GOF p-values under assumption (A1), according to three criteria:
- C1: asymptotically with respect to sample size, the probability distribution of the p-value when replicating over observed datasets should be known for a variety of discrepancy functions and priors;
- C2: under reasonable finite sample sizes, the probability distribution of the p-value when replicating over observed datasets should be close to a known reference distribution for a variety of discrepancy functions and priors;
- C3: the p-values should be numerically inexpensive and relatively easy to implement based on a Markov chain Monte Carlo or frequentist model fit [3,16].
Conditions (C1) and (C2) are required in order to use candidate GOF p-values as described above in the Fisherian perspective.
Having p-values that work for very different probability distributions and any discrepancy function has an obvious advantage: it provides users with assurance that they can use the method for different kinds of statistical models, and that they have sufficient flexibility to check the model [4,15,20,24]. Condition (C3) is motivated by time constraints in the application of such methods. As will be seen in the Material and Methods section - to which point we defer a precise definition of the p-values - some calibrated Bayesian and classical frequentist GOF p-values share the difficulty that their probability distribution is generally unknown, even asymptotically; this contradicts (C1) and makes it difficult to interpret the surprise resulting from a given p-value [14][15][16][17]. For this reason, posterior predictive p-values (p_pop) [4,13], which are possibly the most widely used in modern applied Bayesian settings, have come under challenge in the statistical literature [14,15,17]. Other calibrated Bayesian GOF p-values prove very computer-intensive - thus contradicting (C3). Finally, most of them do not apply to general discrepancy functions - thus contradicting (C1) and (C2). Three of the reviewed p-values - the prior predictive p-value (p_prp; [6]), the plug-in half-sample ML p-value (p_MLhs; [25]) and the normalized sampled posterior p-value (p_nsp), developed in [16,18] - meet these three criteria, provided that the prior and likelihood used in the data analysis are the same as those used to generate the data (for p_prp and p_nsp), and provided that the discrepancy function depends solely on normalized data (for p_nsp), on uniformized data (for p_MLhs), or on the data (for p_prp).
Normalized data are simple transformations of the observed data obtained by: (i) calculating uniformized data in [0,1], which are the values of the cumulative distribution function at the observed values - based on the probability distribution used in the statistical likelihood and on a suitable parameter value; and (ii) applying the inverse cumulative distribution function of the standard normal distribution to these uniformized data (cf. legend of Table 1 for a mathematical formulation).
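The two-step normalization can be sketched in code. The paper's simulations were run in R; the sketch below uses Python with SciPy purely for illustration, and the Exponential model and its parameter value are hypothetical choices, not taken from the paper.

```python
import numpy as np
from scipy import stats

def normalize_data(x, model_cdf):
    """Step (i): uniformize the observations through the model's CDF
    (values land in (0, 1)); step (ii): map them to the standard normal
    scale with the inverse CDF of N(0, 1)."""
    u = model_cdf(x)            # uniformized data
    return stats.norm.ppf(u)    # normalized data

# Illustration with a hypothetical Exponential model evaluated at a
# suitable (here: the true) parameter value.
rng = np.random.default_rng(0)
x = rng.exponential(scale=0.5, size=1000)
z = normalize_data(x, stats.expon(scale=0.5).cdf)
# When the model and parameter value are correct, z is standard normal.
```

If the likelihood or parameter value is misspecified, z departs from normality, which is what discrepancy functions on normalized data detect.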
The mathematical results we know for p_MLhs are limited to uniformized data. We also know that, in general, p_prp strongly depends on the prior chosen in the data analysis, which is not the case for p_MLhs. But is this also the case for p_nsp? Indeed, what happens to p_nsp when the prior used in the data analysis does not correspond to the prior used in data generation? Also, what happens when discrepancy functions are more general, i.e. dependent on statistical parameters or on unnormalized data - which leads to the more general sampled posterior p-value (p_sp)? And do p_sp and p_nsp apply to discrete data? Finally, are p_sp and p_nsp more powerful than p_pop for detecting discrepancies between the data and the statistical model in situations where the likelihood in the statistical model is not the same as the likelihood in the probabilistic model? And how do p_sp and p_MLhs compare in such situations?

[Table 1 legend, displaced by extraction: y denotes a vector of length n of mutually independent uniform random numbers, independent of all the other random variables considered. F_0 denotes the cumulative distribution function of the standard normal distribution, and F(.|θ, y) denotes the cumulative distribution function F(.|θ) of the model density f(.|θ) - or a randomized version of it when X is discrete, where g is a small positive number such that X_i - g remains larger than the discrete value closest below X_i. Normalized data are defined from these quantities. The t functions considered were: mean, variance, and - only in the case of unnormalized data - p_0(X) = X, where (Y*_i) denotes the ascending ordered version of Y. Z_a is obtained as an integral of Z_t dw(t), with the likelihood ratio statistic as Z_t and an adequate weight function w(t) [37]. Centered mean and variance are the empirical mean and variance minus the mean and variance expected under θ. doi:10.1371/journal.pone.0014770.t001]
In the second part of the paper, we study the promising p-values p_sp and p_nsp both mathematically and through simulations. Our main results are that: (i) p_sp meets criterion (C1); (ii) provided the prior distribution in the statistical analysis is equally or less informative than the prior in the probabilistic model, simulations on simple models indicate that p_sp has an approximately uniform distribution and fulfils criterion (C2) with sample sizes from several dozen to several hundred; and (iii) based on a specific example, p_sp and p_nsp are shown to be more powerful than p_pop and as powerful as p_MLhs.
This yields an easier way of calculating GOF p-values than the methods proposed in [7,14,17,26]. In the last part of the paper, we discuss the benefits and drawbacks of this new p-value. On the basis of this discussion, p_sp, p_nsp and p_MLhs appear preferable to p_pop and the other p-values.

Review of published results
For the sake of simplicity, this section will concentrate only on the mathematical setting for continuous observations. The case of discrete valued observations will be dealt with in the next section.
Suppose that we have observed a realization x_obs of a random variable X, X ∈ R^n. We propose a parametric probability family model, f(X|θ), θ ∈ Θ ⊂ R^p, for the density of X given θ, and a prior probability distribution π(θ) for θ. Although some of the results in this paper might also extend to cases where the prior is improper and the posterior is proper, we will assume (A2) throughout, i.e.: (A2) the prior distribution is proper. This paper investigates the fit of the above statistical model to the observed data x_obs. We do so by comparing the distribution of a given discrepancy function d(X, θ) - where X and θ are simulated in some way from the statistical model - with the value involving the observed data, d(x_obs, θ), using the Fisherian p-value

p_{m,d}(x_obs) = Pr_{(X,θ) ~ m}[ d(X, θ) ≥ d(x_obs, θ) ]

as a measure of compatibility, where m(.) = m(X, θ) is a reference probability density for (X, θ) that depends on the statistical model. Each GOF p-value is defined by a reference density m and a discrepancy function d [15,20]. When the discrepancy function d does not depend on θ, Robins et al. [15] propose to switch terms and call d a test statistic function.
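In code, any such p-value reduces to a Monte Carlo tail probability under the chosen reference density m. Below is a minimal Python sketch (the paper's own software was R); the `sample_m` interface, which draws one (X, θ) pair from m, and the toy reference density are our illustrative assumptions.

```python
import numpy as np

def fisherian_p_value(x_obs, sample_m, d, n_rep=5000, rng=None):
    """Monte Carlo estimate of
    p_{m,d}(x_obs) = Pr_{(X, theta) ~ m}[ d(X, theta) >= d(x_obs, theta) ]."""
    rng = rng or np.random.default_rng()
    hits = 0
    for _ in range(n_rep):
        x_rep, theta = sample_m(rng)            # one draw from m
        if d(x_rep, theta) >= d(x_obs, theta):  # 'as weird or weirder'
            hits += 1
    return hits / n_rep

# Toy reference density: theta fixed at 0, X a vector of 30 N(0, 1)
# observations (a plug-in-style m with a known parameter).
def sample_m(rng, n=30):
    return rng.normal(0.0, 1.0, size=n), 0.0

p = fisherian_p_value(np.zeros(30), sample_m,
                      lambda x, th: x.mean(),
                      rng=np.random.default_rng(0))
# A perfectly central observation gives a p-value near 0.5.
```

Each concrete p-value discussed below (plug-in, prior predictive, posterior predictive, sampled posterior) is obtained by swapping in a different `sample_m`.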
Our setting has so far been purely Bayesian. The frequentist part of the setting is defined by a probabilistic model for the random sampling of the data, x ~ m_0, according to a given density based on the parametric probability family model, f_0(.|θ), and on a prior probability distribution π_0(θ) - which can be a Dirac, or point-mass, distribution. Following many authors [14][15][16][17],27], we require that, under (A1) (i.e. f = f_0), the probability distribution of [p_{m,d}(x_obs)]_{x_obs ~ m_0} be known at least asymptotically - i.e. when the size n of x_obs tends to infinity - and, more precisely, that this distribution be the uniform distribution on [0,1]. Such GOF p-values will hereafter be called asymptotically uniform. The classical p-values proposed in the literature meet criterion (C3). They correspond to the following reference densities:
- the plug-in ML density: m_ML(X, θ|x_obs) = f(X|θ) δ_θ̂(θ), where δ_θ̂ is the Dirac function at θ̂, the Maximum Likelihood Estimator (MLE) of θ given x_obs and the likelihood f. Even though values other than the MLE can be used for θ in a plug-in p-value (cf. [16]), this reference density is used at least implicitly in many frequentist diagnostic tools (cf. graphical tools in [1,2]);
- the prior predictive density: m_prp(X, θ) = f(X|θ) π(θ) [6];
- the posterior predictive density: m_pop(X, θ|x_obs) = f(X|θ) P_pop(θ|x_obs), where P_pop(.|x_obs) denotes the posterior distribution of θ [12,13].
This paper will not further investigate the prior predictive p-value - dubbed p_prp - because of its strong dependence on the statistical prior π, in contradiction with (C2) [10,14] (also see Text S6). With π_0 = δ_{θ_0} for some fixed θ_0, and under the general assumption that the function d is a function of X alone that has a normal limiting distribution, Robins et al. [15] showed that the plug-in ML and posterior predictive p-values - respectively dubbed p_ML and p_pop - are asymptotically uniform when the asymptotic mean of d(X) does not depend on θ. If the asymptotic mean of d(X) depends on θ, then, as shown by Robins et al. [15], p_ML and p_pop are generally not asymptotically uniform: more precisely, they are conservative p-values, which means that the probability of extreme values is lower than the nominal probabilities from the uniform distribution. These p-values therefore fulfill criterion (C1) only if we greatly restrict the discrepancy functions considered.
This has led to the development of other p-values associated with less classical densities m, among which:
- the post-processing method of the posterior predictive p-value that renders it a uniform p-value [17];
- the partial posterior predictive density, in which the posterior is computed after removing the contribution of the test statistic from the likelihood;
- what we hereby term the sampled posterior p-value (p_sp), developed in [16,18], based on m_sp(X, θ|x_obs) = f(X|θ) δ_θ̃(θ), where θ̃ is a unique value of θ, randomly sampled from the posterior distribution P_pop(.|x_obs).
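In practice, p_sp amounts to drawing a single θ̃ from the posterior and simulating all replicates from f(.|θ̃). A Python sketch for a conjugate Poisson-Gamma model - the model, hyperparameters and discrepancy function here are illustrative assumptions, since the paper defines p_sp generically:

```python
import numpy as np

def sampled_posterior_p_value(x_obs, a0, b0, d, n_rep=5000, rng=None):
    """p_sp sketch for a Poisson model with a conjugate Gamma(a0, b0)
    prior (rate parameterization).  A single theta_tilde is drawn from
    the posterior; replicates are then drawn from f(.|theta_tilde)."""
    rng = rng or np.random.default_rng()
    x_obs = np.asarray(x_obs)
    n = len(x_obs)
    # Conjugate posterior: Gamma(a0 + sum(x), rate b0 + n)
    theta_tilde = rng.gamma(a0 + x_obs.sum(), 1.0 / (b0 + n))
    d_obs = d(x_obs, theta_tilde)
    d_rep = np.array([d(rng.poisson(theta_tilde, size=n), theta_tilde)
                      for _ in range(n_rep)])
    return float(np.mean(d_rep >= d_obs))

rng = np.random.default_rng(1)
x_obs = rng.poisson(2.0, size=100)
p = sampled_posterior_p_value(x_obs, a0=2.0, b0=1.0,
                              d=lambda x, th: x.var(), rng=rng)
```

Note the contrast with p_pop, which would redraw a new θ from the posterior for every replicate rather than conditioning on one θ̃.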
With π_0 = δ_{θ_0} for some fixed θ_0, it has been proved mathematically, under certain assumptions, that the partial posterior predictive and conditional predictive p-values are asymptotically uniform whatever the test statistic function and the prior distribution [15], thus fulfilling criterion (C1), with the restriction that the discrepancy functions be test statistics. However, because of criterion (C3), we will consider neither the partial posterior predictive density, nor the conditional predictive p-value [15,16], nor the post-processing method in [17] in this paper. Durbin [28] showed that the plug-in half-sample ML p-value p_MLhs is asymptotically uniform provided it is used on uniformized data and with specific test statistic functions. This p-value has seldom been adopted, although Stephens [25] stressed its usefulness.
Johnson [16] proved that, for a specific discrepancy measure, the sampled posterior p-value is also asymptotically uniform. More recently, Johnson [18] showed that if: (i) the statistical model - including the prior π - is the same as the probabilistic model - including the prior π_0 - from which the data were sampled; and (ii) the discrepancy function is pivotal, i.e. its cumulative distribution function given θ does not depend on θ; then p_sp is not only asymptotically uniform, but is uniform whatever the sample size. Normalized sampled posterior p-values (p_nsp), which use test statistics on normalized transformations of X, possess this property. These p-values thus fulfill criteria (C1) and (C2), but with restrictions on the discrepancy functions and on the prior distribution, as π must be equal to π_0.

Simulation setting
What do we know about p_sp with more general discrepancy functions? We will show in the Results section that, for any discrepancy function, p_sp is uniform for π = π_0 and asymptotically uniform for π ≠ π_0, including for discrete-valued discrepancy functions - which arise from the discrete nature of either the random variables X or the discrepancy function itself. For such cases we will consider the following modified p-value:

p_msp,d(x_obs, e) = Pr[ d(X, θ̃) > d(x_obs, θ̃) ] + e . Pr[ d(X, θ̃) = d(x_obs, θ̃) ],

where e is drawn from a uniform distribution on [0,1], independently of the other random variables. Based on the mathematical results to come, p_sp appears a promising p-value that applies widely in terms of discrepancy functions and - asymptotically - in terms of prior distributions. However, these results no longer hold when the land of asymptotia is obviously not reached, as can be the case in hierarchical models or in models that fit parameters with a limited number of observations (see, for instance, the last model in the Poisson example in [16]). Furthermore, when the sample size is moderate and the statistical prior does not correspond to the data-generation prior, we have no clear information on how close p_sp is to being uniform. We therefore used simulations to study how p_sp behaves in a finite sample context under four scenarios.
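The modification above is the usual randomization device for discrete-valued statistics: the tie mass Pr(d_rep = d_obs) is spread uniformly. A short Python sketch of this formula (our reconstruction of the displayed equation, not the paper's code):

```python
import numpy as np

def randomized_p_value(d_rep, d_obs, e):
    """Randomized p-value for discrete-valued discrepancies:
    Pr(d_rep > d_obs) + e * Pr(d_rep = d_obs), with e ~ Uniform(0, 1).
    Randomization makes the p-value continuous despite the ties."""
    d_rep = np.asarray(d_rep)
    return float(np.mean(d_rep > d_obs) + e * np.mean(d_rep == d_obs))

# With ties, the p-value is smoothed over the tie mass as e varies:
d_rep = np.array([0, 1, 1, 2, 3])
p_lo = randomized_p_value(d_rep, 1, e=0.0)   # strictly greater only: 2/5
p_hi = randomized_p_value(d_rep, 1, e=1.0)   # ties fully included: 4/5
```

For continuous discrepancies the tie term vanishes and the formula reduces to the ordinary p-value.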
Objectives and scenarios. Our first scenario was designed to illustrate the uniformity results of the Results section when f = f_0 and π = π_0, while the three other scenarios were conceived to study, in a finite sample context, the distance to uniformity of the empirical distribution of p_sp, [p_msp,d(x_obs, e)]_{x_obs ~ m_0, e ~ U(0,1)}, for different kinds of discrepancies between the probabilistic and statistical prior distributions: Scenario 1: Perfect fit between the probabilistic and statistical models. Here, the model that generated the data and the model used to fit the data were exactly the same - including the prior distribution.
Scenario 2: The statistical and probabilistic models differ only by the dispersion of their priors.
Scenario 3: The statistical and probabilistic models differ only by the centering and dispersion of their priors.
Scenario 4: The statistical and probabilistic models differ only by their priors, the probabilistic prior π_0 being a Dirac distribution. This setting is the same as in Scenario 1, except that the data were generated from fixed parameters chosen at the mean of their statistical prior under Scenario 1.
Finally, we compared p_sp with p_pop and p_MLhs under Scenario 4 and under a modification of Scenario 4 in which f ≠ f_0, to illustrate the conservativeness of p_pop and the potentially good properties of p_MLhs under Scenario 4, and to study the difference in power between the three p-values.
Models and methods. We dealt with these issues using the following parametric models, both for data generation and data analysis, which involved conjugate priors [4] (also see Table 2). For each dataset, a_0, b_0, θ_0 and s_0 were held fixed in data generation and data analysis but were allowed to differ between the two phases. As conjugate priors were used, the explicit formula for the posterior distribution was known [4] and was thus used under the R 2.2.1 software [29] to fit the Bayesian models to the data.
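With conjugate priors, fitting reduces to closed-form hyperparameter updates and no MCMC is needed. The paper implemented this in R; the generic textbook updates for two of the model families mentioned are sketched below in Python (the hyperparameter values are illustrative).

```python
import numpy as np

def poisson_gamma_posterior(x, a0, b0):
    """Gamma(a0, b0) prior (rate parameterization) with a Poisson
    likelihood yields a Gamma(a0 + sum(x), b0 + n) posterior."""
    x = np.asarray(x)
    return a0 + x.sum(), b0 + len(x)

def bernoulli_beta_posterior(x, a0, b0):
    """Beta(a0, b0) prior with a Bernoulli likelihood yields a
    Beta(a0 + #successes, b0 + #failures) posterior."""
    x = np.asarray(x)
    return a0 + x.sum(), b0 + len(x) - x.sum()

a_post, b_post = poisson_gamma_posterior([1, 2, 0, 3], a0=2.0, b0=1.0)
# a_post = 8.0, b_post = 5.0
```

Sampling the single θ̃ required by p_sp is then a one-line draw from the updated Gamma or Beta distribution.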
Under Scenario 1, the priors were as above, with some parameters held fixed and some parameters allowed to vary between datasets:
- for the Poisson model, constant mean and random index of dispersion of the Gamma prior: a_0/b_0 = θ_0 = exp(1) and (a_0/b_0²)/(a_0/b_0) = r_0 ~ Uniform(0, 2) + .05;
- for the Normal model, constant mean and random variance of the prior for 1/s²: a_0/b_0 = 1 and a_0/b_0² = r_0 ~ 10^(Uniform(0,1) − 1), and constant θ_0 = 0, s_0 = 1 in the prior for θ.

[Table 2: Summary of the models considered in simulations.]

We considered three kinds of discrepancy function d(X, θ): test statistics, test statistics on normalized data, and other discrepancy functions (cf. Table 1). Test statistics on normalized data were introduced because they define the pivotal quantities used by [18] to obtain results under the condition π = π_0.
The number of observations in each dataset, n, was a random number between 20 and 1,000: n ~ exp(3.45 · U(0,1) + 3) with probability 0.7 and n ~ U(250, 1,000) with probability 0.3. n was rounded to the nearest ten or - if the value was above 200 - to the nearest hundred. We used 5,000 sampled values of X to calculate the p-values. The programs were run either on a DELL Latitude D830 Intel Centrino T7250 or on a server with two dual-core Opteron 2.2 GHz processors and 3 Gb of RAM. One hundred thousand replicated datasets were studied under Scenarios 2 to 4, and 10,000 under Scenario 1. To illustrate the dependence of p_prp on the statistical prior distribution, we also calculated p_prp based on 3,000 datasets under Scenario 2.
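The sample-size scheme above can be transcribed directly (a Python sketch of the mixture and rounding rule; the paper used R):

```python
import numpy as np

def draw_sample_size(rng):
    """Mixture: with probability 0.7, n = exp(3.45*U + 3) (roughly 20
    to 630, skewed toward small n); with probability 0.3,
    n ~ Uniform(250, 1000).  Round to the nearest ten, or to the
    nearest hundred when the value exceeds 200."""
    if rng.random() < 0.7:
        n = np.exp(3.45 * rng.random() + 3.0)
    else:
        n = rng.uniform(250, 1000)
    if n > 200:
        return int(round(n / 100.0) * 100)
    return int(round(n / 10.0) * 10)

rng = np.random.default_rng(2)
sizes = [draw_sample_size(rng) for _ in range(1000)]
# sizes span roughly 20 to 1,000, on a rounded grid
```

The log-uniform branch concentrates most replicates at the small sample sizes where finite-sample departures from uniformity matter most.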
The p-value associated with each dataset and each chosen discrepancy function differed from the classical calculation for predictive p-values. Let a denote the number of sampled replicates (X_j)_j whose discrepancy is more extreme than the observed discrepancy - ties being broken by the uniform random value e - and b the number of remaining replicates. Instead of the classical formula a/(a+b) [4], the p-value was drawn at random from the beta distribution with respective shape parameters a+1 and b+1. Indeed, it can be shown that this distribution is the posterior distribution of the underlying p-value once we have observed or sampled (X_j)_j, θ̃, e and x_obs, provided the prior of the p-value is uninformative [4] (p. 40). In contrast, the use of a/(a+b) can result in significant departures from the uniform distribution that would be due to the calculation method and not to the underlying p-value; this would especially occur with a low number of replicated data (X_j)_j or when estimating the tails of the uniform distribution (see Text S9).
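The beta-smoothing step can be sketched as follows. This is a simplified Python illustration: ties are counted as "more extreme" here, whereas the paper breaks them with the extra uniform draw e.

```python
import numpy as np

def beta_smoothed_p_value(d_rep, d_obs, rng=None):
    """Draw the p-value from Beta(a + 1, b + 1) - the posterior of the
    underlying p-value under a uniform prior - rather than reporting
    the plug-in estimate a / (a + b), which misbehaves in the tails
    and with few replicates."""
    rng = rng or np.random.default_rng()
    d_rep = np.asarray(d_rep)
    a = int(np.sum(d_rep >= d_obs))   # replicates at least as extreme
    b = d_rep.size - a                # remaining replicates
    return float(rng.beta(a + 1, b + 1))

rng = np.random.default_rng(3)
p = beta_smoothed_p_value(rng.normal(size=5000), d_obs=0.0, rng=rng)
# With the observation at the median of the replicates, p is near 0.5.
```

Note that even when a = 0, the draw from Beta(1, b + 1) remains strictly positive, avoiding the degenerate p-value of exactly 0 that a/(a+b) would produce.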
The resulting p-values were considered as samples from the distribution [p_msp,d(x_obs, e)]_{x_obs ~ m_0, e ~ U(0,1)}. They were numerically compared with the uniform distribution through Kolmogorov-Smirnov tests, which are adequate and easy to calculate for such continuous-valued distributions, as well as through binomial two-sided tests for the proportion of p-values falling in the 5% or 1% extremities of the [0,1] interval. As stated above, we used a uniform random number e′ to allocate, between the ''less extreme'' and ''more extreme'' categories, the probability of the event in which the proportion simulated from the binomial distribution was equal to the observed proportion. This guaranteed a uniform distribution of the associated p-value. For the proportion of p-values in the 5% or 1% extremities of the [0,1] interval, we also calculated the posterior density of the estimated proportion from the observed number, using a beta distribution as above. We then analyzed where the posterior estimates were positioned relative to intervals around the target probabilities of 5% or 1%. For example, we distinguished cases where 95% of the estimates of the underlying proportion of p-values fell in the interval [0, .04) (the proportion of p-values is estimated to be non-negligibly less than 5%), from cases where 95% of the estimates fell in the interval [.04, .06] (the proportion is estimated to be negligibly different from 5%), and from cases where 95% of the estimates fell in the interval (.06, 1] (the proportion is estimated to be non-negligibly greater than 5%) (see Text S1).
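These uniformity checks can be sketched as follows (Python/SciPy instead of the paper's R; the split of the 5% mass into 2.5% per tail is our reading of "the 5% extremities", and the randomized tie-breaking of the binomial test is omitted for brevity):

```python
import numpy as np
from scipy import stats

def uniformity_checks(p_values, tail=0.05):
    """Compare simulated p-values with U(0, 1): a Kolmogorov-Smirnov
    test, plus a two-sided binomial test on the proportion falling in
    the `tail` extremities of [0, 1] (tail/2 on each side)."""
    p_values = np.asarray(p_values)
    ks = stats.kstest(p_values, "uniform")
    in_tails = int(np.sum((p_values < tail / 2) |
                          (p_values > 1 - tail / 2)))
    binom = stats.binomtest(in_tails, len(p_values), tail)
    return ks.pvalue, binom.pvalue

rng = np.random.default_rng(4)
ks_p, binom_p = uniformity_checks(rng.uniform(size=10000))
# For genuinely uniform input, neither test p-value should be small.
```

The binomial tail check complements the KS test because miscalibration in the extremities - precisely where GOF decisions are made - can be modest in KS distance yet practically important.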
Comparing p_sp with p_pop and p_MLhs under the Poisson model. Finally, for the Poisson model, we compared p_sp with p_pop and p_MLhs under Scenario 4 and under a modification of Scenario 4 in which f ≠ f_0. We used the same test statistics as above, plus the maximum function. Forty thousand datasets were generated either as in Scenario 4 or from a Polya distribution [30] with a maximum value n_max drawn at random from the values 4 and 5, and a mean and variance equal to those of the aforementioned Poisson distribution. The sample size was drawn at random between 20 and 50, except for Figure 1, where it was sampled from the set (20, 30, 40, 50, 60, 70, 80).
The R commands to run and analyze the simulations described above can be found in Text S8.

Results
The sampled posterior p-value: mathematical results. p_sp is uniform when f = f_0 and π = π_0. The following Lemma extends Johnson's [18] results on test statistics applied to normalized data to general discrepancy functions, including discrete-valued discrepancy functions.

Lemma: Assume that π = π_0 is proper and that f = f_0 - so that assumptions (A1) and (A2) are met. Then, for every discrepancy function d, the probability distribution of [p_msp,d(x_obs, e)]_{x_obs ~ m_0, e ~ U(0,1)} is uniform, i.e. Pr(p_msp,d(x_obs, e) ≤ s) = s for every s in [0,1].

Proof. The proof of this Lemma follows the same lines as the proof of the Lemma in [18]. Because π = π_0, the pair (θ̃, x_obs) has the same joint distribution as (θ, x_obs) under the probabilistic model; conditionally on θ̃, the observed data and the replicated data are therefore i.i.d. from f(.|θ̃), so that the randomized p-value is uniform given θ̃. Integrating over the distribution of θ̃ then gives Pr(p_msp,d(x_obs, e) ≤ s) = s, which yields our result.

p_sp is asymptotically uniform when f = f_0 and π ≠ π_0. The above result shows that p_sp is uniform provided (A1), (A2) and that the statistical prior π - which generates the posterior distribution P_pop(θ|x_obs) - is the same as the probabilistic prior π_0. We can extend this result to the case where the two priors differ, by showing that under conditions:
- on the likelihood - including the identifiability of the model and the independence of the observations;
- on the priors - including that, for every θ such that π_0(θ) > 0, we must have π(θ) > 0;
- on the discrepancy function - its continuity relative to θ;
- on the parameter space Θ - its compactness;
p_sp is asymptotically uniform under (A1) and (A2).
Sketch of proof. If we assume that the parameter space Θ is compact, that the model is identifiable and that the random variables are independent and identically distributed, then the posterior distribution P_pop(.|x_obs) concentrates around the parameter value that generated the data as n tends to infinity [4]; θ̃ therefore converges to that value, and the argument of the previous Lemma applies asymptotically.
Discussion. Under these conditions, p_sp is asymptotically uniform even when π ≠ π_0. These results also hold when π_0 is a Dirac distribution δ_{θ_0}. Under more stringent conditions on the likelihood and the prior, these results can be made sharper - and can inform on the speed of convergence - by using the convergence of the posterior distribution to normality [4,[31][32][33]].
The sampled posterior p-value: simulation results. The results above are mathematical and mostly asymptotic. We now study the finite sample behavior of the sampled posterior p-value based on our simulations. Overall, our results for Scenario 1 - corresponding to a perfect match between the statistical and probabilistic models - were in accordance with our expectations: p_sp and p_nsp had behaviors compatible with uniform p-values (Text S1).
When the statistical prior had the same mode but was sharper than the probabilistic prior in Scenario 2, p_sp and p_nsp yielded poor results for the studied sample sizes (Table 3 and Text S2), in contrast with their good asymptotic behavior (previous section). Conversely, when the statistical prior was less informative than the probabilistic prior, both p-values were much closer to being uniform (Text S2), in sharp contrast with p_prp (Text S6).
Except in one case, p_sp and p_nsp were also not far from uniform in Scenario 4, where the true parameter value was equal to the mode of the statistical prior (cf. Text S4). The exception was observed for the Bernoulli model with p_sp and t = var or d = varc: in this case, p_sp did not approach uniformity, even with relatively high sample sizes. On the whole, however, p_sp and p_nsp were further from uniform for small sample sizes in Scenario 4 than in Scenario 2 with uninformative statistical priors.
De-centering the statistical prior (Scenario 3) yielded p_sp and p_nsp values that were further from the uniform distribution (Table 4 and Text S3). However, p_sp and p_nsp remained relatively close to uniform when the statistical prior was less informative than the probabilistic prior and the de-centering was not too strong.
Comparing p_sp against p_pop and p_MLhs for the Poisson model under Scenario 4 with t = max showed that p_pop was conservative, as expected from the mathematical results in [15], while p_sp and p_MLhs were closer to uniform for sample sizes 20 and 50 (Text S5). When the true distribution was a Polya distribution instead of a Poisson distribution, p_sp and p_MLhs were of similar, and greater, power, except for the highest sample sizes, where p_pop tended to be slightly more powerful (Figure 1). A difference in power of 10 to 20% in favor of p_sp, p_nsp or p_MLhs was not uncommon and was observed with various discrepancy functions (Figure 1 and Table 5).

Synthesis of results
In this paper, we first recapitulated various calibrated Bayesian methods for goodness-of-fit (GOF) p-values and extended the results found in [18] for normalized sampled posterior p-values (p_nsp) in several directions. We showed in particular that similar results apply to the more general p_sp when the data are not normalized, and for discrepancy functions that can be discrete-valued rather than only continuous-valued test statistic functions. We also showed that this p-value is asymptotically uniform when the statistical prior differs from the probabilistic prior (π ≠ π_0). Through simulations, we empirically tested this p-value under π ≠ π_0 in a finite sample context. The results show that p_sp behaves relatively correctly provided that the statistical prior is ''not too informative and not too uninformative'', and not too far off-centered, relative to the probabilistic prior. An exception to this statement occurred in Scenario 4 with the Bernoulli model and t = var or d = varc, for which p_sp was far from uniform even for relatively large sample sizes. We think this is because the fixed parameter θ = .5 used to sample x_obs was precisely the parameter value for which the variance is largest over the whole parameter space. This might correspond to very slow convergence in this specific case, or to a restriction of our asymptotic mathematical results somewhat similar to the convergence at the edge of the parameter space in [4] (Section 4.3). A simulation with θ = .7 yielded a p_sp that was much closer to uniform (Text S4).

[Table 3. Kolmogorov-Smirnov distance (D) between the simulated p_nsp and the uniform distribution, and frequencies (P_5% and P_1%) of p_nsp found in the 5% and 1% extremities of the unit interval, for the Poisson model with t = Z_a in Scenario 2, based on 100,000 different datasets. In this Scenario, the statistical prior has a different sharpness from the probabilistic Gamma prior on the Poisson parameter. The statistics for the overall sample were D = .019**, P_5% = .061***,+ and P_1% = .016***,++. These results illustrate that, for p_nsp to be approximately uniform when the statistical prior is not the same as the probabilistic prior, it is preferable for the statistical prior to be less informative rather than more informative compared with the probabilistic prior. Similar results were found for other test statistics and other probability distributions (cf. Text S2). NOTE: The notation for the significance of the tests is as follows: (*) means that the test is significant at a level between .05 and .1; * between .01 and .05; ** between .0001 and .01; *** less than .0001. The notation for the negligibility of departures from expected values is as follows, for P_5%: 00 (respectively, 0) means that 95% of the estimated values of the underlying proportion are in the interval [.045, .055] (resp. [.04, .06]); ++ (respectively, +) means that 95% of the estimated values are in the interval (.06, 1] (resp. (.055, 1]); -- (respectively, -) means that 95% of the estimated values are in the interval [0, .04) (resp. [0, .045)). For P_1%, the notations are the same but with cutoff points divided by 5. doi:10.1371/journal.pone.0014770.t003]

Based on these new results and on the review of published results (Material and Methods section), we shortlisted three simple candidate methods for asymptotically uniform GOF p-values:
Method 1: p_sp, with a variety of discrepancy functions d and with not-too-inadequate statistical priors;
Method 2: p_pop or p_ML, with a test statistic function t such that, asymptotically, the mean of t(X) does not depend on θ [15].
Examples of such functions include skewness or kurtosis for the normal distribution, skewness for the t distribution, or the ratio between the mean of the sample and its variance for a Poisson distribution; Method 3: p_ML^hs, with specific test statistic functions used on uniformized data.
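The Poisson mean/variance ratio just mentioned can be checked numerically: since a Poisson(θ) distribution has mean and variance both equal to θ, the expectation of this statistic is asymptotically 1 whatever θ. The following minimal sketch (rates and sample size are our illustrative choices, not values from the paper) demonstrates this:

```python
import numpy as np

rng = np.random.default_rng(5)

# A "Method 2" style test statistic for the Poisson model: the ratio of
# the sample mean to the sample variance. A Poisson(theta) has mean and
# variance both equal to theta, so the asymptotic mean of this statistic
# is 1 whatever theta.
def t_ratio(x):
    return x.mean() / x.var(ddof=1)

# The ratio stays close to 1 across very different rates.
ratios = [t_ratio(rng.poisson(theta, size=20000))
          for theta in (0.5, 3.0, 20.0)]
```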
We will also discuss two other, more elaborate sets of methods: Method 4: partial posterior predictive p-values (p_ppop) or conditional predictive p-values (p_cp), used only with test statistic functions, as developed and proposed by [14,15,26]; Method 5: calibrated posterior predictive p-values (p_cpp), as proposed in [17].
One last method could have been to use p_pop or p_ML with test statistic functions, knowing that they are conservative [15]. However, our results for p_pop show that we would then lose a significant amount of power compared with p_sp and p_nsp (Figure 1 and Table 5). This strategy will therefore not be considered further here.

The relative merits of candidate p-values
We hereafter discuss the merits and limits of our preferred method, Method 1 or p_sp, in comparison with the other candidate methods. With respect to Method 2, Method 1 has the advantage of allowing the use of various discrepancy functions, whereas Method 2 requires very specific test statistic functions; this means that different aspects of the probabilistic model can be studied with Method 1, rather than only the t functions that characterize the hypothesized probabilistic distribution. We agree with [4,20,24] on the need to adapt discrepancy functions to each particular situation in which we might want to test case-specific departures of the data from the model. This makes it possible to include in model checking problems involving the detection of outliers (t = min or t = max) and dependence between observations [24]. It also means that p_sp appears more flexible and more widely applicable to very different probability distributions than Method 2: for more complicated hypothesized distributions, it might be difficult to build t functions such that, asymptotically, the mean of t(X) does not depend on θ. On more theoretical grounds, while p_pop and p_ML provide default and intuitive responses to question (b) in [34], i.e. ''what replications should we compare the data to?'', p_sp gives a different and less intuitive answer, based on mathematical results: replications should all be sampled from the likelihood based on a unique parameter value, itself sampled from the posterior distribution, and not from multiple parameter values sampled from that same distribution (p_pop) or from the maximum likelihood parameters (p_ML).
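The recipe above (all replicates sampled from the likelihood at a single posterior draw) can be sketched in a few lines. The sketch below assumes a Poisson model with a conjugate Gamma prior; the prior parameters, the discrepancy t = max and the number of replicates are illustrative choices of ours, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

def sampled_posterior_pvalue(x_obs, a0=1.0, b0=1.0, t=np.max, n_rep=2000):
    # Sketch of p_sp for a Poisson model with a conjugate Gamma(a0, b0)
    # prior on the rate theta (a0, b0, t and n_rep are illustrative).
    n = len(x_obs)
    # Conjugacy: the posterior is Gamma(a0 + sum(x_obs), rate b0 + n).
    # Draw a SINGLE theta~ from it (numpy's gamma takes a scale argument).
    theta_tilde = rng.gamma(a0 + np.sum(x_obs), 1.0 / (b0 + n))
    # All replicates come from the likelihood at that single value.
    t_rep = np.array([t(rng.poisson(theta_tilde, size=n))
                      for _ in range(n_rep)])
    # Fisherian p-value: probability of something as weird or weirder.
    return float(np.mean(t_rep >= t(x_obs)))

x_obs = rng.poisson(3.0, size=50)
p = sampled_posterior_pvalue(x_obs)
```

Note that the single draw of θ̃ is the defining design choice: replacing it with a fresh draw per replicate would give p_pop instead.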
In comparison with p_cpp (Method 5), the main advantage of p_sp is its much lower computational cost inside MCMC computations, including for complicated models. By contrast, p_cpp entails multiplying the MCMC computational burden by the number of ''repetitions'' of the model on which post-processing is based. This would take from at least a hundred to a thousand times longer than p_sp. From our point of view, this is a major problem, especially in cases such as hierarchical models on large datasets. Therefore, the choice between Methods 1 and 5 may primarily depend on the length of time required to fit the model.
Regarding Method 4, the apparent weakness of p_sp compared with the results in [26] for p_cp is that we have no information on when the asymptotic behavior is reached, except when the whole statistical model is the same as the probabilistic model used to sample the data. Nevertheless, our simulation results do show that, provided the priors are neither too informative nor too uninformative, and not too far off-centered, p_sp is not very far from uniform. An advantage of p_sp over p_ppop or p_cp is its simplicity: we do not need to calculate the calibrated likelihood of the model with respect to the test statistic, as we do for p_ppop or p_cp. Moreover, if one wishes to calculate N different p-values based on N different test statistics, this can be done inside the same numerical fit in the case of p_sp, but must be done N times on N different calibrated likelihoods for p_ppop or p_cp. A final advantage of p_sp over p_ppop or p_cp is that we have mathematical results for discrepancy functions in general, rather than just for test statistic functions, as is the case for p_ppop and p_cp.

Table 4. Behavior of p_nsp relative to the uniform distribution under Scenario 3, based on the frequency of p_nsp values found at the 5% extremities of the unit interval, depending on the interval of the statistical prior sharpness parameter m_s0 (in rows) and off-centering parameter |log(θ₀) − 1| (in columns) (Text S3). NOTE: The notation systems for the significance of the tests and for the negligibility of departures from expected values are as in Table 3. Qualitatively similar results were found for t = mean and t = variance. For t = kurtosis and t = skewness, results were much less strongly and much less frequently significant. doi:10.1371/journal.pone.0014770.t004

Table 5. Difference in power between p_sp or p_nsp and p_pop according to sample size (in columns) and discrepancy function (in rows).
Methods 1 and 3 appear very close in terms of applicability and, in the example studied, in terms of power. Their respective powers could be studied in more detail in the future. A common feature of both methods is that they give random results, in the sense that we can randomly reach different p-values for the same observed data x_obs. A small advantage in favor of p_sp in Method 1 is that it does not require a separate fit on the half-sample, which Method 3 does. A stronger advantage of Method 1 is that its asymptotic validity is proved for general discrepancy functions, whereas the mathematical results we have for p_ML^hs in Method 3 only apply to specific test statistic functions of uniformized data [28]. This in particular implies that we have no mathematical result on the asymptotic uniformity of p_ML^hs in Figure 1.
We therefore propose using p_sp and p_nsp as a good GOF strategy, one that is unrestricted with respect to distributions and d functions and has a reasonable numerical and coding cost. To our knowledge, these are the only p-values that have a known asymptotic probability distribution whatever the discrepancy function.

Notes on how to use p_sp
This section discusses two points related to the strategy of using p_sp: the choice of prior distribution, and the choice of the parameter value(s) used to sample ''new data'' and to normalize them.
First, our results indicate that we should generally prefer priors that are moderately less informative in data analysis than in data sampling (Table 4 and Appendices 2 and 3). This statement echoes similar considerations in [17] (Section 9.3). If this result generalizes, it would mean that when p_sp indicates a significant departure from the uniform distribution, the same model should be tested with less informative (respectively, more informative) priors, depending on whether the prior is judged too informative (respectively, too uninformative). An alternative might be to use p_sp in a frequentist setting, provided the asymptotic assumption of normality of the estimators is correct (cf. next section). If significant departures from a uniform distribution are still found, the probability distribution used in the likelihood should be reconsidered in data analysis.
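A departure of p_sp from uniformity can be probed with the kind of simulation used in our Scenarios: repeatedly sample data from the probabilistic model, compute p_sp under the statistical prior, and compare the resulting p-values with the uniform distribution. A minimal sketch, assuming a Poisson model with Gamma priors and t = max (all numeric choices, including the "flatter" statistical prior, are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def p_sp(x_obs, a, b, n_rep=500):
    # p_sp for a Poisson model under a Gamma(a, b) statistical prior,
    # with t = max as discrepancy (illustrative choices).
    n = len(x_obs)
    theta = rng.gamma(a + x_obs.sum(), 1.0 / (b + n))
    t_rep = rng.poisson(theta, size=(n_rep, n)).max(axis=1)
    return float(np.mean(t_rep >= x_obs.max()))

# Calibration check: sample theta0 from the probabilistic prior Gamma(2, 1),
# data from Poisson(theta0), then analyse with a flatter statistical prior.
pvals = []
for _ in range(300):
    theta0 = rng.gamma(2.0, 1.0)
    x = rng.poisson(theta0, size=30)
    pvals.append(p_sp(x, a=0.5, b=0.1))
# A marked departure from uniformity would suggest revisiting the prior
# (note that a discrete discrepancy makes the p-values themselves discrete).
D, ks_p = stats.kstest(pvals, "uniform")
```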
Second, p_sp involves a single sampled value θ̃ of the model parameter θ, which means that the p_sp method may give different random results on the same dataset with the same model [18]. An alternative solution would be to use the probabilistic bounds method proposed in [18] (Section 2.3). A further potential alternative we propose, within the formalism of p_sp (see Table 1), could be to: 1) sample values θ^(i) from the posterior distribution associated with x_obs, and values e^(i) from the uniform distribution; 2) compute the p-values associated with these draws; 3) consider the empirical α-quantile of the latter distribution.
Provided analysts use the same value of α, drawn at random at the beginning of the first analysis of a given dataset, this would guarantee better comparability between analyses of the same dataset by different analysts.
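One reading of this α-quantile procedure can be sketched as follows. The Poisson/Gamma model, the discrepancy t = max and the sample sizes are our illustrative choices, and the uniform e^(i) draws of the original formulation (tie-breaking for discrete discrepancies) are omitted for simplicity:

```python
import numpy as np

rng = np.random.default_rng(3)

def p_sp_quantile(x_obs, alpha, a=1.0, b=1.0, n_theta=200, n_rep=500):
    # Draw several theta^(i) from the posterior, compute one p-value per
    # draw, and report the empirical alpha-quantile of those p-values.
    n = len(x_obs)
    pvals = np.empty(n_theta)
    for i in range(n_theta):
        theta_i = rng.gamma(a + x_obs.sum(), 1.0 / (b + n))
        t_rep = rng.poisson(theta_i, size=(n_rep, n)).max(axis=1)
        pvals[i] = np.mean(t_rep >= x_obs.max())
    return float(np.quantile(pvals, alpha))

# alpha would itself be drawn at random once, at the start of the first
# analysis, and then shared by every analyst of the same dataset.
alpha = rng.uniform()
q = p_sp_quantile(rng.poisson(4.0, size=40), alpha)
```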

Final global remarks
In contrast to the likelihood principle, calibrated Bayesian techniques involve the use of artificial data, i.e. data that were not observed. This makes pure Bayesians reluctant to use these techniques [35]. Indeed, internal calibrated Bayesian goodness-of-fit is sometimes considered a hopeless cause, whose proponents want to have their cake (estimate model parameters based on all the data available) and eat it too (confront the fitted model with the same data that were used to fit it). Calibrated internal goodness-of-fit consequently attracts criticism for using the data twice [8]. Strikingly, p_sp seems to provide a nearly uniform p-value although it uses the data x_obs twice: once to estimate the posterior distribution, from which θ̃ is sampled, and once again to calculate d(x_obs, θ̃). It would therefore appear to warrant the same criticisms as p_ML or p_pop, criticisms that were supposed to explain their lack of asymptotic uniformity. Johnson [16] explains it in these terms, in the context of chi-square statistics: ''Heuristically, the idea [...] is that the degrees of freedom lost by substituting the grouped MLE for θ in Pearson's χ² statistic are exactly recovered by replacing the MLE with a sampled value from the posterior [distribution]''. The proof of Lemma 1 in the Results section reveals another explanation: since we are working on sampled data to fit statistical models, we should also agree to work on sampled parameters to criticize the model. Indeed, this double sampling allowed us to make the roles of data and parameters symmetrical, enabling us to prove our mathematical results. The problem therefore lies less in whether a GOF p-value uses the data twice than in how it uses them twice; see [36] on the need to define more precisely what we mean by ''using the data twice''.
We have applied p_sp and p_nsp in a Bayesian context. However, as stressed in [16], these p-values might also be used with frequentist methods when the asymptotic assumption of normality of the estimators is correct. Indeed, we applied p_nsp to the Poisson case by drawing a value of θ at random on the log scale from a normal distribution whose mean and standard deviation were the estimate and standard error fitted with a Poisson generalized linear model (glm). The results indicate behavior as good as that of p_nsp used in Bayesian models under Scenarios 1 and 4 (Text S7).
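This frequentist transposition can be sketched as follows. We assume an intercept-only Poisson model so that the GLM estimate and its standard error have closed forms (our simplification; the original analysis used a fitted glm), and t = max as an illustrative discrepancy:

```python
import numpy as np

rng = np.random.default_rng(4)

def frequentist_p_sp(x_obs, n_rep=2000):
    # Frequentist transposition: replace the posterior draw by a draw of
    # theta on the log scale from a normal distribution centred on the
    # GLM estimate, with the estimated standard error as spread.
    n = len(x_obs)
    beta_hat = np.log(x_obs.mean())            # MLE of log(theta)
    se = 1.0 / np.sqrt(n * x_obs.mean())       # asymptotic SE of beta_hat
    theta = np.exp(rng.normal(beta_hat, se))   # sampled parameter value
    t_rep = rng.poisson(theta, size=(n_rep, n)).max(axis=1)
    return float(np.mean(t_rep >= x_obs.max()))

p = frequentist_p_sp(rng.poisson(5.0, size=60))
```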
Little [10] once wrote that Bayesian statistics were relatively weak for model assessment compared with frequentist statistics. Although the underused p_ML^hs might be a good frequentist GOF p-value if its properties were known for more general discrepancy functions, our results highlight an even more attractive solution that mixes frequentist reasoning with a completely Bayesian modeling formulation: using the sampled posterior p-value (p_sp) in a calibrated Bayesian framework. The transposition of p_sp into a frequentist setting was shown to be correct in the above example, and could therefore represent another potential ''frequentist'' solution. However, we believe that in the not-so-infrequent cases where the normal approximation of the estimator distribution is not accurate, as can happen for binomial or Poisson regression with a high proportion of zero values, a Bayesian framework is more adequate than a frequentist setting for sampling a value of θ.

Supporting Information
Text S1 Results of Scenario 1.
Text S9 R commands to illustrate the discrepancy between the rough deterministic and the stochastic methods to transform a and b to p-values. Found at: doi:10.1371/journal.pone.0014770.s009 (0.05 MB DOC)