Measurement protocols, random-variable-valued measurements, and response process error: Estimation and inference when sample data are not deterministic

Random-variable-valued measurements (RVVMs) are proposed as a new framework for treating measurement processes that generate non-deterministic sample data. They operate by assigning a probability measure to each observed sample instantiation of a global measurement process for some particular random quantity of interest, thus allowing for the explicit quantification of response process error. Common methodologies to date treat only measurement processes that generate fixed values for each sample unit, thus generating full (though possibly inaccurate) information on the random quantity of interest. However, many applied research situations in the non-experimental sciences naturally contain response process error, e.g. when psychologists assess patient agreement with various diagnostic survey items or when conservation biologists perform formal assessments to classify species-at-risk. Ignoring the sample-unit-level uncertainty of response process error in such measurement processes can greatly compromise the quality of resulting inferences. In this paper, a general theory of RVVMs is proposed to handle response process error, and several applications are considered.


Introduction
The customary way to empirically study some stochastic phenomenon Y, with corresponding probability distribution Pr Y defined over some measurable space, is to observe a set of independent sample realizations of the random variable Y: y 1 , . . ., y N . In the broadest sense, these data are a collection of sample measurements of Y. If these sample observations are subject to measurement error, then one instead observes the sample measurements y � 1 ; . . . ; y � n , where Y � is related to the random variable of interest Y in some meaningful way. A host of measurement error models have been proposed to handle such situations (see e.g. [1][2][3][4][5][6][7]). Regardless of the presence of measurement error, these sample data are then used to derive estimates and make inferences about various properties of Y. In a sense that we will make precise (see Section 2.1), known. At best, sample estimates for these quantities may be available; at worst, nothing at all is known. Even with the benefit of empirical estimates of the pertinent quantities, there is often considerable disagreement among expert assessers as to the quality and relevance of such estimates for the criteria in question. Nevertheless, IUCN guidelines require assessers to arrive at exact classifications for substantive purposes. To address this conflict, "fuzzy numbers" [15] and consequent "fuzzy statistics" [16] have been utilized to specify a best estimate and a range of plausible values (see e.g. [17,18]), combining both empirical estimates from the applied literature for many sub-populations, corresponding sampling and/or measurement error of these estimates, and assessor-specific uncertainties of the quality and applicability of these empirical quantities. In many applications (e.g. those of the previous citations), this framework relies on the use of so-called "triangular numbers." These are particular kinds of fuzzy numbers that naturally arise when one considers an arithmetic of confidence intervals (or, more generally, interval estimates of population parameters), rather than the ordinary arithmetic applicable to point estimates. Consequently, decision-making is a much more delicate process than what arises out of the sometimes overly coarse tradition of the "accept/reject" point-null hypothesis testing paradigm. While a useful approach, these triangular number estimates, similar to the CUB and 3-parameter IRT models of psychometrics, are only equipped to deal with particular types of response process error from a modelling perspective because they still treat the observed sample data (i.e. sample measurements) as fixed, deterministic real numbers.
In brief, this paper proposes a mathematical apparatus for quantifying response process error broadly construed via the notion of random-variable-valued measurements (RVVMs). All sample data in any situation can be viewed as realizations of a measurement protocol, a measure-valued mapping that determines the RVVMs. I will define these ideas formally and show how they generalize and recover the classical case where sample data are totally deterministic (Section 2.1); i.e. sample realizations of a real-valued random variable. I make precise how measurement protocols subject to response process error produce partial information about a random variable of interest (Section 2.1), define a calibration condition that is desirable for measurement protocols to obey, and discuss how this condition can be validated in practice (Section 2.2). Problems of estimation and inference are then considered from the point of view of a classical Bayes' estimator (Section 3.1) and the basic case of Bernoulli-valued measurements is further developed as an application (Section 3.2). The next section contains examples and applications (Section 4) that illustrate the utility, power, and novelty of RVVMs in real world research scenarios relevant to field ecology and clinical psychology. A short section comparing and contrasting RVVMs to a few other related statistical topics (Section 5) concludes the paper. Two appendices containing mathematical proofs of propositions and R code for examples follow.

Measurements as random variables
Let ðO; FÞ be a measurable space. We are interested in the general problem of inferring properties about a random variable Y : O ! R defined on this measurable space. To do so, we draw a random (finite) sample S ≔ fo i : 1 � i � ng � O, and then make a measurement ρ for Y on each sample unit ω i . The following definition characterizes the mathematical structure of this process. Definition 2.1 For a given random variable Y defined on ðO; FÞ, we define a measurement protocol for Y as a measure-valued mapping ρ such that Definition 2.1 requires two things: first, that for any sample point ω 2 O, the corresponding sample measurement ρ(ω) = μ ω is a Borel probability measure on the real line. This allows one to explicitly capture any uncertainty in the measurement response process itself at the individual level of the sample point. As we will see, such a feature is essential for quantifying the notion of response process uncertainty in a mathematically coherent way.
Second, the restriction that the support of the sample measurement lies within the image of Y is needed to ensure that ρ actually produces meaningful measurements for Y, the random variable of inferential interest. A simple example will illustrate this: take Y �Ber(θ) defined on the Borel sets over the reals. Consequently, Y(ω)2{0, 1} for any o 2 R. A measurement ρ for Y must thus assign a probability that the measurement process produces a 0 or a 1 for any sample point o 2 R. If the measurement process produces only fixed data (i.e. no response process uncertainty), then ρ(ω) is a point-mass with support concentrated on the recorded outcome of the observed measurement (i.e. 0 or 1). Moreover, if this measurement process is also free of measurement error, then in fact this point-mass is concentrated on Y(ω)2{0, 1}. Alternatively, if response process uncertainty is present (as in the example of the unsure student answering a test question of the Introduction), then rðoÞ ¼ Berðy o Þ, where the parameter θ ω captures the sample unit (response) uncertainty (i.e. subject-specific confidence) in the correctness of the chosen multiple choice option for the test item in question. If we did not require supp(μ) � Range(Y), then we could allow a measurement process to assign a non-zero probability to an outcome that could not be produced by Y itself. Such a measurement protocol would not then reflect anything inherently meaningful about the random variable of interest.

Generalized likelihoods
Now we need to make sense of how to use the measurements generated by ρ on our sample S to actually study the random variable of interest, Y. In this paper, we will be concerned with making inferences on global properties (parameters) of the random variable Y; e.g. inferring its mean. This is often accomplished probabilistically by creating a likelihood function for the parameter of interest and the observed data. In the classic case, we often write f(y | θ), where y = {Y(ω 1 ), . . ., Y(ω n )}, the sample realizations of the random variable Y.
Our approach will be no different, except that our measurements are not necessarily simply the sample realizations of Y (or of any measurement error prone proxy for Y). Our data, quite literally, consist of the measurements rðS Þ, and for each o 2 S , what we actually observe (via the measurement process) is a measure μ ω . In turn, the measure μ ω gives the response certainty that Y assumes some set of values for a given ω. Define random variables Z ω � μ ω . Thus, we write a generalized likelihood as where Z S denotes the vector of random variables {Z 1 , . . ., Z n }, with Z i ¼ Z o i . This is not recognizable as a true likelihood function since the input Z S is a random variable defined on the product space � n i¼1 ðR; BðRÞÞ. However, for every z in the product space R n , the vector Z S ðzÞ is a fixed sequence of real numbers (in the range of Y); thus, f Y ðZ S ðzÞ j yÞ is an honest likelihood function.
The key observation is that we can now express our generalized likelihood as an average of traditional likelihoods via total probability: where μ S denotes the product measure on � n i¼1 ðR; BðRÞÞ induced by the sequence of measures {μ 1 , . . ., μ n }. In what follows, we will always assume that our sample S is actually a simple random sample, and that our measurement protocol generates mutually independent measurements; i.e. that μ i ? μ j for all i 6 ¼ j. We then have the following canonical expression for the generalized likelihood of a simple random sample generated by independent measurements: where z ≔ (z 1 , . . ., z n ) is a vector in R n . The assumption of an independent measurement protocol allows us to write Z S ðzÞ ¼ ðZ 1 ðz 1 Þ; . . . ; Z n ðz n ÞÞ; and the assumption of simple random sampling allows for the usual decomposition of the joint likelihood into the product of its marginals. As we will see in Section 3, such a generalized likelihood can be used in essentially the same way as any traditional likelihood for purposes of estimation and inference.
It is easy to see that this generalized likelihood collapses down to a traditional likelihood when response process error is absent from the measurement protocol. To recover this classical fixed measurement approach, we simply require our measurement protocol ρ to assign each sample unit ω i 2 O the appropriate observed realization of the random variable Y at ω i . As previously noted, this means that the measurement μ i is a point-mass concentrated at Y (ω i ). In this case, we have μ S ¼ d y 1 � . . . � d y n , where δ x denotes the point-mass at x and we abbreviate with the usual notation y i = Y(ω i ). To see how Eq (2) recovers the classical likelihood associated with these fixed measurements, simply compute Thus, we have that every measurement is the ordinary observed realization of the random variable Y, and the generalized likelihood in Eq (2) recovers the classical likelihood function associated to these fixed data.
Notice that exactly the same argument holds if the measurement protocol ρ produced fixed measurements subject to some kind of measurement error. In this case, one would observe the sample realizations of Y � , some proxy for Y. Crucially, the concept of measurement error assumes that a fixed data point is always observed, Y � (ω), so again the sample measures generated by such a measurement protocol must be simple point-masses. Of course, in this context, the generalized likelihood in (2) In what follows, we will refer to measurement protocols that generate only point-masses (i.e. fixed data) as trivial RVVMs. Such measurements are free of response process error since each sample measurement is deterministic. Absent any measurement error, we should intuitively expect that a set of sample data generated by such a sequence of fixed measurements to contain more information about Y than one generated by a sequence of nontrivial RVVMs (i.e. measurements for Y subject to response process error). Indeed, this intuition can be formalized using the classical notions of Shannon information and relative entropy. It is worthwhile to outline this mathematical interpretation here, as it will allow us to formalize what is meant by the phrase partial information within the context of RVVMs, and to better understand the unique kind of uncertainty that RVVMs can encode.
Suppose ρ 0 is a fixed measurement protocol for Y such that ρ 0 (ω) = δ Y(ω) , and let ρ 1 be any other nontrivial (i.e. not all point-masses) measurement protocol for Y such that μ ω (Y(ω)) 6 ¼ 0. Then, for any ω 2 O, consider the Kullback-Leibler divergence from ρ 1 (ω) to ρ 0 (ω): Since ρ 0 (ω) generates a fixed point-mass measure, the Kullback-Leibler divergence reduces to D KL ðr 0 ðoÞ j r 1 ðoÞÞ ¼ À log ðm o ðYðoÞÞÞ; which is equal to the (always nonnegative) information content of μ ω at Y(ω). This fact is a reflection of the principle of maximum entropy, and precisely quantifies the amount of information lost when the nontrivial RVVM ρ 1 is used as a measurement protocol for the random variable of interest Y at the sample point ω instead of simply "observing" Y at ω itself via ρ 0 . Extending this to the generalized likelihood associated to a set of sample points, we can see that the generalized likelihood associated to the nontrivial RVVMs is naturally more dispersed.
In the above sense, we can say that any nontrivial RVVMs correspond to partial information measurement protocols for Y. Likewise, we can refer to fixed measurements (i.e. trivial RVVMs) as full information measurement protocols for Y. Note that full information does not imply that the measurement protocol for Y is accurate (e.g. it may still be subject to traditional measurement error), since we could perform the same relative entropy calculation as above even if ρ 0 generates point-masses that do not always agree with the sample values of Y. The information content of the measurement protocol is only a meaningful reflection of any response process uncertainty that may be present in the measurement process.
Here, it may be useful to recognize that because response process error dictates that sample measurements themselves are random quantities, one cannot fundamentally separate (and so quantify) response process error from more customary sources of uncertainty: sampling error and measurement error. Indeed, since we treat all sample measurements as potentially random, the sample data themselves are random. Thus, response process error (when it is present) and sampling error are interlocked, since sampling error is quantifiable only with respect to a given dataset. This is not inherently meaningful from a frequentist interpretation of probability, but fits well within the tradition of the Bayesian perspective, where probability is usually construed as a measure of certainty rather than a long-run expected outcome. As we will see in Section 3 when we take up the task of actually estimating population parameters of interest using sample data generated from nontrivial RVVMs, the Bayesian approach will also be far more mathematically natural.

Calibrated RVVMs
Discerning how response process error and traditional notions of measurement error relate is a bit stickier. A proper treatment of this topic requires more pages than this introduction to the mechanics of RVVMs can reasonably hold, but will certainly appear in a forthcoming work. For our present purposes, it is sufficient to distinguish the concepts as we already implictly have: measurement error occurs at a sample unit ω 2 O when a trivial RVVM ρ(ω) is not supported on the true value of Y(ω), whereas response process error is generated by nontrivial RVVMs. Both concepts share the common feature that sample data subject to either kind of error cannot be expected to (fully) agree with the true value of Y for any particular sample measurement. Because of this, our measurements must be calibrated somehow (i.e. vary systematically in some way around the corresponding true values of Y) if we ever hope to quantify the accuracy of any resulting sample estimators for population features of Y. Classically, this necessity gives rise to a host of different calibration conditions, usually phrased in the context of one of many different measurement error models (see Kroc and Zumbo [19] for a detailed summary of additive measurement error models, and see Gustafson [6] for a treatment of multiplicative measurement error). Within the context of response process error, we will be concerned with a similar type of calibration condition. As we will see, this condition will ensure that many sample estimators of interest are well behaved.
Fix a measurement protocol ρ. For any given ω 2 O, define the set That is, O o contains all the sample points ω 0 2 O that map to the same measure as does ω under the the measurement protocol ρ. We will assume that O o is ðO; FÞ-measureable, although this is not necessarily so apriori. We then have the following important definition. Mathematically, the novelty of this definition lies in the fact that it equates expectations on two different probability spaces. Notice that both quantities in play are functions of ω 2 O. The lefthand quantity is simply the expectation of the measure assigned to the sample unit ω via the measurement protocol ρ, whereas the righthand quantity calculates the conditional Definition 2.2 captures what we would expect to hold if the measurement protocol ρ generates response process error that is still accurate on average; i.e. if it is given by an expert observer (see below). An important consequence of this kind of calibration is that it allows one to easily construct accurate estimators of certain population parameters of Y, the actual random variable of inferential interest. This notion of calibration will be exploited in Section 3 to prove asymptotic unbiasedness and consistency of certain Bayes' estimators (see Propositions 3.1 and 3.2), and we will use it again in Section 4 when we consider certain real world instances of RVVMs in greater detail. To establish the former, we will require the following simple result that a generalized sample mean is an unbiased and consistent estimator of the population mean.

be a simple random sample drawn from O, and let ρ be an independent measurement protocol on O. Define the sample estimator
Then if ρ is calibrated to Y, some random variable with finite mean and variance, � rðS Þ is an unbiased and consistent estimator of EðYÞ.
With Proposition 2.3 in mind (see the proof in S1 Appendix), it is worthwhile to spend some time unpacking the practical meaning of the calibration condition of Definition 2.2. In order to derive accurate inferences, it is crucial that the sample measurements ρ(ω) be reliable in some sense. The calibration condition above says that the expected value of the sample measurement coincides with the expected value of Y over the set of sample points that generate the same sample measurement.
We can quickly see that traditional fixed measurements that are free of measurement error must always be calibrated. Using the logic of the previous subsection, we know that for such a measurement protocol, we can write rðS Þ ¼ fd y 1 ; . . . ; d y n g. Of course, the expectation of any random variable, X, distributed according to one of these measures is the appropriate y i . Now, notice that for any i, This is simply because we have required our measurement protocol ρ to always return the appropriate fixed value of Y upon measurement. But now for any i, since the random variable Y is always constant on the set Y −1 (y i ). Moreover, using the standard formula for total variance (see the proof of Proposition 2.3 in S1 Appendix), Var (Y | Y −1 (y i )) = 0 for any i, so Varð� rðS ÞÞ ¼ VarðYÞ=n. Proposition 2.3 thus becomes a simple generalization of the Weak Law of Large Numbers for the traditional sample mean.
To understand the calibration property in context, let us consider an example from field ornithology (we will further develop this example in Section 4): assigning the proper sex to a bird captured in the field. An expert researcher may inspect a single individual, denoted by ω, and assess (i.e. measure) the sex as female with 90% confidence (i.e. response certainty). This assessment is the product of a combination of expert diagnostics, including plumage characteristics, body shape, bill shape and size, etc. Importantly, some of these diagnostics may be subjective. We can imagine that this same researcher might capture another bird with the same or different morphological characteristics, and subsequently assess (i.e. measure) the sex of this new bird as female with 90% confidence (i.e. response certainty) again. Since the RVVMs generated by this measurement protocol must be Bernoulli according to Definition 2.1, all birds that generate this same measure of confidence in female sex assessment form the set O o . This researcher's expert assessment is considered calibrated if, among the individuals in this set O o , 90% of them are actually female. This is precisely what we would expect to happen if the researcher performing the measurements is well-trained and knowledgeable; i.e. expert.
This interpretation suggests several ways that the calibration condition in Definition 2.2 can be validated in practice. A researcher could assess a small set of sample elements that are then independently measured exactly (i.e. without response uncertainty), and have their RVVMs checked against these gold standard measurements, with agreement on average yielding calibration. Or, as is common practice in many small bird-banding operations, researchers can keep track of their measurements and cross-check them for accuracy if and when the sample unit is resampled in subsequent banding seasons. In some cases, the obstacles to assigning a sure classification can disappear upon resampling, as when young banded birds are recaptured at a later stage in life when sexing characteristics are definitive. In still other cases, a sample RVVM can be recorded and further measurement protocols (partial or full information) may be implemented that eventually yield a definitive classification. Such situations can commonly arise in applied medicine when patients are preliminarily diagnosed with a particular affliction, and are then subjected to follow-up examinations and tests to confirm or deny an initial diagnosis.
Even in the absence of rigorous validation, we may reasonably expect the calibration condition of Definition 2.2 to hold as long as the observer assigning the measurement is sufficiently trained and knowledgeable in the aspects of diagnosis and discrimination particular to the research and measurement setting. And inversely, we should not expect such a condition to hold for less experienced assessers. In fact, we may expect such measurements to exhibit important, structural miscalibration, resulting in systemic biases of inferences. This will often be the case in studies of social and psychological phenomena where sample respondents are untrained in uncertainty assessment (e.g. nontechnical survey respondents), or random variables of interest are loosely defined latent constructs.

Estimation and modelling with RVVMs
In this section, we will investigate the practical use of RVVMs for estimation and inference. Since data generated via a nontrivial measurement protocol generate a generalized likelihood as in Eq (2), it is not immediately apparent how estimation might proceed in a modelling context. The most straightforward approach to estimation and modelling with RVVMs is the Bayesian one, both in terms of direct interpretability of probability as uncertainty and with regards to analytical tractability.

Classical Bayes' estimator with RVVMs
Consider the problem of estimating some model parameter θ for Y with sample data generated by an independent measurement protocol ρ. The classical Bayes' estimator of y j rðS Þ, i.e. the one minimizing the posterior expected value of the mean squared error, is: The last line takes the familiar form of the classical Bayes' estimator with a generalized posterior assuming the role of the classical, fixed data (i.e. full information) posterior. Naturally, one could rewrite this posterior in terms of the corresponding likelihood(s), prior, and normalizing factor(s): Note that the normalizing factor is now a function of the RVVMs (since it is a function of the sample data).
Eqs (4) and (5) suggest just why the Bayesian approach to estimation with RVVMs is fundamentally more mathematically tractable than an optimization based one, such as maximum likelihood. The Fubini-Tonelli Theorem allows for unfettered exchangeability of the integrals over θ and the joint (probability) measure m S in the definition of the Bayes' estimator, which greatly simplifies the computational problem. On the other hand, a maximum likelihood approach would require optimization of a function of the generalized likelihood (2), a potentially substantial task.
Notice that Eqs (4) and (5) also imply that if π 0 is a conjugate prior for the traditional likelihood f Y (Z(z) | θ), then π 0 is a conjugate prior for the generalized likelihood f Y ðZ S j yÞ, in the sense that the generalized posterior is simply a μ S -weighted average of traditional posteriors that belong to the same parametric family as π 0 . This property can also greatly aid in computation (e.g. see S2 Appendix), and even allows for analytical expressions of the Bayes' estimator in a variety of classic scenarios.

Bernoulli-valued measurements
The most basic such scenario is the study of a Bernoulli phenomenon Y � Ber(θ 0 ). In this case, any measurement protocol ρ for Y can only yield Bernoulli-valued measurements ρ(ω) for any in general, and θ ω 2 {0, 1} corresponds to a full information RVVM (i.e. no response process error). Given a random sample S � O, the generalized likelihood (2) becomes Since the value of this generalized likelihood is driven only by how many Bernoulli successes occur among the RVVMs, we can simplify this expression by defining the measure ν = μ 1 � . . . � μ n and the random variable W � ν, yielding Note that ν is not in general a Binomial measure, since the μ i measures are not necessarily identical; ν is a categorical measure (multinomial on one trial) in general. However, ν is discrete on R, so we can simplify things further and write This simplified version of the generalized likelihood can greatly ease the analytical and computational burden of working with Bernoulli-valued measurements, a fact we exploit in the computations of Section 4 (see S2 Appendix for computational details).
Given a prior distribution π 0 on θ, and using (6), we now have a tractable form for the generalized posterior Consonant with above, if π 0 is a generic Beta(α, β) prior, then we see that this generalized posterior is simply a ν-weighted sum of Beta densities. Put another way, conditional on W = k, the density is Beta(α + k, β + n−k).
Using (7) and (4), the Bayes' estimator of θ becomes Eðy j rðS ÞÞ ¼ Pr n ðW ¼ kÞ Using a similar argument, we can also derive an analytical expression for the posterior variance: In general, the behaviour of this posterior can be quite variable depending on the exact structure of the Bernoulli-valued measurements, even as sample size is increased. In particular, it need not be unbiased or even asymptotically unbiased for θ, and since Var ν (W) can be on the order of n 2 by the sharpness of Popoviciu's Inequality [20], the classical Bayes' estimator need not be consistent for a general independent measurement protocol. However, when we assume that our measurements are calibrated in the sense of Definition 2.2, we do have the following interesting result (see the proof in S1 Appendix). Proposition 3.1 Letŷ ¼ Eðy j rðS ÞÞ be the Bayes' estimator in (8) under an arbitrary Beta (α, β) prior on θ, where Y � Ber(θ 0 ). Suppose that an independent measurement protocol ρ is calibrated to Y according to Definition 2.2. Thenŷ is an asymptotically unbiased estimator of θ 0 .
Suppose now that only some of our sample measurements (not necessarily calibrated) generate response process error, while the others are ordinary sample realizations of Y free of measurement error. Then the following proposition holds. There are two implicit though important implications of Proposition 3.2 (see the proof in S1 Appendix). The first is that an RVVM mapping need not be calibrated to yield asymptotically unbiased and consistent estimates. Indeed, condition (ii) is sufficient to ensure such behaviour. Intuitively, this is not surprising given that condition (ii) guarantees that the full and accurate information about Y contained in the subsample of (accurate) fixed measurements will always overpower the partial information contained in the nontrivial RVVMs.
The second important implication is that calibration is not sufficient for consistency of the Bayes' estimator. This too is not surprising when we realize that we have not placed any stabilizing condition on the RVVMs as the sample size grows. Indeed, notice that the first term in the variance expression (9) always approaches zero asymptotically, but that the second term need not. This is a reflection of the fact that the nontrivial RVVMs are theoretically allowed to have as much entropy as we like; thus, we should not expect that their expected value will stabilize as the subsample of nontrivial RVVMs grows. Unless this expectation is stabilized by a relatively greater subsample of fixed measurements, the (generalized) posterior need not converge to a point. Informally, we cannot hope to "sample away" the partial information in our nontrivial RVVMs simply by collecting more of them. In the context of the applied examples considered in the Introduction and in Section 4 below, this phenomenon is totally expected.

Applied examples and comparisons with methods that ignore response process error
In this section, we illustrate the power and the mechanics of RVVMs to explicitly quantify response process error as part of an integrated data analysis. Results are contrasted with those that would be generated via a typical analysis (i.e. one that ignores response process error) to further emphasize the utility of RVVMs.

Research scenario 1: Sexing birds in the field
Wildlife researchers are often interested in recording the nesting locations of an avian species; however, direct visual confirmation of a nesting site is not always possible [21]. Instead, various diagnostics can be used to assess the likelihood that a nest is present: for example, territorial and mating displays that are characteristic of nesting pairs. Such tentative determinations are an instantiation of the response process error inherent in the measurement process.
In many bird-banding operations individual birds are captured, tagged, and a variety of morphological characteristics are recorded before the birds are released back into the wild. Some measurements are exact, like wing chord and tail length, while others are more diagnostic, like sex and age. In many bird species, sex can be difficult to determine in the field with complete confidence even for highly trained professionals [22]. This is especially true for young birds that have yet to develop their adult plumage [23]. The North American Bird Banding Program requires at least 95% confidence in sex determination before considering a sexing observation valid [24]. Other programs, such as the Vancouver Avian Research Centre, consider at least 90% confidence in sex determination sufficient to generate useable information [25]. Current applied practice dictates that any response process uncertainty is ignored: either when a sample sex identification with 95% certainty is treated as equivalent to a definitive sample sex identification, or when identification uncertainty dips below the acceptable threshold and so the sample datum is discarded. Indeed, this has long been considered good practice in field ornithology for sex and age determination, where Ralph et al. [23] have advised us that "it is better to be cautious than inaccurate." Utilizing the framework for handling response process error proposed in this paper, however, we will see that these two options do not have to be exclusive.
Consider the problem of estimating the sex distribution of a population of small songbird that nests in a particular valley in the summer months, a Bernoulli phenomenon denoted by Y.
Suppose this songbird requires two years to obtain its full adult plumage, after which identification of sex is exact due to pronounced sexual dimorphism. Both adult and juvenile birds return to the valley of interest each summer, and we aim to quantify the distribution of sexes at the beginning of the breeding season. Juveniles can only occasionally be exactly sexed by plumage or other morphological characteristics; usually, only educated guesses can be made instead. Consequently, sex measurements may be subject to response process error.
Suppose that a team of researchers has been banding and collecting data on these birds for many years. Consequently, they are experienced and well-versed in distinguishing sexes accurately at all ages; i.e. suppose that the measurement protocol that quantifies their response process error is calibrated to Y according to Definition 2.2. Note that the assumed accuracy of the researchers in question could have been explictly justified already, perhaps by cross-checking previous tentative sexing diagnoses of banded juveniles against definitive sexing data on recaptured individuals in subsequent breeding seasons.
For our illustration, suppose 50 birds have been sampled at our test location, and data on individual age (0 = juvenile, 1 = adult), weight (in grams), wing chord length (in cm), and sex (0 = male, 1 = female) have been recorded. Measurement protocols on age, weight, and wing chord length generate full and accurate sample information (i.e. generate trivial RVVMs free of measurement error). We will consider three different measurement protocols for sex in this exploration: ρ 1 , ρ 2 , and ρ 3 . The first, ρ 1 , generates full and accurate sample information (trivial RVVMs free of measurement error) for all 50 individuals. This is an idealized measurement protocol that is not actually realized in the field; i.e. the measurement protocol that always returns the true sex of the sampled individual with total certainty. The second measurement protocol, ρ 2 , acts the same as ρ 1 , but is only applied to the subsample of individuals that can be definitively sexed in the field. Any individuals that cannot be definitively sexed in the field are discarded entirely from the sample, which reflects current best practice in field ornithology [24]. The final measurement protocol, ρ 3 , will consist of a mixture of trivial and nontrivial, but still calibrated, RVVMs. For those individuals that can be definitively sexed in the field, ρ 3 (ω) = ρ 1 (ω) = ρ 2 (ω). However, for individuals that cannot be definitively sexed in the field, ρ 3 encodes the certainty that the individual is female, as generated by the domain-expert field technician. Table 1 provides a snapshot of the sample data for each of the three measurement protocols (note that the full dataset(s) can be generated using the R script of S2 Appendix). Note that only 8 birds (all juvenile) are not definitively sexed by the bird-banders.
In the absence of any prior information on sex distribution, it may be natural to expect a uniform split between male and female birds at any age. However, suppose in reality there is more of a tendency for juvenile females to return to their birthplace than for juvenile males, a tendency that disappears once the birds reach adulthood due to differential behavioural changes (e.g. greater pressure on males to find and establish new breeding territory, forcing them to disperse earlier from their natal sites). Unknown to the researchers, suppose the true percentage of adult females at the site of interest is 50%, while the true percentage of juvenile females is 75%. Thus, for our particular data, the dearth of definitively sexed juveniles would inevitably confound any derivative inferences on the sex distribution over the entire population, as well as within the juvenile subpopulation only, if we chose to ignore the nontrivial RVVMs by using measurement protocol ρ 2 .
For this particular dataset, 19 of the 38 sexed adult birds (all definitively sexed) were female, while only 4 out of the 12 juvenile birds were definitively sexed: all these birds were female. Of the remaining 8 partially sexed juveniles, 5 were actually female (unobserved under the realistic measurement protocols of ρ 2 and ρ 3 ).
For the three measurement protocols, we will compare how good of a job the resulting posteriors, and their corresponding Bayes' estimators, do at capturing the true sex distribution in the total population, as well as within the subpopulation of juveniles. We will then complicate the problem by incorporating weight and wing chord data, which will be seen to have differential effects on age and sex. Moreoever, we will compare how the RVVM approach, the only one that explicitly accounts for response process error, compares to a missing data approach where the non-definitively sexed individuals from measurement protocol ρ 2 have their sexes imputed.
All numerical calculations were performed in R [26]. Multiple imputations were performed using the 'mice' package [27]. Regression models were fit using the 'RStanArm' package to approximate the appropriate posteriors [28]. All R code is available in S2 Appendix of this article.

Subgroup analysis.
We will start by examining our estimates of the overall proportion of females. Table 2 contains the values of the Bayes' estimators and the standard deviations of the corresponding posterior distributions when we use the data generated from each of the three measurement protocols. All three estimators assume a naive prior of Beta (15,15) for the overall proportion female.
The estimated proportion of female birds is the same between the full fixed dataset (unobserved), ρ 1 , and the RVVM-generated dataset, ρ 3 . This is expected since, as we note in the proof of Proposition 3.2, using calibrated RVVMs does not inject any additional bias into the Bayes' estimator than what would already be present under the complete and accurate information measurements. However, the corresponding posterior distribution under the RVVMs is slightly Table 1. Example data layout and sample data for the example bird-banding measurement protocols. Note that the Bernoulli-valued measurements for sex give the observed response process certainty that the sample unit is female.

PLOS ONE
more dispersed, reflecting the inherent uncertainty in the nontrivial RVVMs and their use of partial, rather than complete, sample information. Note that the estimated proportion derived from the observed fixed measurements, ρ 2 , is not as accurate as the RVVM-derived estimate due to the decreased sample size (42 vs. 50) and the lack of partial information use. Note also that missing data techniques cannot be applied to the data generated by ρ 2 , simply because we are not utilizing information on any covariates. The RVVM framework of course makes no such requirement. Now consider what happens if we estimate the proportion of female birds according to age categorization. Here, we will be in a situation amenable to imputation of missing values under measurement protocol ρ 2 . Table 3 contains the output of logistic regressions for each of our four estimates of interest. Each of the four estimates are derived by assuming a default N(0, 2.5) prior on the 'age' effect and a default N(0, 10) prior on the model intercept.
The RVVM-based model does a much better job than either of the ρ 2 -generated model fits of reflecting the ideal model fit under the full fixed (unobserved) dataset generated by ρ 1 . Estimated model coefficients are far more accurate in the RVVM-based fit, and the corresponding posterior standard deviations are naturally wider than those from the idealized ρ 1 -generated dataset. Again, this reflects the inherent response process error inherent in the measurement protocol ρ 3 , captured by the nontrivial RVVMs.
Interestingly, the posterior uncertainties under the imputation-based approach are larger than the RVVM-based uncertainties. The reason for this becomes plain when we consider the estimated odds ratios within each age group (bottom two rows of Table 3). Recall that all the definitively sexed juveniles were female; thus, there is no way for an imputation procedure to assign a reasonable chance of observing a juve-nile male, as there are no complete observations on this subpopulation. Such structural confounding yields a severly inflated odds of female sex among juveniles and also inflates the variance in the posterior distributions of the model coefficients.

Multiple regression with RVVMs.
We now consider what happens when we fit slightly more complicated regression models in an attempt to uncover finer relationships between the four observed variables: sex, age, weight, and wing chord length. First, we aim to model sex as a function of age and weight. Consider the boxplots in Fig 1. The weight data have been generated so that adult female weights are distributed as N(50, 5) and adult male weights are distributed as N(60, 5) (see the S2 Appendix for reproducible code). For juvenile weights however, the distributions are normal mixtures; this introduces confounding via the measurement process. The idea is that underweight juveniles may be more difficult to definitively sex; thus, definitively sexed juvenile females have weights distributed as N (30,5), while partially sexed juvenile females have weights distributed as N (20,5). Similarly, definitively sexed juvenile males have weights distributed as N (40,5), while partially sexed juvenile males have weights distributed as N (30,5). No weight data are assumed missing. This type of measurement process confounding has two main effects: (1) all underweight birds will tend to be categorized as female by a missing data approach, and (2) since our data contain no definitively sexed juvenile males, the extra weight covariate will give us no informational leverage with which to model this subpopulation using a missing data approach. In contrast, the RVVM framework will allow us to fix both issues, since calibrated measurements will negate any confounding introduced by the incompleteness mechanism (on average), at the cost of the additional uncertainty generated by the nontrivial RVVMs. Table 4 contains the output of logistic regressions under each of our four comparison scenarios. Again, each of the four sets of estimates are derived by assuming default N(0, 2.5) priors on all covariates, and a N(0, 10) prior on the model intercept.
Examining the estimated coefficients only, it is quite clear that the RVVM approach closely aligns with the model estimates we would expect if the full true (unobserved) data were  available. Posterior uncertainty in the RVVM framework is comparable to that produced by the fit with the full true (unobserved) data, reflective of the fact that for these sample data, there is very little response process error present. In contrast, the missing data approach performs very poorly. Examining the estimated raw odds, the full information data generated under ρ 1 produce an odds of female among juveniles with weight = 30 g of 2.96, and an odds of female among juveniles with weight = 25 g of 9.42. These estimates are expected when one considers the distribution of the full (unobservd) data over subgroups, as in Fig 1. The estimated odds from the RVVM dataset are similar to the estimated odds from the full ρ 1 dataset. In contrast the estimated odds from the ρ 2 -generated dataset, with or without imputation, are horrendous over the juvenile subgroups. This is not surprising given how the sex data are not missing at random and that there are no definitively sexed juvenile males in our sample. It is important to recognize that the RVVM framework is not susceptible to this same source of confounding (on average) under calibration of the RVVMs. Put another way, there is no need for the partial data to be "incomplete at random," so long as the RVVMs are calibrated.
Next, we aim to model sex as a function of age and wing chord length. Fig 2 displays boxplots for the full information (unobserved) sample data generated under ρ 1 . Wing chord length was generated as random draws from a N(11, 1) distribution for juvenile females, and adults of both sexes. Thus, missing data in the 'sex' variable are now "missing at random" (MAR).
Wing chord length for juvenile males was generated from a N(8, 1) distribution. Here, we model sex as a function of age and wing chord length, using the same default priors that were used in the previous example. Table 5 displays the results.
Even though the 'sex' data are MAR, the results are similar to the previous example. The model estimates and the resulting estimated odds are still quite bad for the juvenile subgroups with the imputed dataset precisely because even though the missing 'sex' values are MAR, we do not actually observe any definitively sexed juvenile males. The juvenile male group is the only one that generates differential wing chord lengths on average; thus, an imputation procedure will tend to assign these different (smaller) wing chord lengths to the juvenile male group. However, this is a passive prediction born of a lack of alternatives rather than an informed categorization.
In contrast, the RVVM-derived estimates recover the true relationships between the variables, and the estimated odds align well with the estimates on the full and accurate information (unobserved) dataset. Also, note the comparable posterior uncertainties to those generated by the full information (unobserved) dataset; again, a reflection of the relatively small amount of response process error captured by ρ 3 vs. ρ 1 .
Finally, we consider an ordinary normal model for wing chord length as a function of sex and age. Table 6 summarizes the output from these model fits using the same default priors as the previous example.
The RVVM fits are once again closer to the fits given the full and accurate information (unobserved) data, but interestingly, the model fit on the ρ 2 -generated data with imputation does not actually appear to be too much worse if one only considers the fitted values in each subgroup. However, when attention turns to the model coefficients, it becomes clear why the model fit under measurement protocol ρ 3 is preferable to the ones fit under measurement protocol ρ 2 . Once again, the RVVM-based data produces more accurate estimates since it actually contains partial information on juvenile males, whereas the ρ 2 dataset contains no observed complete information on juvenile males. Moreover, there is clear evidence of an average difference in wing chord length between the juvenile male subgroup and any other subgroup when considering either the full information (unobserved) dataset under ρ 1 or the RVVM-generated

Research scenario 2: Diagnostic rating scales in clinical practice
For our second detailed application, consider a typical scenario in applied psychology where a trained psychologist must diagnose a patient for depression. While many diagnostic techniques and paradigms exist (e.g. see [29,30]), the Hamilton Depression Rating Scale (HAM-D), or one of its many variants, is a highly classical tool that is still widely used today (e.g. see [31][32][33] Clearly, there is no objective or consistent distinction between, say, the categories of "occasional weeping" and "frequent weeping." A large amount of subjective interpretation of those terms, as well as how well they apply to the particular patient in question, will inevitably factor into the health care professional's score assignment; i.e. into the sample measurement process. Notice too that this subjectivity can be unique to both the patient being scored and the assessor conducting the scoring. Various construals of semantical uncertainty, blurry categorization, partial ordering, and contextual applicability are all potential instances (see e.g. [17,[35][36][37]) of response process error.
The traditional approach requires the assessor to simply assign the best-fitting or most appropriate category to the patient for the item: an integer between 0 and 4. An RVVM approach however could require the assessor to indicate their confidence in the applicability of each of the 5 categories to the particular patient in question. Table 7 summarizes the type of sample data that would be generated by these two different measurement protocols (denoted by ρ 1 and ρ 2 respectively) for 8 hypothetical patients, assumed to all be assessed by the same health care professional. We have abused notation slightly here and recorded simply the vector parameter that characterizes each (sample) measurement protocol. So for the trivial RVVMs generated by ρ 1 corresponding to current practice, we record the support of the point-mass rather than the measure itself. The nontrivial RVVMs generated by ρ 2 are always (discrete) categorical measures on 5 categories (the integers 0 to 4); thus, we have simply encoded the measure of these atoms.
Notice that there are considerable differences among, say, all patients who received a score of "1" under measurement protocol ρ 1 that are completely hidden by the classical measurement apparatus. Classically, these patients are simply assigned their (subjective) "best" score as decided by their common assessor; thus, they are indistinguishable on this item; i.e. they all receive the same measurement under ρ 1 . But the use of (nontrivial) RVVMs, ρ 2 , reveals potentially important clinical differences: e.g. while the assessor is very confident in their score for patient 2, they instead retain considerable uncertainty in their score assignment for patient 1, leaning towards a more severe score. Put another way, a fixed score of "1" seems to mean something quite different for these two patients.
The same two measurement protocols could be applied to every item of the HAM-D, and so opportunities for further clinical distinctions to simultaneously manifest in the sample values of ρ 2 , and be hidden in the sample values of ρ 1 , will only accumulate. This means that two different diagnoses for the same patient can be reached depending on which measurement protocol is used. Table 8 summarizes what could happen for our 8 hypothetical patients. Notice that the HAM-D scores agree over the two measurement protocols quite well for some patients, e.g. patient 8, but differ in clinically significant ways for other patients. In particular, patient 6 and patient 7 would both be classified with "Moderate Depression" according to the classical, point-mass measurement protocol, ρ 1 . However, when accounting for various response process uncertainty in assigning item scores under ρ 2 , there is a noticeable separation of scores between the two patients. Moreover, under ρ 2 , patient 6 would be classified with "Severe Depression", while patient 7 would retain the "Moderate Depression" diagnosis.
The sum scores for patients assessed under measurement protocol ρ 2 were constructed by using the implied estimator from Proposition 2.3. Recall that this proposition asserted a Weak Law of Large Numbers for calibrated RVVMs, implying that the natural analogue of the traditional sample mean for nontrivial and calibrated RVVMs takes the form x dm o i ðxÞ: From this, we see that the natural analogue of the sample sum is n � � rðS Þ; this is the statistic used to compute the HAM-D scores under measurement protocol ρ 2 in Table 8. This estimator would also be a natural choice in this context because presumably, if we were following recommended clinical practice, each patient's item scores would be generated by an expert assessor, the health care professional. Thus, we would expect that the nontrivial RVVMs generated by ρ 2 will be calibrated to the phenomenon of interest according to Definition 2.2.
Notice that this assumption is not usually unique to the measurement protocol ρ 2 in applied practice; indeed, if the classical ρ 1 -generated HAM-D scores were used for clinical decisionmaking purposes, it would be implicitly assumed that they too were calibrated according to Definition 2.2; i.e. that they were accurately measuring the target phenomenon of interest. Psychometricians will often speak of the validity of a rating scale, and while that term has many different and often imprecise meanings (see Zumbo & Hubley [8] for thorough discussion), one key facet that the term usually encapsulates is exactly the idea that the measurement in use is fidelitous to the phenomenon. The notion of calibration introduced in this paper is certainly, at least, a part of that idea.

Discussion
The specific theory for Bernoulli-valued measurements developed and applied in the previous sections can be generalized in a straightfoward manner to categorical-valued measurements. The general theory of RVVMs of course applies equally well to non-discrete-valued measurements, although the analytical niceties of Section 3.2 become far less obvious. Nevertheless, the RVVM framework provides a coherent means of incorporating the quantification of response process error into any applied data analysis, albeit with the caveat that considerable computational power may be required to obtain useable estimates and make valid inferences.
The general idea of response process error and the specific mathematical machinery to quantify it proposed here share many conceptual features with more traditional ideas in the statistics literature, notably: measurement error, fuzzy statistics, elicitation, and missing data. I have already discussed the relationship between RVVMs and measurement error in the preceding sections. Now, consider the other three domains.
As previously indicated in the Introduction, notions from fuzzy numbers/statistics usually arise in practice via the application of "triangular numbers," e.g. [17,18]. In their more general formulations however (see e.g. [38,39]), fuzzy numbers are used to extend a real number to a certain kind of real-valued function, or a random variable to a certain kind of set-valued function. Fuzzy statistics tend to operate then as a means to construct new sample estimators from old ones using the arithmetic of fuzzy number systems, but still assuming that the sample data used to construct constituent estimators are deterministic (see e.g. [16]). The RVVM framework proposed in this paper operates instead by assigning a probability measure directly to each sample instantiation of a measurement process, which allows for the possibility that our sample observations are not simply (fixed) numbers or functions. This idea is similar in spirit to the "fuzzy information" approach developed by Okuda et al. [40] and Tanaka et al. [41], among others, where one assumes that observed sample data can be fuzzy numbers themselves. It would be interesting to investigate what results from this fuzzy information framework can be translated over to our measure-valued one; future work should focus on this. The idea of elicitation (see e.g. [42][43][44]) aims to use expert information that does not take the form of a fixed measurement of a sample process to improve inferences about the population process. This use of subjective and imprecise expert information makes elicitation conceptually similar to the RVVM framework, specifically to the case of calibrated measurement protocols. However, the two ideas differ substantially in the type of expert information gathered and in how it is eventually used to inform inference. The elicitation method aims to formally build expert information into an informative prior to improve inference; crucially, elicitation does not use expert information to adjust the actual sample measurements that are used to create a likelihood; i.e. it does not attempt to quantify response error in a sample measurement. Put another way, elicitation uses expert information to better calibrate the assumptions behind an inferential model (via a prior), while calibrated RVVMs use expert information to alter the sample data, and so the inferential model, directly (via the likelihood).
Missing data problems have a long history (see Rubin [45]), and techniques for handling them have enjoyed considerable success in a variety of fields (see e.g. [46][47][48]). Traditionally, the presence of response process error has sometimes been assumed to generate missing data, as in Ralph et al.'s [23] recommendations for indefinite age and sexing determinations in field ornithology (see Section 4.1). However, the phenomenon of response process error is not simply a type of missing data problem.
The structural distinctions are easy to make since the missing data framework assumes that all data are fixed (i.e. deterministic), even those data that are missing. One either observes a fixed measurement of a random variable Y, or one does not. Typically, when some fixed measurements are missing, one then proceeds to leverage information from complete observations on related random variables (covariates) to predict (i.e. impute) the unobserved values of Y. Critically, this process requires fixed measurements on auxiliary random variables to get started. Equally important, this process has nothing to say about response process uncertainty inherent in the (sample) measurement process itself.
Moreover, missing data techniques are model-dependent, whereas response process error is an essential feature of the sample data themselves. Partial information due to response process error is not the same thing as a total lack of information due to missingness. Indeed, there is a fundamental difference between a measurement process that, say, generates a partial species identification for an individual (say, 50% certainty between two possible species), and one that generates no information by simply not sampling or measuring the sample individual.
It should be clear now that a variety of distinctions exist between the idea of response process error (quantified via RVVMs) and related concepts of imprecise measurement, like traditional measurement error. It is important to note, however, that these different ideas need not occupy distinct domains in applied practice. In fact, it is entirely plausible that an applied researcher may find herself in a situation where the measurement process generates response process error in addition to actual missing data and traditional measurement error. If previous expert information relative to the study phenomena is also available, elicitation could of course be used to inform the priors. Triangular numbers too could be applied to credibility intervals resulting from any analysis to further inform the decision-making process. RVVMs provide a structured and mathematically coherent way of incorporating partial information due to response process error into an ordinary statistical analysis.
RVVMs arise naturally in a variety of applied research settings. For the most part though, the partial information that they generate has traditionally either had to be simplified (to the detriment of both accurate estimation and reliable inference), or discarded altogether. The theory developed in this paper is only a first step towards a robust and comprehensive theory of this type of sample data, but I contend that it is time to make explicit use of all the information contained in measurement processes subject to response process error.