Embedding Quantum into Classical: Contextualization vs Conditionalization

We compare two approaches to embedding joint distributions of random variables recorded under different conditions (such as spins of entangled particles for different settings) into the framework of classical, Kolmogorovian probability theory. In the contextualization approach each random variable is “automatically” labeled by all conditions under which it is recorded, and the random variables across a set of mutually exclusive conditions are probabilistically coupled (i.e., a joint distribution is imposed on them). Analysis of all possible probabilistic couplings for a given set of random variables allows one to characterize various relations between their separate distributions (such as Bell-type inequalities or quantum-mechanical constraints). In the conditionalization approach one considers the conditions under which the random variables are recorded as if they were values of another random variable, so that the observed distributions are interpreted as conditional ones. This approach is uninformative with respect to relations between the distributions observed under different conditions, because any set of such distributions is compatible with any distribution assigned to the conditions.


Joint Distributions and Stochastic Unrelatedness
Many scientific problems, from psychology to quantum mechanics, can be presented in terms of random outputs of some system recorded under various conditions. According to the principle of Contextuality-by-Default [1][2][3][4], when applying Kolmogorov's probability theory (KPT) to such a problem, random variables recorded under different, mutually incompatible conditions should be viewed as stochastically unrelated to each other, i.e., possessing no joint distribution. They can always be "sewn together" as part of their theoretical analysis, but joint distributions are then imposed on them rather than derived from their identities. In this paper we discuss two possible approaches to the foundational issue of "sewing together" stochastically unrelated random variables. We call these approaches contextualization and conditionalization. The former takes the Contextuality-by-Default principle as its departure point and is, in a sense, its straightforward extension; in the latter, Contextuality-by-Default is obtained as a byproduct.
To understand why the Contextuality-by-Default principle is associated with either of these two approaches, one should first of all abandon the naive notion that in KPT any two random variables have a joint distribution uniquely determined by their definitions. A random variable is a measurable function on a probability space, and the notion of a single probability space for all possible random variables (or, equivalently, the notion of a single random variable of which all other random variables are functions) is untenable. It contradicts the commonly used KPT constructions. One of them is, given any set, to construct a random variable whose range of possible values coincides with this set. A probability space on which all such random variables were defined would have to include a set of cardinality exceeding that of all possible sets, an impossibility. (In this discussion we impose no restrictions on the domain and codomain probability spaces. A random variable is therefore understood in the broadest possible way, including random vectors, random functions, random sets, etc. We will avoid, however, the use of general measure-theoretic formalism.)
Another commonly used construction is, given a random variable, to introduce another random variable that has a given distribution and is stochastically independent of the former. The use of this construction contradicts even the notion of a jointly distributed set of all variables with a particular distribution [2], say, the set $\mathrm{Norm}(0,1)$ of all standard-normally distributed random variables. Indeed, if all random variables in $\mathrm{Norm}(0,1)$ were jointly distributed, they would all be presentable as functions of some random variable $N$, the identity function on the probability space on which the random variables in $\mathrm{Norm}(0,1)$ are defined. Choose now a standard-normally distributed random variable $X$ so that it is independent of $N$. Then it is also independent of any $Y \in \mathrm{Norm}(0,1)$. Since $X$ cannot be independent of itself, $X$ cannot belong to $\mathrm{Norm}(0,1)$. At the same time, $X$ must belong to $\mathrm{Norm}(0,1)$ due to its distribution. Short of imposing on KPT artificial constraints (such as an upper limit on cardinality of the random variables' ranges), these and similar contradictions can only be dissolved by allowing for stochastically unrelated random variables defined on different probability spaces (see Ref. [5] for how this can be built into the basic set-up of probability theory).
The principle of Contextuality-by-Default eliminates guesswork from deciding which random variables are and which are not jointly distributed. Irrespective of how one defines a system with random outputs and identifies the conditions under which these outputs are recorded, the outputs are jointly distributed if they are recorded under one and the same set of conditions; if they are recorded under different, mutually exclusive conditions, they are stochastically unrelated.

Two Approaches
Contextualization and conditionalization differ in how they "sew together" stochastically unrelated random variables. To demonstrate these differences on a simple example, let $X$ and $Y$ be random variables with $+1/-1$ values, so that their distributions are determined by $\Pr[X=1]$ and $\Pr[Y=1]$, respectively. Let $X$ and $Y$ be recorded under mutually exclusive conditions.
In contextualization (the approach we proposed in Refs. [1][2][3][4]), one first invokes the Contextuality-by-Default principle to treat $X$ and $Y$ as stochastically unrelated random variables. A "sewing together" of $X$ and $Y$ consists in probabilistically coupling them [6], i.e., presenting them as functions of a single random variable. Put differently (but equivalently), we create a random variable (vector) $Z=(X',Y')$ such that $X'$ is distributed as $X$ and $Y'$ is distributed as $Y$. The variables $X'$ and $Y'$ are jointly distributed (otherwise $Z$ would not be called a random variable, or a random vector), but this distribution is not unique. Thus, $X$ and $Y$ can always be coupled as stochastically independent random variables, so that
$$\Pr[X'=1,Y'=1]=\Pr[X=1]\,\Pr[Y=1].$$
They can also be coupled as identical random variables, but only if $X$ and $Y$ are distributed identically,
$$\Pr[X=1]=\Pr[Y=1].$$
There can, in fact, be an infinity of couplings, constrained only by
$$\Pr[X'=1,Y'=1]+\Pr[X'=1,Y'=-1]=\Pr[X=1],\qquad \Pr[X'=1,Y'=1]+\Pr[X'=-1,Y'=1]=\Pr[Y=1].$$
In conditionalization, one creates a random variable $C$ with two possible values corresponding to the two sets of conditions under which one records $X$ and $Y$, respectively. Then one defines a random variable $U=(C,V)$, such that the conditional distribution of $V$ given $C=1$ is the same as the distribution of $X$, and the conditional distribution of $V$ given $C=2$ is the same as the distribution of $Y$. In other words,
$$\Pr[V=1\mid C=1]=\Pr[X=1],\qquad \Pr[V=1\mid C=2]=\Pr[Y=1].$$
The principle of Contextuality-by-Default here does not have to be invoked explicitly, but it is adhered to anyway: the random variable $V$ is related to conditions under which it is recorded, and $V$ conditioned on $C=1$ clearly has no joint distribution with $V$ conditioned on $C=2$.
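The family of couplings of two binary variables can be made concrete with a short sketch. It is only an illustration under stated assumptions: the function names `coupling_bounds` and `coupling_table` and the numerical values of $\Pr[X=1]$ and $\Pr[Y=1]$ are hypothetical, and the single free parameter is taken to be $r=\Pr[X'=1,Y'=1]$.

```python
from fractions import Fraction

def coupling_bounds(p, q):
    """Admissible range of r = Pr[X'=1, Y'=1] given p = Pr[X=1], q = Pr[Y=1]."""
    return max(0, p + q - 1), min(p, q)

def coupling_table(p, q, r):
    """Joint distribution of (X', Y') determined by r; keys are (x, y) values."""
    lo, hi = coupling_bounds(p, q)
    if not lo <= r <= hi:
        raise ValueError("r is outside the admissible range")
    return {(1, 1): r, (1, -1): p - r, (-1, 1): q - r, (-1, -1): 1 - p - q + r}

# Hypothetical marginals, chosen only for illustration.
p, q = Fraction(7, 10), Fraction(2, 5)

# The independent coupling corresponds to r = p * q and always exists.
independent = coupling_table(p, q, p * q)

# The identity coupling needs zero mass off the diagonal, which forces p = q = r.
identical = coupling_table(Fraction(1, 2), Fraction(1, 2), Fraction(1, 2))
```

Every admissible value of `r` reproduces the same two marginals, which is why the observed distributions alone do not single out a coupling.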
Conditionalization can also be implemented in more complex constructions, such as the one proposed in Ref. [7]. In our example, this construction amounts to replacing $V$ with two random variables, $V_1$ and $V_2$, and "coordinating" their possible values with the values of $C$. Thus, one can make $V_1$ and $V_2$ binary, $+1/-1$, and define the conditional distributions by
$$\Pr[V_1=v,V_2=1\mid C=1]=\Pr[V_1=v,V_2=-1\mid C=1]=\tfrac{1}{2}\Pr[X=v],$$
$$\Pr[V_1=1,V_2=v\mid C=2]=\Pr[V_1=-1,V_2=v\mid C=2]=\tfrac{1}{2}\Pr[Y=v],$$
where $v=1$ or $-1$. For $C=1$, as we see, the "relevant" output is $V_1$, and the probabilities of its values $v$ are simply evenly divided between the two possible values of the "irrelevant" output $V_2$ (and for $C=2$, $V_1$ and $V_2$ exchange places).
We argue in this paper that only contextualization serves as a useful tool for classifying and characterizing different types of systems involving random outputs that depend on conditions (e.g., classical-mechanical vs quantum-mechanical systems). Conditionalization, both in its simplest and modified versions, is always applicable but uninformative.
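The construction with $V_1$ and $V_2$ can be sketched as follows. This is a minimal illustration, not the procedure of Ref. [7] itself: the helper names `split_evenly` and `marginal` and the numerical probabilities are assumptions, and the "irrelevant" output is assumed to split each probability evenly, as described in the text.

```python
from fractions import Fraction
from itertools import product

def split_evenly(px1, py1):
    """Conditional distributions of (V1, V2) given C = 1 and given C = 2.

    px1 = Pr[X = 1] and py1 = Pr[Y = 1]; keys of the tables are (v1, v2).
    """
    px = {1: px1, -1: 1 - px1}
    py = {1: py1, -1: 1 - py1}
    half = Fraction(1, 2)
    # For C = 1 the relevant output is V1; V2 only halves each probability.
    given_c1 = {(v1, v2): half * px[v1] for v1, v2 in product((1, -1), repeat=2)}
    # For C = 2 the roles of V1 and V2 are exchanged.
    given_c2 = {(v1, v2): half * py[v2] for v1, v2 in product((1, -1), repeat=2)}
    return given_c1, given_c2

def marginal(table, index):
    """Marginal distribution of V1 (index 0) or V2 (index 1)."""
    out = {1: Fraction(0), -1: Fraction(0)}
    for values, prob in table.items():
        out[values[index]] += prob
    return out

# Hypothetical observed probabilities, for illustration only.
given_c1, given_c2 = split_evenly(Fraction(3, 4), Fraction(1, 5))
```

The relevant marginal reproduces the observed distribution, while the irrelevant one is uniform regardless of what was observed.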

Quantum Entanglement
Our analysis pertains to any input-output relations, as considered in Refs. [1][2][3][8][9][10][11]. The relations can be physical, biological, behavioral, social, etc. For the sake of mathematical transparency, however, we confine our consideration to the canonical quantum-mechanical paradigm [12] involving two entangled particles, "Alice's" and "Bob's." Alice measures the spin of her particle in one of two directions, $a_1$ or $a_2$ (values of the first input), and Bob measures the spin of his particle in one of two directions, $b_1$ or $b_2$ (values of the second input). Each pair of measurements is therefore characterized by one of four possible combinations of input values $(a_i,b_j)$, and it is these combinations that form the four conditions in this example. The spins recorded in each trial are realizations of random variables (outputs) $A$ and $B$, which, in the simplest case of spin-1/2 particles, can attain two values each: "up" or "down" (encoded by $+1$ and $-1$, respectively).
Aside from simplicity, another good reason for using this example is that it relates to a problem of great interest in the foundations of physics: in what way and to what extent can one embed joint probabilities of spins in entangled particles into the framework of KPT? It may seem that this question was answered by John Bell in his classical papers [13,14], and that the answer was: KPT is not compatible with the joint distributions of spins in entangled particles. However, in Bell's analysis and its subsequent elaborations [15,16] the use of KPT is constrained by an added assumption that has nothing to do with KPT. Namely, the implicit assumption in these analyses is that of "noncontextuality": a spin recorded in Alice's particle is a random variable uniquely identified by the measurement setting (spatial axis) for which it is recorded (and analogously for Bob's particle).
In other words, the spins recorded by Alice for settings $a_1$ and $a_2$ are different random variables $A_1$ and $A_2$, but the identity of either of them does not depend on whether Bob's setting is $b_1$ or $b_2$ (and analogously for Bob's random variables $B_1,B_2$ corresponding to $b_1$ and $b_2$). For well-established reasons (discussed in detail below), this makes a Kolmogorovian account of quantum entanglement impossible.
However, according to the Contextuality-by-Default principle, if one applies it to the Alice-Bob paradigm, any two random variables recorded under mutually exclusive conditions are labeled by these conditions and considered stochastically unrelated.
Alice's spin values recorded under the condition $(a_1,b_1)$ cannot co-occur with the spin values recorded by her under the condition $(a_1,b_2)$, even though $a_1$ is the same in both conditions. Therefore the identity of the spin she measures under $(a_1,b_1)$ should be viewed as different from the identity of the spin she measures under $(a_1,b_2)$. This leads one to the double-indexation of the spins,
$$(A_{11},B_{11},A_{12},B_{12},A_{21},B_{21},A_{22},B_{22}),\tag{7}$$
where $A_{ij}$ and $B_{ij}$ are the measurements by Alice and Bob, respectively, recorded under the condition $(a_i,b_j)$. This vector of random variables cannot be called a random vector (or random variable, as we use the term broadly), because its components are not jointly distributed. Thus, $A_{11}$ and $A_{12}$, or $A_{11}$ and $B_{12}$, are recorded under mutually exclusive conditions, so they do not have jointly observed realizations. But the outputs $A_{11}$ and $B_{11}$, being recorded under one and the same condition $(a_1,b_1)$, are jointly distributed, i.e., the joint probabilities for different combinations of co-occurring values of $A_{11}$ and $B_{11}$ are well-defined. The situation is summarized in diagram (8): each of the four pairs $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$, is jointly distributed, and no joint distribution is defined across different pairs.

Contextuality and No-Signaling
Why do we speak of "contextuality" and "noncontextuality"? The terms come from quantum mechanics (see, e.g., Refs. [17][18][19][20][21]), although it is not always clear that they are used in the same meaning as in the present paper. In the Alice-Bob paradigm with two spin-1/2 particles, the (marginal) distribution of Alice's measurement $A_{ij}$ does not depend on Bob's setting $b_j$, nor does the distribution of Bob's measurement $B_{ij}$ depend on $a_i$:
$$\Pr[A_{i1}=1]=\Pr[A_{i2}=1],\qquad \Pr[B_{1j}=1]=\Pr[B_{2j}=1],\qquad i,j\in\{1,2\}.$$
This is known as the no-signaling condition [22]: Alice, by watching outcomes of her measurements, is not able to guess Bob's settings, and vice versa. If the two particles are separated by a space-like interval, violations of no-signaling would contravene special relativity (and imply the "spooky action at a distance," in Einstein's words).
Nevertheless, in KPT, $A$ cannot be indexed by $a_i$ alone, nor can $B$ be indexed by $b_j$ alone.
The logic forbidding single-indexation of the spins, $A_1,A_2,B_1,B_2$, is simple [4]. Since, for any $i,j$, the random variables $A_i$ and $B_j$ are jointly distributed, they are defined on the same probability space. Applying this consideration to the four pairs $(A_1,B_1)$, $(A_1,B_2)$, $(A_2,B_1)$, $(A_2,B_2)$, we are forced to accept that all four random variables, $A_1,A_2,B_1,B_2$, are defined on one and the same probability space and hence are jointly distributed. The existence of this joint distribution, however, is known to be equivalent to Bell-type inequalities (see below), known not to hold for entangled particles.
Therefore, in perfect compliance with the Contextuality-by-Default principle, we are forced to use the double indexation (7). We can say that while $b_j$ does not influence $A_{ij}$ "directly" (which would be the case if $b_j$ could affect the distribution of $A_{ij}$), it generally creates a "context" for $A_{ij}$. The context makes $A_{i1}$ and $A_{i2}$ two different random variables with one and the same distribution, rather than one and the same random variable. (Analogous reasoning applies to $B_{ij}$ in relation to $a_i$.) It should not, of course, come as a surprise that different random variables can be identically distributed. After all, it is perfectly possible that the distributions of Alice's spins for $a_1$ and $a_2$ are identical too, and this would not imply that they are one and the same random variable. Within the framework of KPT, the difference between $A_{11}$ and $A_{12}$ is essentially the same as the difference between $A_{11}$ and $A_{21}$: in both cases we deal with stochastically unrelated random variables, the only difference being that in the former pair, unlike in the latter one, the no-signaling condition forces the two random variables to be identically distributed. The notion of contextuality, however, does require broadening of one's thinking about how one decides that some empirical observations are and some are not realizations of one and the same random variable, as understood in KPT [2,3].
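The no-signaling condition discussed in this section can be checked mechanically from the four observed joint distributions. The sketch below assumes a hypothetical data layout: a dictionary keyed by the index pair $(i,j)$ of the condition, each entry being a table of joint probabilities keyed by the spin values $(a,b)$. The function names and the numerical tables are illustrative only.

```python
from fractions import Fraction

def marginals(table):
    """Return (Pr[A = 1], Pr[B = 1]) for a joint table keyed by (a, b)."""
    pa = sum(p for (a, b), p in table.items() if a == 1)
    pb = sum(p for (a, b), p in table.items() if b == 1)
    return pa, pb

def satisfies_no_signaling(tables):
    """True if Alice's marginals ignore j and Bob's marginals ignore i."""
    for i in (1, 2):
        if marginals(tables[(i, 1)])[0] != marginals(tables[(i, 2)])[0]:
            return False
    for j in (1, 2):
        if marginals(tables[(1, j)])[1] != marginals(tables[(2, j)])[1]:
            return False
    return True

quarter = Fraction(1, 4)
uniform = {(1, 1): quarter, (1, -1): quarter, (-1, 1): quarter, (-1, -1): quarter}
skewed = {(1, 1): Fraction(1, 2), (1, -1): quarter,
          (-1, 1): Fraction(1, 8), (-1, -1): Fraction(1, 8)}

# Hypothetical systems: one compliant, one where Bob's setting shifts Alice's marginal.
compliant = {(i, j): uniform for i in (1, 2) for j in (1, 2)}
violating = dict(compliant)
violating[(1, 1)] = skewed
```

Note that the check compares distributions only; even when it passes, $A_{i1}$ and $A_{i2}$ remain distinct, stochastically unrelated random variables.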

Contextualization and Couplings
Contextualization is a straightforward extension of the Contextuality-by-Default principle. The latter creates the eight random variables in (7), and the contextualization approach consists in directly imposing a joint distribution on them. This can, of course, be done in infinitely many ways. Any random variable
$$Y=(A'_{11},B'_{11},A'_{12},B'_{12},A'_{21},B'_{21},A'_{22},B'_{22})\tag{10}$$
such that, for any $i=1,2$ and $j=1,2$,
$$(A'_{ij},B'_{ij})\ \text{is distributed as}\ (A_{ij},B_{ij}),\tag{11}$$
is called a (probabilistic) coupling for (7) [6]. The fact that $Y$ in (10) is referred to as a random variable (or random vector) implies that the components of $Y$ are jointly distributed, i.e., there is a joint probability assigned to each of the $2^8$ combinations of values of its eight components. One coupling that can always be constructed is the one in which the four pairs $(A'_{ij},B'_{ij})$ are mutually stochastically independent; it is referred to as an independent coupling. Its universal applicability leads to the common confusion of stochastic unrelatedness with stochastic independence. But stochastic independence is merely a special property of a joint distribution. The non-uniqueness of the coupling (10), rather than being a hindrance, can be advantageously used in theoretical analysis. According to the All-Possible-Couplings principle formulated in Refs. [2,3], a set of stochastically unrelated random variables is characterized by the set of all possible couplings that can be imposed on them, with no couplings being a priori privileged.
Thus, according to Ref. [1], the set of all possible couplings for (7) can be used to characterize various constraints imposed on the joint distributions of $(A_{ij},B_{ij})$ in (8).
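As an illustration of one member of the set of all couplings, the independent coupling of the four observed pairs can be built explicitly. This is a sketch under assumptions: the observed tables are hypothetical (uniform marginals with a chosen expectation $\langle A_{ij}B_{ij}\rangle$), and the names `table`, `independent_coupling` and `pair_marginal` are introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

PAIRS = [(1, 1), (1, 2), (2, 1), (2, 2)]        # index pairs (i, j)
VALUES = list(product((1, -1), repeat=2))       # values (a, b) of one pair

def table(e):
    """Hypothetical observed table with uniform marginals and <AB> = e."""
    return {(a, b): (1 + a * b * e) / 4 for a, b in VALUES}

def independent_coupling(tables):
    """Joint distribution of all eight variables with the four pairs independent."""
    coupling = {}
    for combo in product(VALUES, repeat=4):     # one (a, b) value per pair
        prob = Fraction(1)
        for pair, value in zip(PAIRS, combo):
            prob *= tables[pair][value]
        coupling[combo] = prob
    return coupling

def pair_marginal(coupling, k):
    """Marginal distribution of the k-th pair (in the order of PAIRS)."""
    out = {value: Fraction(0) for value in VALUES}
    for combo, prob in coupling.items():
        out[combo[k]] += prob
    return out

tables = {pair: table(Fraction(1, 2)) for pair in PAIRS}
tables[(2, 2)] = table(Fraction(-1, 2))
coupling = independent_coupling(tables)
```

The coupling assigns a probability to each of the $2^8=256$ value combinations and reproduces every observed pair distribution, which is all that the definition of a coupling requires.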
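From the point of view of all possible couplings, the noncontextuality assumption leading to the single-indexation of the spins, $A_i,B_j$, is equivalent to imposing an identity coupling on the double-indexed outputs in (7), i.e., creating a coupling (10)-(11) with the additional constraint
$$\Pr[A'_{i1}=A'_{i2}]=1,\qquad \Pr[B'_{1j}=B'_{2j}]=1,\qquad i,j\in\{1,2\}.\tag{12}$$
The Bell-type theorems [13][14][15][16] tell us that, given the no-signaling condition, such a coupling exists if and only if
$$-2\le\langle A_{11}B_{11}\rangle+\langle A_{12}B_{12}\rangle+\langle A_{21}B_{21}\rangle+\langle A_{22}B_{22}\rangle-2\langle A_{ij}B_{ij}\rangle\le 2,\qquad i,j\in\{1,2\},\tag{13}$$
where $\langle\ldots\rangle$ denotes expected value. Clearly, these inequalities do not have to be satisfied, and, in the Alice-Bob paradigm, for some quadruples of settings $(a_1,a_2,b_1,b_2)$, these inequalities are contravened by quantum theory and experimental data.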
Therefore, we have to use double-indexing and consider couplings other than the identity coupling (12). This is the essence of the contextualization approach, when applied to the Alice-Bob paradigm. In the conditionalization approach, discussed next, one also uses what can be thought of as a version of double-indexation (conditioning on the two indices viewed as values of a random variable), but instead of the couplings in the sense of (10)-(11) one uses a different theoretical construct, conditional couplings.
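Why an identity coupling forces Bell-type bounds can be illustrated by brute force. Under single-indexation, any joint distribution of $A_1,A_2,B_1,B_2$ is a mixture of the 16 deterministic assignments of $\pm 1$ values, so any linear combination of the expectations $\langle A_iB_j\rangle$ lies between its extreme values over these assignments. The sketch below enumerates them; the function name `chsh_combinations` is an assumption, and the four combinations are taken in the form "sum of all four products minus twice one of them".

```python
from itertools import product

def chsh_combinations(e11, e12, e21, e22):
    """The four combinations: sum of all four terms minus twice one of them."""
    total = e11 + e12 + e21 + e22
    return [total - 2 * e for e in (e11, e12, e21, e22)]

# Values attained by the combinations over all deterministic assignments
# of +1/-1 to the single-indexed variables A1, A2, B1, B2.
extremes = set()
for a1, a2, b1, b2 in product((1, -1), repeat=4):
    extremes.update(chsh_combinations(a1 * b1, a1 * b2, a2 * b1, a2 * b2))
```

Since only the values $2$ and $-2$ occur, every mixture stays within $[-2,2]$. This shows the necessity of the bounds only; their sufficiency is the content of the Bell-type theorems cited in the text.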

Conditionalization and Conditional Couplings
One of the simplest ways of creating stochastically unrelated random variables is to consider a two-stage tree of possibilities, referred to below as (14). We have at the first stage outcomes $a$ and $b$, and depending on which of them is realized, the choice between $c$ and $d$ occurs with generally different probabilities: $p$ and $1-p$ following $a$, and $q$ and $1-q$ following $b$. We can consider $a$ and $b$ as two mutually exclusive conditions, and use them to label the two random variables
$$X_a:\ \Pr[X_a=c]=p,\ \Pr[X_a=d]=1-p;\qquad X_b:\ \Pr[X_b=c]=q,\ \Pr[X_b=d]=1-q.$$
Clearly, $X_a$ and $X_b$ here do not have a joint distribution: e.g., no joint probability $\Pr[X_a=c,X_b=c]$ is defined because there is no commonly acceptable meaning in which $X_a=c$ may "co-occur" with $X_b=c$. The two random variables here are stochastically unrelated, in conformance with the Contextuality-by-Default principle.
The All-Possible-Couplings principle leads us to consider all joint distributions
$$\Pr[X'_a=c,X'_b=c]=r,\quad \Pr[X'_a=c,X'_b=d]=p-r,\quad \Pr[X'_a=d,X'_b=c]=q-r,\quad \Pr[X'_a=d,X'_b=d]=1-p-q+r,$$
with $\max(0,p+q-1)\le r\le\min(p,q)$. Each $r$ within this range defines a possible coupling $Y=(X'_a,X'_b)$ for $X_a$ and $X_b$. In particular, the independent coupling, with $r=pq$, is within the range, while the identity coupling, with $\Pr[X'_a=X'_b]=1$, is possible if and only if $r=p=q$.
There is, however, a more traditional view of $X_a$ and $X_b$ in (14). It consists in considering a joint distribution of two random variables, $C$ and $X$, with
$$\Pr[C=a,X=c]=\Pr[C=a]\,p,\quad \Pr[C=a,X=d]=\Pr[C=a]\,(1-p),\quad \Pr[C=b,X=c]=\Pr[C=b]\,q,\quad \Pr[C=b,X=d]=\Pr[C=b]\,(1-q),$$
where the marginal probabilities $\Pr[C=a]$ and $\Pr[C=b]$ are positive and sum to 1. $X_a$ is then interpreted as $X$ given $C=a$, and analogously for $X_b$. The conditional probabilities are computed as required,
$$\Pr[X=c\mid C=a]=p,\qquad \Pr[X=c\mid C=b]=q.$$
The idea suggested by this simple exercise is this: consider any set of stochastically unrelated random outputs labeled by mutually exclusive conditions as if these conditions were values of some random variable, and the outputs were values of another random variable conditioned upon the values of the former.
We call this approach conditionalization. It may seem to provide a simple alternative, within the framework of KPT, to considering all couplings imposable on stochastically unrelated variables. We will argue, however, that this alternative is not theoretically interesting.
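The two-stage tree example can be sketched in a few lines to show that the choice of the distribution of $C$ plays no role in recovering the observed distributions. The function names and the numerical values are hypothetical; `prob_a` stands for the freely chosen probability of the first-stage outcome $a$.

```python
from fractions import Fraction

def joint_of_c_and_x(prob_a, p, q):
    """Joint distribution of (C, X) for the two-stage tree.

    prob_a = Pr[C = a] (freely chosen, strictly between 0 and 1),
    p = Pr[X_a = c], q = Pr[X_b = c].
    """
    if not 0 < prob_a < 1:
        raise ValueError("both conditions need non-zero probability")
    return {
        ("a", "c"): prob_a * p,
        ("a", "d"): prob_a * (1 - p),
        ("b", "c"): (1 - prob_a) * q,
        ("b", "d"): (1 - prob_a) * (1 - q),
    }

def conditional(joint, condition, outcome):
    """Pr[X = outcome | C = condition] computed from the joint distribution."""
    total = sum(pr for (c, x), pr in joint.items() if c == condition)
    return joint[(condition, outcome)] / total

# Hypothetical values; any admissible prob_a returns the same conditionals.
p, q = Fraction(3, 10), Fraction(4, 5)
results = [
    (conditional(joint, "a", "c"), conditional(joint, "b", "c"))
    for joint in (joint_of_c_and_x(w, p, q)
                  for w in (Fraction(1, 2), Fraction(1, 100), Fraction(9, 10)))
]
```

The conditionals come out as $(p,q)$ for every choice of `prob_a`, so the distribution of the conditions carries no information about how $X_a$ and $X_b$ relate to each other.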
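Consider the conditionalization of our Alice-Bob paradigm. Consider, for $i=1,2$ and $j=1,2$, the observed joint probabilities $\Pr[A_{ij}=a,B_{ij}=b]$, with $a,b\in\{+1,-1\}$. Introduce a random variable $C$ with four values, $(a_1,b_1)$, $(a_1,b_2)$, $(a_2,b_1)$, $(a_2,b_2)$, and a random variable $X=(A',B')$ with four values, $(+1,+1)$, $(+1,-1)$, $(-1,+1)$, $(-1,-1)$. Form the tree of outcomes in which a value $(a_i,b_j)$ of $C$ is realized at the first stage, using arbitrarily chosen positive probabilities $p_{11},p_{12},p_{21},p_{22}$ (summing to 1), and a value $(a,b)$ of $(A',B')$ is realized at the second stage with probability $\Pr[A_{ij}=a,B_{ij}=b]$. The conditionalization is completed by computing the joint distribution of $C$ and $(A',B')$:
$$\Pr[C=(a_i,b_j),A'=a,B'=b]=p_{ij}\,\Pr[A_{ij}=a,B_{ij}=b].$$
Clearly, we have constructed a random variable
$$Z=(C,A',B')\tag{24}$$
such that the conditional distribution of $(A',B')$ given $C=(a_i,b_j)$ is the distribution of $(A_{ij},B_{ij})$. This $Z$ can be called a conditional coupling for the pairs $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$.
The conditionalization procedure does not have to claim the existence of any "true" or unique distribution of $C$. One can freely concoct this distribution, even if the conditions under which $A$ and $B$ are measured are chosen at will or according to a deterministic algorithm.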
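A conditional coupling for the Alice-Bob paradigm can be sketched as below. The observed tables are hypothetical (uniform marginals with a chosen expectation), the names `table`, `conditional_coupling` and `recover_conditionals` are assumptions, and conditions are represented by their index pairs $(i,j)$.

```python
from fractions import Fraction
from itertools import product

CONDITIONS = [(1, 1), (1, 2), (2, 1), (2, 2)]   # index pairs (i, j) of settings
VALUES = list(product((1, -1), repeat=2))       # spin values (a, b)

def table(e):
    """Hypothetical observed table with uniform marginals and <AB> = e."""
    return {(a, b): (1 + a * b * e) / 4 for a, b in VALUES}

def conditional_coupling(tables, p):
    """Joint probabilities Pr[C=(i,j), A'=a, B'=b] = p_ij * Pr[A_ij=a, B_ij=b]."""
    if any(p[c] <= 0 for c in CONDITIONS) or sum(p.values()) != 1:
        raise ValueError("p must be positive and sum to 1")
    return {(c, v): p[c] * tables[c][v] for c in CONDITIONS for v in VALUES}

def recover_conditionals(joint):
    """Conditional distribution of (A', B') given each value of C."""
    out = {}
    for c in CONDITIONS:
        total = sum(joint[(c, v)] for v in VALUES)
        out[c] = {v: joint[(c, v)] / total for v in VALUES}
    return out

tables = {c: table(Fraction(1, 2)) for c in CONDITIONS}
uniform_p = {c: Fraction(1, 4) for c in CONDITIONS}
skewed_p = {(1, 1): Fraction(7, 10), (1, 2): Fraction(1, 10),
            (2, 1): Fraction(1, 10), (2, 2): Fraction(1, 10)}
```

Both `uniform_p` and `skewed_p` give back the same conditional tables, in line with the remark that the distribution of $C$ can be freely concocted.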
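There are two interesting modifications of conditionalization, both proposed in a recent paper by Avis, Fischer, Hilbert, and Khrennikov [7]. Instead of the conditional coupling $Z$ in (24), they consider a random variable $(C,A'_1,A'_2,B'_1,B'_2)$ such that the conditional distribution of $(A'_i,B'_j)$ given $C=(a_i,b_j)$ is the distribution of $(A_{ij},B_{ij})$. In other words,
$$\Pr[A'_i=a,B'_j=b\mid C=(a_i,b_j)]=\Pr[A_{ij}=a,B_{ij}=b].\tag{28}$$
In one of the two modifications $A'_1,A'_2,B'_1,B'_2$ have two possible values each, $\pm 1$, and
$$\Pr[A'_i=a,B'_j=b,A'_{i'}=a',B'_{j'}=b'\mid C=(a_i,b_j)]=\tfrac{1}{4}\Pr[A_{ij}=a,B_{ij}=b],\tag{29}$$
where $i'\neq i$ and $j'\neq j$. That is, the probability of $A'_i=a,B'_j=b$ at $C=(a_i,b_j)$ is evenly partitioned among the four values of the "irrelevant" pair $(A'_{i'},B'_{j'})$. It is easy to see that one could as well use any other partitioning:
$$\Pr[A'_i=a,B'_j=b,A'_{i'}=a',B'_{j'}=b'\mid C=(a_i,b_j)]=t_{ij}(a',b')\,\Pr[A_{ij}=a,B_{ij}=b],$$
with nonnegative $t_{ij}(a',b')$ subject to
$$\sum_{a',b'}t_{ij}(a',b')=1.$$
Now, for any distribution of $C$ with non-zero values of $\Pr[C=(a_i,b_j)]$, the joint distribution of $(C,A'_1,A'_2,B'_1,B'_2)$ is well-defined and satisfies (28).
Another way of implementing (28) described in Ref. [7] is to allow each of $A'_1,A'_2,B'_1,B'_2$ to attain a third value, say, 0, in addition to $\pm 1$. This third value can be interpreted as "is not defined." It is postulated then that
$$\Pr[A'_i=a,B'_j=b,A'_{i'}=0,B'_{j'}=0\mid C=(a_i,b_j)]=\Pr[A_{ij}=a,B_{ij}=b].\tag{31}$$
Again, it is easy to see that the joint distribution of $(C,A'_1,A'_2,B'_1,B'_2)$ is well-defined and satisfies (28) for any distribution of $C$ with non-zero values of $\Pr[C=(a_i,b_j)]$.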

Comparing the Two Approaches
Conditionalization and contextualization achieve the same goal: "sewing together" stochastically unrelated random variables within the confines of KPT. But the similarity ends there. Consider, e.g., the Alice-Bob experiment in which both Alice and Bob use some random generators to choose between two possible measurement directions. Clearly then $C$ is objectively a random variable, and a joint distribution of $(A,B)$ and $C$ objectively exists. Put differently, in this case $(A',B')$ given (25) is simply equal to $(A_{ij},B_{ij})$. However, whether $C$ is objectively a random variable or a distribution for the settings is invented, the quantum-mechanical analysis of the situation begins with computing the (conditional) distributions of $(A,B)$ at different settings. The distribution of $C$ in no way advances our understanding of how $(A_{ij},B_{ij})$ for different $(i,j)$ are related to each other.
Thus, we know that the entangled spin-1/2 particles are subject to Tsirelson's inequalities
$$-2\sqrt{2}\le\langle A_{11}B_{11}\rangle+\langle A_{12}B_{12}\rangle+\langle A_{21}B_{21}\rangle+\langle A_{22}B_{22}\rangle-2\langle A_{ij}B_{ij}\rangle\le 2\sqrt{2},\qquad i,j\in\{1,2\}.$$
We also know that if the two particles were not entangled, they would be subject to the Bell-CH-Fine inequalities (13). The difference between these two constraints is not reflected in the "true" distribution of $C$, if it exists, nor is it implied by or can in any way restrict the possible choices of "imaginary" distributions of $C$. In fact, the only restriction imposed on the distribution of $C$, a universal one, is that none of the conditions should have probability zero, because this would make the conditional probabilities undefined. Moreover, the set of possible conditional couplings is the same whether the no-signaling condition is or is not satisfied.
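The two constraints mentioned here depend only on the four observed expectations, which the following sketch makes explicit. It assumes the inequalities in the form "sum of the four expectations minus twice one of them, bounded in absolute value by 2 (Bell-CH-Fine) or by $2\sqrt{2}$ (Tsirelson)"; the function names, the labels returned, and the small numerical tolerance are illustrative choices.

```python
import math

def max_chsh(e11, e12, e21, e22):
    """Largest absolute value among the four Bell-type combinations."""
    total = e11 + e12 + e21 + e22
    return max(abs(total - 2 * e) for e in (e11, e12, e21, e22))

def classify(expectations, tol=1e-12):
    """Label a quadruple of expectations <A_ij B_ij> by the bound it meets."""
    value = max_chsh(*expectations)
    if value <= 2 + tol:
        return "Bell-CH-Fine"
    if value <= 2 * math.sqrt(2) + tol:
        return "Tsirelson only"
    return "neither"

s = 1 / math.sqrt(2)
labels = [
    classify((0.5, 0.5, 0.5, -0.5)),   # combinations reach exactly 2
    classify((s, s, s, -s)),           # combinations reach 2 * sqrt(2)
    classify((1, 1, 1, -1)),           # combinations reach 4
]
```

No distribution of the conditions enters the computation: the classification is a property of the observed pair distributions alone.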
Although in this discussion we assumed that conditionalization was implemented in its simplest version, (24)-(25), our arguments and conclusions apply verbatim to the modifications proposed in Ref. [7] and described at the end of the previous section. The conditional distributions of $A'_1,A'_2,B'_1,B'_2$ for the four values of $C$ in (29) and (31) are uniquely determined by the observed distributions of the four pairs $(A_{ij},B_{ij})$. But whatever these distributions, they can be paired with any distribution of $C$, provided none of its values has zero probability.
All of this stands in a clear contrast to the analysis of all possible couplings (10) in the contextualization approach [1][2][3][4]. In this approach we can ask various questions about the compatibility of couplings with various constraints known to hold for the observable joint distributions. Thus, we may ask about the fitting set of couplings for a given constraint (say, Bell or Tsirelson inequalities), i.e., the couplings that are compatible with the spin distributions subject to the constraint. We can also ask about the forcing set of couplings, those compatible only with the spin distributions subject to a given constraint. Or we can conjoin the two questions and ask about the equivalent set of couplings, those compatible with and only with the spin distributions subject to the constraint. The answers to such questions will be different for different constraints being considered.
Since the four observed joint distributions of $(A'_{ij},B'_{ij})$ in (11) are themselves part of the couplings (10), the questions above are only interesting if they are formulated in terms of the unobservable parts of the couplings.
In the examples below we characterize the couplings in terms of the connections [1,2,4], which are the (unobservable) pairs
$$(A_{11},A_{12}),\quad (A_{21},A_{22}),\quad (B_{11},B_{21}),\quad (B_{12},B_{22}).\tag{33}$$
In relation to the pairs whose joint distributions are known from observations (compare with diagram (8)), the connections link $A_{i1}$ with $A_{i2}$ and $B_{1j}$ with $B_{2j}$, i.e., outputs recorded under the same setting of one input but different settings of the other.
Let us assume that the probability of spin-up ($+1$) outcome for every (spin-1/2) particle in the Alice-Bob paradigm is 1/2. (As shown in Ref. [23], this can always be achieved by a simple procedural modification of the canonical Alice-Bob experiment.) This assumption is, of course, in compliance with the no-signaling condition, which therefore can be omitted from all formulations below.
We know [1] that the following two statements about connections are equivalent:
$(S_1)$ a vector of connections (33) is compatible with and only with those distributions of $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$, that satisfy the Bell-CH-Fine inequalities (13);
is equivalent to
$(S'_1)$ a vector of connections (33) is such that
$$\langle A_{11}A_{12}\rangle=\pm 1,\quad \langle A_{21}A_{22}\rangle=\pm 1,\quad \langle B_{11}B_{21}\rangle=\pm 1,\quad \langle B_{12}B_{22}\rangle=\pm 1,$$
where the number of $+$ signs among the four expected values is 4, 2, or 0.
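The sign patterns admitted by the condition "the number of $+$ signs is 4, 2, or 0" can be enumerated directly. In the sketch below each tuple lists hypothetical values of the four connection expectations in the order $\langle A_{11}A_{12}\rangle,\langle A_{21}A_{22}\rangle,\langle B_{11}B_{21}\rangle,\langle B_{12}B_{22}\rangle$; the variable names are illustrative.

```python
from itertools import product

# All vectors of +1/-1 connection expectations with an even number of + signs.
admissible = [signs for signs in product((1, -1), repeat=4)
              if signs.count(1) % 2 == 0]

# Equivalent description: the product of the four expectations equals +1.
by_product = [signs for signs in product((1, -1), repeat=4)
              if signs[0] * signs[1] * signs[2] * signs[3] == 1]
```

The list contains eight vectors, one of which, $(1,1,1,1)$, corresponds to the identity connections.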
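The equivalence of these two statements is an expanded version of Fine's theorem [16], whose formulation in the language of connections is: the identity connections, those with
$$\langle A_{11}A_{12}\rangle=\langle A_{21}A_{22}\rangle=\langle B_{11}B_{21}\rangle=\langle B_{12}B_{22}\rangle=1,$$
are compatible with and only with those distributions of $(A_{ij},B_{ij})$ that satisfy the Bell-CH-Fine inequalities (13).
We see that although the expectations $\langle A_{i1}A_{i2}\rangle$ and $\langle B_{1j}B_{2j}\rangle$ for the connections are not observable, they provide a theoretically meaningful way of characterizing the way in which the stochastically unrelated and observable $(A_{ij},B_{ij})$ are being "sewn together." And these ways are different for the Bell-CH-Fine and Tsirelson inequalities.
What can contextualization tell us about the basic predictions of the quantum theory for the Alice-Bob experiment? The theory tells us that, for $i=1,2$ and $j=1,2$,
$$\langle A_{ij}B_{ij}\rangle=-\langle a_i\mid b_j\rangle,\tag{39}$$
where $\langle a_i\mid b_j\rangle$ is the dot product of two unit vectors. It can be shown [25][26][27] that the four expectations $\langle A_{ij}B_{ij}\rangle$ can be presented in the form (39) using a quadruple of settings $(a_1,a_2,b_1,b_2)$ if and only if
$$\left|\arcsin\langle A_{11}B_{11}\rangle+\arcsin\langle A_{12}B_{12}\rangle+\arcsin\langle A_{21}B_{21}\rangle+\arcsin\langle A_{22}B_{22}\rangle-2\arcsin\langle A_{ij}B_{ij}\rangle\right|\le\pi,\qquad i,j\in\{1,2\}.\tag{40}$$
These inequalities are "sandwiched" between the Bell-CH-Fine ones and Tsirelson ones. That is, they are implied by the former and imply the latter. It is shown in Ref. [4] that
$(S_3)$ there is no vector of connections (33) that is compatible with and only with those distributions of $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$, that satisfy the quantum inequalities (40).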
Moreover, this negative statement still holds if one replaces the connections (33) with any other subsets of the components of (10). No distributions of such subsets are compatible with and only with those distributions of $(A_{ij},B_{ij})$ that satisfy the quantum inequalities (40).
The investigation of the forcing set of couplings provides additional insights into the special nature of quantum mechanics. The result we have [4] says that the following two statements about connections are equivalent (note the change from "with and only with" of the previous statements to "only with"):
$(S_4)$ a vector of connections (33) is compatible only with those distributions of $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$, that satisfy the quantum inequalities (40);
is equivalent to
$(S'_4)$ a vector of connections (33) is compatible only with those distributions of $(A_{ij},B_{ij})$, $i,j\in\{1,2\}$, that satisfy the Bell-CH-Fine inequalities (13).
In other words, a choice of connections can force all $(A_{ij},B_{ij})$ compatible with them to comply with quantum mechanics only in the form of their compliance with classical mechanics.
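The relation between the quantum predictions and the three families of inequalities discussed in this section can be explored numerically. The sketch below rests on assumptions that should be kept in mind: the expectations are taken as $\langle A_{ij}B_{ij}\rangle=-\cos\theta_{ij}$ for coplanar unit vectors at angle $\theta_{ij}$ (the singlet-state form), the quantum inequalities are taken in the arcsine form with bound $\pi$, and all function names, angles, and tolerances are illustrative.

```python
import math

def singlet_expectations(alice_angles, bob_angles):
    """[e11, e12, e21, e22] with e_ij = -cos(angle between a_i and b_j)."""
    return [-math.cos(a - b) for a in alice_angles for b in bob_angles]

def max_chsh(e):
    """Largest absolute value of (sum of all four) - 2 * (one of them)."""
    total = sum(e)
    return max(abs(total - 2 * x) for x in e)

def satisfies_arcsin_inequalities(e, tol=1e-9):
    """Arcsine form of the quantum inequalities, with bound pi."""
    total = sum(math.asin(x) for x in e)
    return all(abs(total - 2 * math.asin(x)) <= math.pi + tol for x in e)

# Settings in a common plane, given by angles in radians (a standard choice
# that maximizes the Bell-type combination).
e = singlet_expectations((0.0, math.pi / 2), (math.pi / 4, -math.pi / 4))

violates_bell = max_chsh(e) > 2
within_tsirelson = max_chsh(e) <= 2 * math.sqrt(2) + 1e-9
within_quantum = satisfies_arcsin_inequalities(e)
```

For these settings the Bell-type bound of 2 is exceeded while the arcsine and Tsirelson bounds are met exactly, consistent with the "sandwiched" position of the quantum inequalities. A quadruple such as $(1,1,1,-1)$ fails the arcsine test as well.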

Conclusion
The examples just given should suffice to illustrate the point made: while both contextualization and conditionalization embed any input-output relation into the framework of KPT, only contextualization provides a useful tool for understanding the nature of various constraints imposed on the observable joint distributions (one could say also, for different types and levels of contextuality). Conditionalization is uninformative, as any distribution of the conditions is compatible with any distributions of the conditional random variables.