Improved Measures of Integrated Information

Although there is growing interest in measuring integrated information in computational and cognitive systems, current methods for doing so in practice are computationally unfeasible. Existing and novel integration measures are investigated and classified by various desirable properties. A simple taxonomy of Φ-measures is presented where they are each characterized by their choice of factorization method (5 options), choice of probability distributions to compare (3 × 4 options) and choice of measure for comparing probability distributions (7 options). When requiring the Φ-measures to satisfy a minimum of attractive properties, these hundreds of options reduce to a mere handful, some of which turn out to be identical. Useful exact and approximate formulas are derived that can be applied to real-world data from laboratory experiments without posing unreasonable computational demands.


I. INTRODUCTION
What makes an information-processing system conscious in the sense of having a subjective experience? Although many scientists used to view this topic as beyond the reach of science, the study of Neural Correlates of Consciousness (NCCs) has become quite mainstream in the neuroscience community in recent years; see, e.g., [1,2]. To move beyond correlation to causation [3], neuroscientists have begun searching for a theory of consciousness that can predict what physical phenomena cause consciousness (defined as subjective experience [3]) to occur. Dehaene [4] reviews a number of candidate theories currently under active discussion, including the Nonlinear Ignition model (NI) [5,6], the Global Neuronal Workspace (GNW) model [7,8,9] and Integrated Information Theory (IIT) [10,11]. Rapid progress in artificial intelligence is further fueling interest in such theories and how they can be generalized to apply not only to biological systems, but also to engineered systems such as computers and robots and ultimately arbitrary arrangements of elementary particles [12].
Although there is still no consensus on necessary and sufficient conditions for a physical system to be conscious, there is broad agreement that it needs to be able to store and process information in a way that is somehow integrated, not consisting of nearly independent parts. As emphasized by Tononi [10], it must be impossible to decompose a conscious system into nearly independent parts; otherwise these parts would feel like two separate conscious entities. While integration as a necessary condition for consciousness is rather uncontroversial, IIT goes further and makes the bold and controversial claim that it is also a sufficient condition for consciousness, using an elaborate mathematical integration definition [11].
As neuroscience data improves in quantity and quality, it is timely to resolve this controversy by testing the many experimental predictions that IIT makes [11] with state-of-the-art laboratory measurements. Unfortunately, such tests have been hampered by the fact that the integration measure proposed by IIT is computationally infeasible to evaluate for large systems, growing super-exponentially with the system's information content. This has led to the development of various alternative integration measures that are simpler to compute. For example, Barrett & Seth [13] proposed an attractive integration measure that is easier to compute from neuroscience data, but whose interpretation is complicated by the fact that it can be negative in some cases. Ref. [14] used an integration measure inspired by complexity theory to successfully predict who was conscious in a sample including patients who were awake, in deep sleep, dreaming, sedated and with locked-in syndrome. Even the team behind IIT has updated their integration measure twice through successive refinements of their theory [10,11]. Despite these definitional and computational challenges, interest in measuring integration is growing, not only in neuroscience but also in other fields, ranging from physics [12] to the study of collective intelligence in social networks [15].
It is therefore interesting and timely to do a comprehensive investigation of existing and novel integration measures, classifying them by various desirable properties. This is the goal of the present paper, as summarized in Table I and Table II. The rest of this paper is organized as follows. In Section II, we investigate general integration measures and their properties. In Section IV, we derive useful formulas for many of these measures that can be applied to the sort of time-series data with continuous variables that is typically measured in laboratory experiments. We explore further algorithmic speedups and approximations in Section V and summarize our conclusions in Section VI.

II. MEASURES OF INTEGRATION
Following Tononi [10], we will use the symbol Φ to denote integrated information. All measures of Φ aim to quantify the extent to which a system is interconnected, yielding Φ = 0 if the system consists of two independent parts, and a larger Φ the more the parts affect each other.

TABLE I: Properties of the integration measures $\phi^{2.0}$ (three columns), $\phi^{3.0}$, $\phi^{M}$, $\phi^{B}$, $\phi^{M}_{kk'}$, $\phi^{oakk}$, $\phi^{opkk}$, $\phi^{otsk}$, $\phi^{ofuk}$, $\phi^{nask}$, $\phi^{mask}$ and $\phi^{xfkk}$. The table indicates when an integration measure lacks a desirable property or has an undesirable one. The first four properties are generally agreed to be important, while the second set of four have been argued to be important by some authors. $\phi^{M} \equiv \phi^{otuk}$ and $\phi^{M}_{kk'} \equiv \phi^{ofkk}_{kk'}$.

TABLE II: Integration measures. Columns: Name, Definition, Formula for Gaussian variables.

Mathematically, all Φ-measures are defined in a two-step process:
1. Given an imaginary cut that partitions the system into two parts, define a measure φ of how much these two parts affect each other. Table II lists many φ-options.
2. Define Φ as the φ-value for the "cruelest cut" that minimizes φ. A major numerical challenge is that the number of cuts to be minimized over grows super-exponentially with the number of bits in the system. A further challenge in this step is how to best handle cuts splitting the system into parts of unequal size.
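The second step can be illustrated with a brute-force search. The following is a minimal sketch, not the procedure of any particular Φ-measure: it enumerates every split of n labelled elements into two non-empty parts and returns the split minimizing a user-supplied φ. The function names and the toy φ (the total coupling weight crossing the cut, taken from a hypothetical weight table `W`) are assumptions made purely for illustration, and no normalization for unequal part sizes is attempted.

```python
from itertools import combinations

def bipartitions(n):
    """Yield each split of elements 0..n-1 into two non-empty parts exactly once."""
    elements = range(n)
    for size in range(1, n // 2 + 1):
        for part_a in combinations(elements, size):
            # For two equal-sized halves, keep only the split whose part A
            # contains element 0, so (A, B) and (B, A) are not both produced.
            if 2 * size == n and 0 not in part_a:
                continue
            part_b = tuple(e for e in elements if e not in part_a)
            yield part_a, part_b

def cruelest_cut(n, phi):
    """Return (phi_min, cut) for the bipartition minimizing phi(part_a, part_b)."""
    return min(((phi(a, b), (a, b)) for a, b in bipartitions(n)),
               key=lambda t: t[0])

# Hypothetical toy system: a chain 0-1-2-3 with a weak link in the middle.
W = {(0, 1): 5.0, (1, 2): 1.0, (2, 3): 5.0}

def toy_phi(part_a, part_b):
    """Illustrative stand-in for phi: total weight of links crossing the cut."""
    return sum(w for (u, v), w in W.items() if (u in part_a) != (v in part_a))

phi_min, cut = cruelest_cut(4, toy_phi)
print(phi_min, cut)
```

Even this toy search visits $2^{n-1} - 1$ cuts for n elements, which is why the cost of evaluating φ for each cut matters so much in practice.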
Before delving into the many different options for defining Φ, let us first introduce convenient notation general enough to describe all proposed integration measures, as illustrated in Figure 1.

FIG. 1: We model the time-evolution of the system state as a Markov process defined by a transition matrix M: when the (possibly unknown) system state evolves from $x_0$ to $x_1$, the corresponding probability distribution evolves from $p_0$ to $p_1 \equiv M p_0$. All competing definitions of Φ quantify the inability to tensor factorize M, which corresponds to approximating the system as two disconnected parts A and B that do not affect one another.

A. Interpreting evolution as a Markov process
Consider two random vectors $x_0$ and $x_1$ whose joint probability distribution is $p(x_0, x_1)$. We will interpret them as the state of a time-dependent system $x(t)$ at two separate times $t_0$ and $t_1$. For example, if these are two vectors of 5 bits each, then $p$ is a table of $2^{10}$ numbers giving the probability of each possible bit string, while if these are two vectors in 3D space, then $p$ is a function of 6 real continuous variables. We obtain the marginal distribution $p^{(n)}(x_n)$ for the $n$th vector, where $n = 0$ or $n = 1$, by summing/integrating $p$ over the other vector.
Below we will often find it convenient to denote these vectors as single indices $i = x_0$ and $j = x_1$. For example, this allows us to write the marginal distribution $p^{(0)}(x_0)$ as $\sum_j p_{ij}$, where the sum over $j$ is to be interpreted as summation/integration over all allowed values of $x_1$. We also adopt the notation where replacing an index by a dot means that this index is to be summed/integrated over. This lets us write the marginal distributions $p^{(0)}(x_0)$ and $p^{(1)}(x_1)$ as
$$p^{(0)}_i = p_{i\cdot}, \qquad p^{(1)}_j = p_{\cdot j}. \tag{1}$$
As illustrated in Figure 1, it is always possible to model this relation between $x_0$ and $x_1$ as resulting from a Markov process, where $x_1$ is causally determined by a combination of $x_0$ and random effects. If we write the marginal distributions from equation (1) as vectors $p^{(0)}$ and $p^{(1)}$, this Markov process is defined by
$$p^{(1)} = M p^{(0)}, \tag{2}$$
where the Markov matrix $M_{ji}$ specifies the probability that a state $i$ transitions to a state $j$, and satisfies the conditions $M_{ji} \ge 0$ (non-negative transition probabilities) and $M_{\cdot i} = 1$ (unit column sums, guaranteeing probability conservation). The standard rule for conditional probabilities gives
$$p_{ij} = M_{ji}\, p^{(0)}_i, \tag{3}$$
which uniquely determines the Markov matrix as
$$M_{ji} = \frac{p_{ij}}{p_{i\cdot}}, \tag{4}$$
which is seen to satisfy the Markov requirements $M_{ji} \ge 0$ and $M_{\cdot i} = 1$. Note that any system obeying the laws of classical physics can be accurately modeled as a Markov process as long as the time step $\Delta t \equiv t_1 - t_0$ is sufficiently short (defining $x(t)$ as the position in phase space). If the process has "memory" such that the next state depends not only on the current state but also on some finite number of past states, it can be reformulated as a standard memoryless Markov process by simply expanding the definition of the state $x$ to include elements of the past.
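As a concrete illustration of the relation between the joint distribution and the Markov matrix, here is a minimal NumPy sketch, assuming a discrete system whose joint distribution is stored as an array `p[i, j]` with every initial state having non-zero probability. The function name `markov_from_joint` and the random toy distribution are illustrative assumptions, not part of the paper.

```python
import numpy as np

def markov_from_joint(p):
    """Return M with M[j, i] = p[i, j] / p[i, .].

    Assumes p is a 2-D array of joint probabilities with every row sum > 0.
    """
    p0 = p.sum(axis=1)            # marginal over x_1, i.e. p^(0)_i
    return (p / p0[:, None]).T    # transpose so columns index the initial state i

# Hypothetical toy joint distribution over 4 initial and 4 final states.
rng = np.random.default_rng(0)
p = rng.random((4, 4))
p /= p.sum()

M = markov_from_joint(p)
p0, p1 = p.sum(axis=1), p.sum(axis=0)

print(np.allclose(M.sum(axis=0), 1.0))   # unit column sums
print(np.allclose(M @ p0, p1))           # the Markov step maps p^(0) to p^(1)
```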

B. A taxonomy of integration measures
We will now see that this Markov process interpretation allows us to create a simple taxonomy of integration measures φ that quantify the interaction between two subsystems. The idea is to approximate the Markov process by a separable Markov process that does not mix information between subsystems, and to define the integration as a measure of how bad the best such approximation is. Consider the system $x$ as being composed of two subsystems $x^A$ and $x^B$, so that the elements of the vector $x$ are simply the union of the elements of $x^A$ and $x^B$, and let us define the probability distribution
$$p_{ii'jj'} \equiv P(x^A_0 = i,\; x^B_0 = i',\; x^A_1 = j,\; x^B_1 = j'). \tag{5}$$
The Markov matrix of equation (4) then takes the form
$$M_{jj'ii'} = \frac{p_{ii'jj'}}{p_{ii'\cdot\cdot}}. \tag{6}$$
The Markov process of equation (2) is separable if the Markov matrix $M$ is a tensor product $M^A \otimes M^B$, i.e., if
$$M_{jj'ii'} = M^A_{ji}\, M^B_{j'i'} \tag{7}$$
for Markov matrices $M^A$ and $M^B$ that determine the evolution of $x^A$ and $x^B$. If our system is integrated so that $M$ cannot be factored as in equation (7), we can nonetheless choose to approximate $M$ by a matrix of the factorizable form $M^A \otimes M^B$. If we retain the initial probability distribution $p_{ii'\cdot\cdot}$ for $x_0$ but replace the correct Markov matrix $M$ by the separable approximation $M^A \otimes M^B$, then equation (6) shows that the probability distribution $p_{ii'jj'} = M_{jj'ii'}\, p_{ii'\cdot\cdot}$ gets replaced by the probability distribution $q_{ii'jj'}$ given by
$$q_{ii'jj'} \equiv M^A_{ji}\, M^B_{j'i'}\, p_{ii'\cdot\cdot}, \tag{9}$$
which is an approximation of $p_{ii'jj'}$. If $M$ is factorizable (meaning that there is no integration), we can factor $M$ such that the two probability distributions $q_{ii'jj'}$ and $p_{ii'jj'}$ are equal and, conversely, if the two probability distributions are different, we can use how different they are as an integration measure φ.
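The separable approximation can be made concrete with a short NumPy sketch. It assumes discrete subsystems and stores the joint distribution as a four-index array `p[i, i', j, j']`; the helper names, array layout and random test system are illustrative choices rather than anything prescribed by the text. The example builds a system whose dynamics are exactly separable and confirms that the full Markov matrix is then the Kronecker (tensor) product of the two factors and that $q$ reproduces $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_markov(n):
    """Random n x n matrix with non-negative entries and unit column sums."""
    m = rng.random((n, n))
    return m / m.sum(axis=0)

def separable_q(p, MA, MB):
    """q[i, i', j, j'] = MA[j, i] * MB[j', i'] * p[i, i', ., .]."""
    p0 = p.sum(axis=(2, 3))
    return np.einsum('ji,lk,ik->ikjl', MA, MB, p0)

nA, nB = 2, 3
MA, MB = random_markov(nA), random_markov(nB)
p0 = rng.random((nA, nB))
p0 /= p0.sum()                    # the initial state may be correlated

# Joint distribution whose dynamics are exactly separable.
p = np.einsum('ji,lk,ik->ikjl', MA, MB, p0)

# Full Markov matrix M[j, j', i, i'], flattened to an ordinary matrix.
M4 = np.transpose(p / p0[:, :, None, None], (2, 3, 0, 1))
M = M4.reshape(nA * nB, nA * nB)

q = separable_q(p, MA, MB)
print(np.allclose(M, np.kron(MA, MB)), np.allclose(q, p))
```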
To define an integration measure φ in this spirit, we thus need to make four different choices, which collectively specify it fully and determine where the φ-measure belongs in our taxonomy:
1. Choose a recipe defining an approximate factorization $M \approx M^A \otimes M^B$.
2. Choose which probability distributions $p$ and $q$ to compare for exact and approximate $M$ (the distribution for $x$, $x_1$ or $x^A_1$, say).
3. Choose what to treat as known about $p_{ii'\cdot\cdot}$ when computing these probability distributions.
4. Choose a metric for how different the two probability distributions $p$ and $q$ are.
These four options are described in Tables III, IV and V, and we will now explore them in detail.

C. Options for approximately factoring M
Table III lists five factoring options which all have attractive features, and we will now describe each in turn.

Approximately factoring M using noising
The first option corresponds to the "noising" method used in IIT [10]: the time evolution of one part of the system ($x^A$, say) is determined from the past state $x^A_0$ alone, treating $x^B_0$ as random noise with some probability distribution $p^{(B0)}$ that is independent of $x^A_0$. In other words, we replace the initial probability distribution $p^{(0)}_{ii'}$ by the separable distribution $p^{(A0)}_i\, p^{(B0)}_{i'}$. We will now see that if we start with equation (2), i.e., the Markov equation $p^{(1)} = M p^{(0)}$, then this noising prescription gives $p^{(A1)} = M^A p^{(A0)}$ for a particular matrix $M^A$. Equation (2) states that
$$p^{(1)}_{jj'} = \sum_{ii'} M_{jj'ii'}\, p^{(0)}_{ii'},$$
and substituting the separable "noising" form of $p^{(0)}_{ii'}$ from above and summing over $j'$ gives
$$p^{(A1)}_j = \sum_i M^A_{ji}\, p^{(A0)}_i,$$
where we have defined
$$M^A_{ji} \equiv \sum_{i'j'} M_{jj'ii'}\, p^{(B0)}_{i'}.$$
IIT chooses the noise to have maximum entropy, i.e., a uniform distribution over the $n_B$ possible states of subsystem B [10]:
$$p^{(B0)}_{i'} = \frac{1}{n_B}.$$
Table III lists the $M^A$-matrix corresponding to this noising choice as well as the analogous $M^B$-matrix.

Approximately factoring M using mild noising
One drawback of this choice is that uniform distributions are undefined for continuous variables such as measured voltages, because they cannot be normalized. This means that any φ-measure based on this noising factorization is undefined and useless for continuous systems. This problem can be solved by adopting another natural choice for the noise distribution:
$$p^{(B0)}_{i'} = p_{\cdot i'\cdot\cdot},$$
i.e., simply the marginal distribution for $x^B_0$. We term this option "mild noising", since the noise is less extreme (its entropy is lower) than with the previous noising option. Table III lists the $M^A$-matrix corresponding to this mild noising choice as well as the analogous $M^B$-matrix.
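The two noising prescriptions differ only in the noise distribution fed to subsystem B, which the following minimal sketch makes explicit. It assumes a discrete system stored as `p[i, i', j, j']` with every initial state having non-zero probability; the helper names `markov_4index` and `noised_MA` and the random toy distribution are hypothetical.

```python
import numpy as np

def markov_4index(p):
    """M[j, j', i, i'] = p[i, i', j, j'] / p[i, i', ., .] (needs p[i, i', ., .] > 0)."""
    p0 = p.sum(axis=(2, 3))
    return np.transpose(p / p0[:, :, None, None], (2, 3, 0, 1))

def noised_MA(p, noise_B):
    """M^A[j, i] = sum over i', j' of M[j, j', i, i'] * noise_B[i']."""
    return np.einsum('jlik,k->ji', markov_4index(p), noise_B)

rng = np.random.default_rng(0)
nA, nB = 2, 3
p = rng.random((nA, nB, nA, nB))
p /= p.sum()

uniform_noise = np.full(nB, 1.0 / nB)     # "noising": maximum-entropy noise
marginal_noise = p.sum(axis=(0, 2, 3))    # "mild noising": marginal of x^B_0

MA_noising = noised_MA(p, uniform_noise)
MA_mild = noised_MA(p, marginal_noise)
print(MA_noising.sum(axis=0), MA_mild.sum(axis=0))   # column sums of each M^A
```

Both results are valid Markov matrices because each column of $M$ sums to one and the noise distribution is normalized.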
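TABLE III: Options for approximately factoring M. Columns: Code, Factorization method, State-dependent? Rows: n (Noising), m (Mild noising), o (Optimal), x (Optimal given $x_0$), a (Optimal given $x_0$, on average).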

Optimally factoring M
A drawback of both factorizations that we have considered so far is that they might overestimate integration: there may exist an alternative factorization that is better in the sense of giving a smaller φ.The natural way to remedy this problem is to define φ by minimizing over all factorizations.This elegantly unifies with the fact that capital Φ is defined by minimizing over all partitions of the system into two parts: we can capture both minimizations by simply saying "minimize over all factorizations", since the choice of a tensor factorization includes a choice of partition.
In practice, the definition of the optimal factorization depends on what we optimize. We discuss various options below, and identify three particularly natural choices which are listed in Table III. The first option makes the approximate probability distribution $q_{ii'jj'}$ as similar as possible to $p_{ii'jj'}$, where similarity is quantified by KL-divergence. The second option treats the present state $x_0$ as known and makes the conditional probability distribution for the future state $x_1$ as similar as possible to the correct distribution. This factorization thus depends on the state and hence on time, whereas all the others we have considered are state-independent. The third option is the factorization that minimizes this state-dependent φ on average; we will prove below that this factorization is identical to the first option.
In summary, Table III lists five factorization options that each have various attractive features; options 3 and 5 turn out to be identical. It is easy to show that if the Markov matrix $M$ is factorizable (which means that the probability distribution is separable as $p_{ii'jj'} = p^A_{ij}\, p^B_{i'j'}$), then all five factorizations coincide, all giving
$$M^A_{ji} = \frac{p^A_{ij}}{p^A_{i\cdot}}, \qquad M^B_{j'i'} = \frac{p^B_{i'j'}}{p^B_{i'\cdot}}.$$
This means that they will all agree on when φ = 0; otherwise the noising factorizations will yield higher φ than an optimized factorization.
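The coincidence of the factorizations in the separable case is easy to check numerically. The sketch below is an illustration under stated assumptions: discrete subsystems, a randomly generated separable distribution, and only the noising, mild-noising and state-independent optimal factorizations for subsystem A (the optimal one taken to be the marginal estimator $p_{i\cdot j\cdot}/p_{i\cdot\cdot\cdot}$). All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_markov(n):
    """Random n x n matrix with non-negative entries and unit column sums."""
    m = rng.random((n, n))
    return m / m.sum(axis=0)

def random_distribution(n):
    v = rng.random(n)
    return v / v.sum()

nA, nB = 2, 3
MA_true, MB_true = random_markov(nA), random_markov(nB)
pA = MA_true.T * random_distribution(nA)[:, None]   # pA[i, j] = MA[j, i] * pA0[i]
pB = MB_true.T * random_distribution(nB)[:, None]
p = np.einsum('ij,kl->ikjl', pA, pB)                # p[i, i', j, j'] = pA[i, j] * pB[i', j']

p0 = p.sum(axis=(2, 3))
M = np.transpose(p / p0[:, :, None, None], (2, 3, 0, 1))   # M[j, j', i, i']

MA_noising = np.einsum('jlik,k->ji', M, np.full(nB, 1.0 / nB))
MA_mild = np.einsum('jlik,k->ji', M, p.sum(axis=(0, 2, 3)))
MA_optimal = (p.sum(axis=(1, 3)) / p.sum(axis=(1, 2, 3))[:, None]).T

for name, MA in [('noising', MA_noising), ('mild', MA_mild), ('optimal', MA_optimal)]:
    print(name, np.allclose(MA, MA_true))
```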

D. Options for which probability distributions to compare
Table IV lists four options for which probability distributions $p$ and $q$ to compare. Arguably the most natural option is to simply compare the full distributions $p_{ii'jj'}$ and $q_{ii'jj'}$ that describe our knowledge of the system at both times (the present state and the future state). Another obvious option is to merely compare the predictions, i.e., the probability distributions $p_{\cdot\cdot jj'}$ and $q_{\cdot\cdot jj'}$ for the future state. A third interesting option is to compare merely the predictions for one of the two subsystems (which we without loss of generality can take to be subsystem A), thus comparing $p_{\cdot\cdot j\cdot}$ and $q_{\cdot\cdot j\cdot}$.
Generally, the less we compare, the easier it is to get a low φ-value. To see this, consider a system where A affects B but B has no effect on A. We could, for example, consider A to be photoreceptor cells in your retina and B to be the rest of your brain. Then the second comparison option ("f") in Table IV would give φ > 0 because we predict the future of your brain worse if we ignore the information flow from your retina, while the third comparison option ("a") in the table would give φ = 0 because the rest of your brain does not help predict the future of your retina. In other words, comparison option "a" makes φ vanish for afferent pathways, where information flows only inward toward the rest of the system.
IIT argues that any good φ-measure indeed should vanish for afferent pathways, because a system can only be conscious if it can have effects on itself: other systems that it is affected by without affecting will act merely as parts of its unconscious outside world [10]. Analogously, IIT argues that any good φ-measure should vanish also for efferent pathways, where information flows only outward away from the rest of the system. The argument is that other systems that the conscious system affects without being affected by will again be unconscious, acting merely as unconscious parts of the outside world as far as the conscious system is concerned.
Option "p" in Table IV has this property of φ vanishing for efferent pathways. It is simply the time-reverse of option "a", quantifying the ability of $x^A_1$ to determine its past cause $x^A_0$ instead of quantifying the ability of $x^A_0$ to determine its future effect $x^A_1$. To formalize this, consider that there is nothing in the probability distribution $p_{ii'jj'}$ that breaks time-reversal symmetry and says that causation goes from $t_0$ to $t_1$ rather than vice versa. In complete analogy with our formalism above, we can therefore define a time-reversed Markov process $\tilde M$ whereby the future determines the past according to the time-reverse of equation (2):
$$p^{(0)} = \tilde M p^{(1)},$$
where equations (6) and (9) get replaced by
$$\tilde M_{ii'jj'} = \frac{p_{ii'jj'}}{p_{\cdot\cdot jj'}}$$
and
$$\tilde q_{ii'jj'} \equiv \tilde M^A_{ij}\, \tilde M^B_{i'j'}\, p_{\cdot\cdot jj'}.$$

TABLE IV: Different options for which probability distributions $p$ and $q$ to compare. The last three columns specify the formula for $q$ for the three conditioning options we consider: when the state $x_0$ is unknown (u), has a separable probability distribution (s) and is known (k), respectively.
This time reversal symmetry doubles the number of q-options we could list in Table IV to six in total, augmenting $q_{ii'jj'}$, $q_{\cdot\cdot jj'}$ and $q_{\cdot\cdot j\cdot}$ by $\tilde q_{ii'jj'}$, $\tilde q_{\cdot\cdot jj'}$ and $\tilde q_{\cdot\cdot j\cdot}$. In the interest of brevity, we have chosen to only list $\tilde q_{\cdot\cdot j\cdot}$, because of its ability to kill φ for efferent pathways; the formulas for the two omitted options are trivially analogous to those listed.

E. Options for what to treat as known about the current state
Above we listed options for which probabilities p and q to compare to compute φ.To complete our specification of these probabilities, we need to choose between various options for our knowledge of the present state; the three rightmost columns of Table IV correspond to three interesting choices.
The first option is where the state is unknown, described simply by the probability distribution we have used above:
$$p^{(0)}_{ii'} = p_{ii'\cdot\cdot}. \tag{18}$$
This corresponds to us knowing $M$, the mechanism by which the state evolves, but not knowing its current state $x_0$. Note that a generic Markov process eventually converges to a unique stationary state $p = p^{(0)} = p^{(1)}$ which, since it satisfies $Mp = p$, can be computed directly from $M$ as the unique eigenvector whose eigenvalue is unity. This means that if we consider a system that has been evolving for a sufficiently long time, its full two-time distribution $p_{ii'jj'}$ is determined by $M$ alone; conversely, $p_{ii'jj'}$ determines $M$ through equation (6). Alternatively, if $p_{ii'jj'}$ is measured empirically from a time-series $x_t$ which is then used to compute $M$, we can use equation (18) to describe our knowledge of the state at a random time.
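The stationary state mentioned above can be obtained with a standard eigen-decomposition. This is a minimal sketch, assuming a small column-stochastic matrix with a unique stationary state; the function name and the 2 × 2 example matrix are illustrative.

```python
import numpy as np

def stationary_distribution(M):
    """Eigenvector of the column-stochastic matrix M with eigenvalue closest to 1,
    normalized to sum to one. Assumes the stationary state is unique."""
    eigenvalues, eigenvectors = np.linalg.eig(M)
    k = np.argmin(np.abs(eigenvalues - 1.0))
    v = np.real(eigenvectors[:, k])
    return v / v.sum()            # dividing by the sum also fixes the overall sign

# Hypothetical two-state system; each column sums to one.
M = np.array([[0.9, 0.2],
              [0.1, 0.8]])

p_stat = stationary_distribution(M)

# Two-time distribution implied by M alone: p2[i, j] = M[j, i] * p_stat[i].
p2 = (M * p_stat[None, :]).T
print(p_stat, p2.sum())
```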
A second option is to assume that we know the initial probability distributions for $x^A_0$ and $x^B_0$, but know nothing about any correlations between them. This corresponds to replacing equation (18) by the separable distribution
$$p^{(0)}_{ii'} = p_{i\cdot\cdot\cdot}\, p_{\cdot i'\cdot\cdot},$$
and can be advantageous for φ-measures that would conflate integration with initial correlations between the subsystems.
A third option, advocated by IIT [10], is to treat the current state as known:
$$p^{(0)}_{ii'} = \delta_{ik}\, \delta_{i'k'},$$
i.e., we know with certainty that the current state is $x_0 = kk'$ for some constants $k$ and $k'$. IIT argues that this is the correct option from the vantage point of a conscious system which, by definition, knows its own state.
A natural fourth option is a more extreme version of the first: treating the state not merely as unknown, with $p^{(0)}$ given by its ensemble distribution, but completely unknown, with a uniform distribution:
$$p^{(0)}_{ii'} = \frac{1}{n_A n_B},$$
where $n_A$ denotes the number of states of subsystem A, by analogy with $n_B$. Although straightforward enough to use in our formulas, we have chosen not to include this option in Table IV because it is rather inappropriate for most physical systems.
For continuous variables such as voltages, it becomes undefined. For brains, such maximum-entropy states never occur: they would have typical neurons firing about half the time, corresponding to much more extreme "on" behavior than during an epileptic seizure. Finally, please note that if we choose to determine the past rather than the future (the "p"-option from the previous section and Table IV), then all the choices we have described should be applied to $p^{(1)}$ rather than $p^{(0)}$.

F. Options for comparing probability distributions
The options in the past three sections uniquely specify two probability distributions $p$ and $q$, and we want the integration φ to quantify how different they are from one another:
$$\phi \equiv d(p, q)$$
for some distance measure $d$ that is larger the worse $q$ approximates $p$. There are a number of properties that we may consider desirable for $d$ to quantify integration:
1. Positivity: $d(p, q) \ge 0$, with equality if and only if $p = q$.
2. Monotonicity: The more different $q$ is from $p$ in some intuitive sense, the larger $d(p, q)$ gets.
3. Interpretability: $d(p, q)$ can be intuitively interpreted, for example in terms of information theory.
4. Tractability: $d(p, q)$ is easy to compute numerically. Ideally, the optimal factorizations from Section II C can be found analytically rather than through time-consuming numerical minimization.
Any distance measure $d$ meets the mathematical requirements of being a metric on the space of probability distributions if it obeys positivity, symmetry and the triangle inequality $d(p, q) \le d(p, r) + d(r, q)$. Table V lists five interesting probability distribution distance measures $d(p, q)$ from the literature together with their definitions and properties. All these measures are seen to have the positivity and monotonicity properties, and all except the first are also symmetric and true metrics. We will now discuss them one by one in greater detail.
The distance $d_{KL}$ is the Kullback-Leibler divergence, and measures how many bits of information are lost when $q$ is used to approximate $p$, in the sense that if you developed an optimal data compression algorithm to compress data drawn from a probability distribution $q$, it would on average require $d_{KL}(p, q)$ more bits to compress data drawn from a probability distribution $p$ than if the algorithm had been optimized for $p$ [16].
$d_1$ and $d_2$ measure the distance between the vectors $p$ and $q$ using the $L_1$-norm and $L_2$-norm, respectively. The former is particularly natural for probability distributions $p$, since they all have $L_1$-norm of unity: $\sum_i |p_i| = p_\cdot = 1$. The measure $d_H$ is the Hilbert-space distance: if, for each probability distribution, we define a corresponding wavefunction $\psi_i \equiv p_i^{1/2}$, then all wavefunctions lie on a unit hypersphere since they all have unit length: $\langle\psi|\psi\rangle = p_\cdot = 1$. The distance $d_H$ is simply the angle between two wavefunctions, i.e., the distance along the great circle on the hypersphere that connects the two, so $d_H(p, q) \le \pi/2$. It is also the geodesic distance of the Fisher metric, hence a natural "coordinate free" distance measure on the manifold of all probability distributions.
The measure $d_{EM}$ is the Earth-Mover's distance [17]. If we imagine piles of earth scattered across the space $x$, with $p(x)$ specifying the fraction of the earth that is in each location, then $d_{EM}$ is the average distance that you need to move earth to turn the distribution $p(x)$ into $q(x)$. The quantity $d_{ij}$ in the definition in Table V specifies the distance between points $i$ and $j$ in this space. For example, if $x$ is a 3D Euclidean space, this may be chosen to be simply the Euclidean metric, while if $x$ is a bit string, $d_{ij}$ may be chosen to be the $L_1$ "Manhattan distance", i.e., the number of bit flips required to transform one bit string into another. IIT 3.0 argues that the earth mover's distance $d_{EM}$ is the most appropriate measure $d$ on conceptual grounds (whereas IIT 2.0 advocated $d_{KL}$). Unfortunately, $d_{EM}$ rates poorly on the tractability criterion. Its definition involves a linear programming problem which needs to be solved numerically, and even with the fastest algorithms currently available, the computation grows faster than quadratically with the number of system states, which in turn grows exponentially with the number of bits. For continuous variables $x$, the number of states and hence the computational time is formally infinite.
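The four distance measures that have closed forms can be sketched in a few lines. Because the definitions in Table V are not reproduced in the text, the formulas below are assumptions based on the standard definitions and on the verbal descriptions above: KL-divergence in bits, the $L_1$ and $L_2$ norms of $p - q$ without any extra normalization, and the angle between the square-root "wavefunctions". The earth mover's distance is omitted because it requires solving a linear program.

```python
import numpy as np

def d_kl(p, q):
    """Kullback-Leibler divergence in bits; assumes q > 0 wherever p > 0."""
    p, q = np.ravel(p), np.ravel(q)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def d_1(p, q):
    """L1 distance between the probability vectors (assumed unnormalized)."""
    return float(np.abs(np.ravel(p) - np.ravel(q)).sum())

def d_2(p, q):
    """L2 (Euclidean) distance between the probability vectors."""
    return float(np.sqrt(np.sum((np.ravel(p) - np.ravel(q)) ** 2)))

def d_h(p, q):
    """Angle between the wavefunctions sqrt(p) and sqrt(q); at most pi/2."""
    overlap = np.sum(np.sqrt(np.ravel(p) * np.ravel(q)))
    return float(np.arccos(np.clip(overlap, 0.0, 1.0)))

# Hypothetical pair of distributions over two states.
p_ex = np.array([0.5, 0.5])
q_ex = np.array([0.25, 0.75])

for d in (d_kl, d_1, d_2, d_h):
    print(d.__name__, d(p_ex, q_ex), d(q_ex, p_ex))
```

Running the loop shows the asymmetry of the KL-divergence alongside the symmetry of the other three measures.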

A. Optimal factorization with dKL
Our taxonomy of integration measures is determined by four choices: of factorization, variable selection, conditioning and distance measure. Although we have now explored these four choices one at a time, there are important interplays between them that we must examine. First of all, the three optimal factorization options in Table III depend on what is being optimized, so let us now explore which of these optimizations are feasible and interesting to perform in practice and let us find out what the corresponding factorizations and φ-measures are.
The mathematics problem we wish to solve is
$$\phi \equiv \min_{M^A,\, M^B} d(p, q),$$
i.e., minimizing $d(p, q)$ over $M^A$ and $M^B$ given the constraints that $M^A$ and $M^B$ are Markov matrices:
$$M^A_{ji} \ge 0, \qquad M^B_{j'i'} \ge 0, \qquad M^A_{\cdot i} = 1, \qquad M^B_{\cdot i'} = 1.$$
Table IV specifies how $q$ depends on $M^A$ and $M^B$, while Table V specifies the options for computing the distance measure $d$. We enforce the column sum constraints using Lagrange multipliers, minimizing
$$L \equiv d(p, q) - \sum_j \lambda_j \left(M^A_{\cdot j} - 1\right) - \sum_j \mu_j \left(M^B_{\cdot j} - 1\right), \tag{24}$$
and need to check afterwards that all elements of $M^A$ and $M^B$ come out to be non-negative (we will see that this is indeed the case).

TABLE V: Different options for measuring the difference $d$ between two probability distributions $p$ and $q$. In the text, we considered options where $p$ and $q$ had one, two or four indices, but in this table, we have for simplicity combined all indices into a single Greek index α.
As mentioned, numerical tractability is a key issue for integration measures. This means that it is valuable if the Lagrange minimization can be rapidly solved analytically rather than slowly by numerical means, since this needs to be done separately for large numbers of possible system partitions. We find that there is only one $d$-option out of the above-mentioned five for which the optimization over $M$-factorizations can be solved analytically: the KL-divergence $d_{KL}$. The runner-up for tractability is $d_2$, for which everything can be solved analytically except for a final column normalization step, but the resulting formulas are cumbersome and unilluminating, falling foul of the interpretability criterion. Although $d_{KL}$ lacks the symmetry property, it has the above-mentioned positivity, monotonicity and interpretability properties, and we will now show that it also has the tractability property.
Let us begin with the q-options in the upper left corner of Table IV, i.e., comparing the two-time distributions treating the present state as unknown. Substituting equation (9) into the definition of $d_{KL}$ from Table V gives
$$d_{KL}(p, q) = \sum_{ii'jj'} p_{ii'jj'} \log\frac{p_{ii'jj'}}{M^A_{ji}\, M^B_{j'i'}\, p_{ii'\cdot\cdot}} = S(x_0) - S(x_0, x_1) - \sum_{ij} p_{i\cdot j\cdot} \log M^A_{ji} - \sum_{i'j'} p_{\cdot i'\cdot j'} \log M^B_{j'i'}, \tag{25}$$
where the entropy for a random variable $x$ with probability distribution $p$ is given by Shannon's formula [18]
$$S(x) \equiv -\sum_i p_i \log p_i.$$
To avoid a profusion of notation, we will often write as the argument of $S$ a random variable rather than its probability distribution. For convenience, we will take all logarithms to be in base 2 for discrete distributions (so that entropies are measured in units of bits) and in base $e$ for continuous Gaussian distributions (so that equations get simpler). In the latter case, where the entropy is based on the natural logarithm, entropy is measured in "nits" which equal $1/\ln 2 \approx 1.44$ bits. Substituting equation (25) into equation (24) and requiring vanishing derivatives with respect to $M^A_{ij}$, $M^B_{i'j'}$, $\lambda_j$ and $\mu_j$ shows that the solution to our minimization problem is
$$M^A_{ji} = \frac{p_{i\cdot j\cdot}}{p_{i\cdot\cdot\cdot}}, \qquad M^B_{j'i'} = \frac{p_{\cdot i'\cdot j'}}{p_{\cdot i'\cdot\cdot}}. \tag{27}$$
We recognize these equations as simply the Markov matrix estimator from equation (4) applied separately to subsystems A and B after marginalizing over the other system. Substituting this back into equation (9) gives
$$q_{ii'jj'} = \frac{p_{i\cdot j\cdot}\; p_{\cdot i'\cdot j'}\; p_{ii'\cdot\cdot}}{p_{i\cdot\cdot\cdot}\; p_{\cdot i'\cdot\cdot}}.$$
Substituting this back into the definition of $d_{KL}$ gives the extremely simple result that the integration is
$$\phi^{otuk} = I(x^A, x^B) - I(x^A_0, x^B_0), \tag{29}$$
where $x^A \equiv (x^A_0, x^A_1)$ and $x^B \equiv (x^B_0, x^B_1)$ denote the two-time states of the subsystems, and where the mutual information between two random variables is given in terms of entropies by the standard definition
$$I(x^A, x^B) \equiv S(x^A) + S(x^B) - S(x). \tag{30}$$
Since we will be deriving a large number of different φ-measures that we do not wish to conflate with one another, we superscript each one with four code letters denoting the four taxonomical choices that define it. These letter codes are
1. factorization: n/m/o/x/a
2. comparison: t/f/a/p
3. conditioning: u/s/k
4. measure: k/1/2/h/e
and are defined in Tables III, IV and V. For example, the integration measure $\phi^{otuk}$ from equation (29) denotes optimized (o) factorization comparing the two-time (t) probability distributions with the current state unknown (u) and KL-divergence (k).
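The closed-form result can be checked numerically. The sketch below, which assumes a discrete system stored as `p[i, i', j, j']` with strictly positive entries, builds the marginal-based factors, evaluates the KL-divergence between $p$ and the resulting $q$, and compares it with the entropy expression $I(x^A, x^B) - I(x^A_0, x^B_0)$. The function names and the random test distribution are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.ravel(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def optimal_factors(p):
    """Marginal Markov estimators: MA[j, i] = p[i,.,j,.] / p[i,.,.,.], likewise MB."""
    pA = p.sum(axis=(1, 3))       # indexed [i, j]
    pB = p.sum(axis=(0, 2))       # indexed [i', j']
    MA = (pA / pA.sum(axis=1, keepdims=True)).T
    MB = (pB / pB.sum(axis=1, keepdims=True)).T
    return MA, MB

def phi_via_kl(p):
    """KL-divergence (bits) between p and its separable approximation q."""
    MA, MB = optimal_factors(p)
    p0 = p.sum(axis=(2, 3))
    q = np.einsum('ji,lk,ik->ikjl', MA, MB, p0)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def phi_via_entropies(p):
    """I(x^A, x^B) - I(x^A_0, x^B_0), with x^A and x^B the two-time subsystem states."""
    p0 = p.sum(axis=(2, 3))
    i_two_time = entropy(p.sum(axis=(1, 3))) + entropy(p.sum(axis=(0, 2))) - entropy(p)
    i_initial = entropy(p0.sum(axis=1)) + entropy(p0.sum(axis=0)) - entropy(p0)
    return i_two_time - i_initial

rng = np.random.default_rng(0)
p = rng.random((2, 3, 2, 3))
p /= p.sum()

print(phi_via_kl(p), phi_via_entropies(p))
```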
Although we derived this optimal factorization by comparing the two-time distribution (option t) for an unknown state (option u), an analogous calculation leads to the exact same optimal factorization for the options a+u, f+s and a+s. The option t+s is undefined and the option f+u gives messy equations I have been unable to solve analytically. It is therefore reasonable to view equation (27) as the optimal factorization when the state is unknown (option o), and for the remainder of this paper, we will simply define the o-option as using the factorization given by equation (27).
Note that our result in equation (29) involves a time-asymmetry, singling out $t_0$ rather than $t_1$ in the second term. This is because we chose to interpret our Markov process as operating forward in time, determining the state at $t_1$ from the state at $t_0$. As we discussed in Section II D, we could equally well have done the opposite, using the Markov process $\tilde M$ operating backward in time, which would have yielded the alternative integration measure
$$\tilde\phi^{otuk} = I(x^A, x^B) - I(x^A_1, x^B_1).$$
In practice, one usually estimates all statistical properties from a time-series that is assumed to be stationary. This means that $I(x^A_0, x^B_0) = I(x^A_1, x^B_1)$, so that these two integration measures become identical.

B. Comparison with the Barrett/Seth integration measure
It is interesting to compare our result in equation (29) with the popular integration measure proposed by Barrett & Seth [13],
$$\phi^B \equiv I(x_0, x_1) - I(x^A_0, x^A_1) - I(x^B_0, x^B_1). \tag{32}$$
The intuition behind this definition is to take the amount of information that a system predicts about its future and subtract off the information predicted by both of its subsystems. Unfortunately, the result can sometimes go negative, violating the desirable positivity property and making $\phi^B$ difficult to interpret. Consider the simple example of two independent bits that never change. If they start out perfectly correlated, then they will remain perfectly correlated, giving
$$\phi^B = 1 - 1 - 1 = -1 \text{ bit}.$$
By substituting equation (30) into equations (29) and (32), we find that
$$\phi^{otuk} = \phi^B + I(x^A_1, x^B_1),$$
where $I(x^A_1, x^B_1) = I(x^A_0, x^B_0)$ for the stationary time-series discussed above. In other words, we can make the Barrett/Seth measure non-negative by adding back any initial mutual information between the two subsystems. When this is done, it becomes the integration measure we derived, therefore having a simple information-theoretic interpretation: it is the KL-divergence between the actual probability distribution $p$ and the best separable approximation, which is guaranteed to be non-negative.
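The two-bit example can be reproduced directly from entropies. This sketch assumes the Barrett/Seth measure takes the form "whole-system mutual information minus the two subsystem mutual informations" described above, and stores the joint distribution as `p[i, i', j, j']` with axes ordered $(x^A_0, x^B_0, x^A_1, x^B_1)$; the helper names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of an array of probabilities."""
    p = np.ravel(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def marginal_entropy(p, keep):
    """Entropy of the variables living on the axes listed in `keep`."""
    drop = tuple(a for a in range(p.ndim) if a not in keep)
    return entropy(p.sum(axis=drop) if drop else p)

def mutual_information(p, axes_x, axes_y):
    """I(X, Y) in bits for variables living on the given axes of the joint array p."""
    return (marginal_entropy(p, axes_x) + marginal_entropy(p, axes_y)
            - marginal_entropy(p, tuple(axes_x) + tuple(axes_y)))

# Two perfectly correlated bits that never change: only 0000 and 1111 occur.
p = np.zeros((2, 2, 2, 2))
p[0, 0, 0, 0] = 0.5
p[1, 1, 1, 1] = 0.5

A0, B0, A1, B1 = 0, 1, 2, 3
phi_B = (mutual_information(p, (A0, B0), (A1, B1))
         - mutual_information(p, (A0,), (A1,))
         - mutual_information(p, (B0,), (B1,)))
phi_M = (mutual_information(p, (A0, A1), (B0, B1))
         - mutual_information(p, (A0,), (B0,)))

print(phi_B, phi_M)
```

The subsystem correlations alone account for the negative value: adding back the mutual information between the two bits brings the measure to zero, as expected for two parts that do not interact.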

C. Optimal state-dependent factorization
Let us now turn to factorization option "x", optimized knowing the current state. Consider some conscious observer (perhaps the system itself) who knows nothing about the system except its dynamics (encoded in $M$) and its state at the present instant, encoded in $x_0 = kk'$. What can this observer say about the system state at earlier and later times? How integrated will this observer feel that the system is? To answer this question, we simply want to find the best approximate factorization of the conditional future state $M_{jj'kk'}$ (or the past state $\tilde M_{kk'ii'}$), where $k$ and $k'$ are known constants.
To gain intuition for this, let us temporarily write this conditional distribution as $p_{ii'}$, suppressing the known parameters $kk'$ for simplicity. Given an arbitrary bivariate probability distribution $p_{ii'}$, what is the best separable approximation $q_{ii'} \equiv a_i b_{i'}$ in the sense that it minimizes $d_{KL}(p, q)$? By minimizing $d_{KL}(p, q)$ using Lagrange multipliers, one easily obtains the long-known result that $a_i = p_{i\cdot}$, $b_{i'} = p_{\cdot i'}$ and $d_{KL}(p, q) = I$, the mutual information of p. In other words, even if we had never heard of marginal distributions or mutual information, we could derive them all from $d_{KL}$: the best factorization simply uses the marginal distributions, and the mutual information of a bivariate distribution is simply the KL-measure of how non-separable it is.
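The claim that the marginals give the best separable approximation can be spot-checked numerically. The following is a minimal sketch, not a proof: it draws a random bivariate distribution, compares the KL-divergence to the product of marginals with the mutual information computed from entropies, and then tries many other randomly chosen separable approximations. The function names and the 4 × 5 table size are arbitrary choices for this illustration.

```python
import numpy as np

def kl_bits(p, q):
    """KL-divergence d_KL(p, q) in bits for arrays of equal shape."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def entropy_bits(v):
    """Shannon entropy in bits of a probability vector."""
    v = v[v > 0]
    return float(-np.sum(v * np.log2(v)))

rng = np.random.default_rng(0)
p = rng.random((4, 5))
p /= p.sum()                       # random bivariate distribution p_{ii'}

a = p.sum(axis=1)                  # marginal p_{i.}
b = p.sum(axis=0)                  # marginal p_{.i'}
kl_marginals = kl_bits(p, np.outer(a, b))
mutual_info = entropy_bits(a) + entropy_bits(b) - entropy_bits(p.ravel())

# Any other separable q = a_try b_try should do worse than the marginals.
excess = []
for _ in range(1000):
    a_try = rng.random(4)
    a_try /= a_try.sum()
    b_try = rng.random(5)
    b_try /= b_try.sum()
    excess.append(kl_bits(p, np.outer(a_try, b_try)) - kl_marginals)
smallest_excess = min(excess)
print(kl_marginals, mutual_info, smallest_excess)
```

The excess is positive for every trial because $d_{KL}(p, a'b') = d_{KL}(p, ab) + d_{KL}(a, a') + d_{KL}(b, b')$ when a and b are the marginals of p.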
This means that the optimal factorization given $k$ and $k'$ is simply the one giving the marginal conditional distributions, and the corresponding integration is simply the mutual information of the conditional distribution (equation 35); φ xtkk is identical. We can alternatively obtain this result directly from equation (29) by noting that the $I(x^A_0, x^B_0)$ term vanishes now that the state $x_0$ is known. This result highlights a striking and arguably undesirable feature of measures based on the x-factorization option: they vanish for any deterministic system! If the system is deterministic and the present state $x_0$ is known, then the future state $x_1$ is also known, so all entropies in equation (35) vanish and we obtain φ = 0. With φ-measures based on x-factorization, the only source of integration is therefore correlated noise generated by the system.

D. Minimizing integration on average
Let us now turn to our final factorization option, "a", where we pick the state-independent factorization that minimizes integration on average. Given the present state $x_0 = kk'$, let us compare the exact and approximate future probability distributions by computing their KL-divergence φ = $d_{KL}(p, q)$. The answer clearly depends on the present state $kk'$, and we saw in the previous section what happens when we minimize separately for each state $kk'$. Let us now instead average $d_{KL}(p, q)$ over all current states and find the state-independent factorization that minimizes this average. Substituting equation (6) shows that the averaged expression is identical to that from equation (25), so minimizing it gives the exact same optimal factors $M_A$ and $M_B$ and the exact same minimum φ. The comparison option "t" gives the same result as well, so in conclusion, although they appear quite different from their definitions, the factorization options "o" and "a" are in fact identical.

E. The full taxonomy
Now that we have derived the explicit form of all our factorization options, we can complete our integration measure classification. Our taxonomy is determined by four choices: of factorization (n/m/o/x/a), variable selection (t/f/a/p), conditioning (u/s/k) and distance measure (k/1/2/h/e). Although this nominally gives 5×4×3×5 = 300 different integration measures, most of these options turn out to be zero, undefined or identical to other options.² Whereas there are strong interactions between the factorization, variable selection and conditioning, we can freely choose any of the 5 distance measures independently of the other choices without changing whether φ vanishes or is well-defined. We consider the option k (KL-divergence) by default below since it results in the simplest and most intuitive formulas; the formulas for the other options are straightforward to derive by combining Tables III, IV and V. This leaves us with only the 20 separate options shown in Table II to consider.
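The nominal count can be reproduced by enumerating the four-letter labels directly. This is a small sketch in which the label order (factorization, variable selection, conditioning, distance measure) is taken from names such as φ otuk and φ nakk used in the text; the variable names are hypothetical.

```python
from itertools import product

factorizations = "nmoxa"   # n/m/o/x/a
variables = "tfap"         # t/f/a/p
conditionings = "usk"      # u/s/k
distances = "k12he"        # k/1/2/h/e

# Every nominal measure is one letter from each of the four choices.
labels = ["".join(combo) for combo in
          product(factorizations, variables, conditionings, distances)]
print(len(labels), labels[:5])
```

The enumeration only counts nominal options; deciding which of them vanish, are undefined or coincide requires the case-by-case analysis of the text.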

F. Which integration measures are best?
Table I summarizes the desirable and undesirable traits for each of these integration measures, showing that merely a handful lack any major drawbacks.Let us now rate the various options in more detail.
For the choice of probability distance measure (k/1/2/h/e), option "e" (the Earth-Mover's distance $d_{EM}$ used in φ 3.0 [11]) remains an attractive candidate for discrete distributions with a small number of bits, but is otherwise computationally unfeasible as we discussed above. All options in Table I except φ 3.0 therefore use option "k" (the KL-divergence). Note that whether it is an advantage for the probability distance measure to be symmetric (as advocated in [11]) depends on the interpretational context. For example, there is nothing asymmetric about the mutual information that ends up defining φ M in Table II.
For the choice of factorization (n/m/o/x/a), we can quickly dispense with option "a" (for being identical to "o") and option "x" (because it has the highly undesirable property of always vanishing for deterministic systems). Which of the remaining options (n/m/o) is preferable depends on other choices. If one wishes to use a distance measure other than the KL-divergence, then the noising options "n" or "m" are computationally preferable, since the optimal factorization "o" can no longer be found analytically. Otherwise, "m" is arguably inferior to "o" because it is no simpler to evaluate and can overestimate the integration as described above. If one has a philosophical preference for the factorization depending only on the mechanism M and not on any other information about state probabilities, then "n" is the only choice. If one wishes to consider continuous systems, on the other hand, "n" is undefined. In summary, the best factorizations are therefore "o" and "n", depending on one's preferences. In practice, numerical experiments show that "n", "m" and "o" usually give quite similar φ-values for a wide range of M-matrices and probability distributions, so the choice between the three is a relatively minor one.
2 ... φ nas*, φ nak*, φ nps*, φ nap*, φ mas*, φ mak*, φ mps* and φ map*, where * denotes any option for the distance measure. For o-factorization, it is straightforward to show that φ oau* = φ oas* = φ opu* = φ ops* = 0 and φ otk* = φ ofk*. For x-factorization, φ xt** is undefined and one easily shows that φ xak* = φ xpk* = 0, φ xau* = φ xaf* and φ xpu* = φ xpf*. We interpret k-conditioning as $x_0$ being known for o-factorization and as $x^A_0$ being known for noising factorizations, since the reverse options vanish and are undefined, respectively.
Turning now to the choice of variable selection and conditioning, Table I shows that many of the otherwise well-defined integration measures from Table II have serious flaws.
Neither φ otsk nor φ ofsk is guaranteed to vanish for separable systems, which means that we cannot in good conscience interpret them as measures of integration. Numerical experiments show that φ nask, φ npsk, φ mask and φ mpsk tend to be extremely small in practice (φ mask is plotted in Figure 2). This is because they differ little from the corresponding measures using optimal factorization (φ oask and φ opsk), which always vanish. In other words, they are not really measures of integration, merely measures of how suboptimal the factorizations "n" and "m" are. For brevity, we have included merely three of these six flawed measures in Table I.
Figure 2 shows that φ ofuk also tends to be much smaller than some other integration measures. We can intuitively understand this by recalling that φ oauk = 0, which means that optimal factorization lets us predict the future marginal distributions for A and B perfectly. Since φ ofuk quantifies the inability of optimal factorization to predict the full future distribution, we expect that it will at most be of the order of $I(x^A_1, x^B_1)$, the extent to which this distribution is not separable (determined by its marginal distributions). For randomly generated probability distributions (as in Figure 2), one can show that $I(x^A_1, x^B_1) \to 1 - 1/(2\ln 2) \approx 0.28$ bits in the limit where $n \to \infty$, and numerical experiments indicate that φ ofuk is never much larger than this value for any p.
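The quoted limit of about 0.28 bits can be illustrated with a short numerical experiment. The sketch below assumes, as for the bottom panel of Figure 2, that all elements of an n × n bivariate distribution are drawn independently from a uniform distribution and then normalized; the function names and the particular values of n are choices made for this example only.

```python
import numpy as np

def mutual_information_bits(p):
    """Mutual information in bits of a 2-D joint probability table."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (px * py)[mask])))

def random_joint(n, rng):
    """n x n table of independent uniform draws, normalized to sum to one."""
    p = rng.random((n, n))
    return p / p.sum()

rng = np.random.default_rng(1)
limit = 1.0 - 1.0 / (2.0 * np.log(2.0))    # 1 - 1/(2 ln 2)

estimates = {n: mutual_information_bits(random_joint(n, rng))
             for n in (4, 16, 64, 256)}
for n, value in estimates.items():
    print(n, round(value, 4), round(limit, 4))
```

For large n the marginals become nearly uniform, so the mutual information is dominated by the average of $2u \log_2 2u$ over a uniform variable u, which evaluates to $1 - 1/(2\ln 2)$.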
Dispensing with flawed/problematic φ-measures narrows our list of remaining top candidates to merely eight: φ otuk, φ ofkk, φ oakk, φ opkk, φ nakk, φ npkk, φ makk and φ mpkk. Moreover, the last six can be elegantly combined into merely three even better ones. As we discussed above, they have the advantage that they vanish for either afferent or efferent systems.
By following the prescription of [10] and taking the minimum of two such complementary measures, we can construct an even better one that vanishes for both afferent and efferent systems. All three of these improved measures are listed in Table II. The first is φ 2.0 ≡ min{φ nakk, φ npkk}, corresponding to the measure of IIT2.0. The second is φ 2.0′ ≡ min{φ makk, φ mpkk}, which has the advantage of remaining defined even for continuous variables. The third is φ 2.0′′ ≡ min{φ oakk, φ opkk}, which uses the optimal factorization.
FIG. 2: Numerical comparison of different integration measures as a function of the subsystem size n, averaged over 3,000 random trials. In the bottom panel, all elements of p are independently drawn from a uniform distribution and normalized to sum to unity. In the top panel, only $p^{(0)}$ is randomly generated, and M is defined so as to swap the two subsystems, i.e., $M^{jj'}_{ii'} = \delta_{ij'}\delta_{i'j}$.

G. How large can φ get?
In summary, our taxonomy of φ-measures produces merely a handful of truly attractive options: φ 2.0, φ 2.0′, φ 2.0′′, φ 3.0, φ M and φ M kk. Figure 2 shows examples of what they evaluate to numerically. The lower panel shows that for randomly generated probability distributions, none of them exceed $1 - 1/(2\ln 2) \approx 0.28$ bits on average, which as mentioned above is the mutual information in a random bivariate distribution. However, φ 2.0, φ 2.0′, φ 2.0′′, φ M and φ M kk can get arbitrarily large for some systems, as illustrated in the top panel, growing logarithmically with the size n of the subsystems A and B. In other words, the maximum integration is of order the number of subsystem bits. For the example shown where the dynamics merely swaps the two subsystems, we obtain φ 2.0 = $\log_2 n$, because noising gives $M_A = 1/n$, $q = 1/n^2$ and p is a Kronecker δ. φ M and φ M kk are seen to give about twice the integration for this example.
Note that this dynamics M that merely swaps the subsystems has such a large φ-value only for this particular cut that separates the systems being swapped. There is a different cut where φ = 0: simply define the new subsystems A' and B' to be the first and second halves of the A and B-systems. The swapping can be carried out internally within A' and B', revealing that there is no integration and upper-case Φ = 0.
However, there are plenty of systems for which even the true integration Φ grows like the number of subsystem bits, $\log_2 n$. A simple example accomplishing this (in the spirit of the random coding example in [12]) is when the $n^4$ probabilities $p_{ii'jj'}$ are all set to zero except for a randomly selected subset of $n^2$ of them that are set to $1/n^2$. Now φ M ∼ $\log_2 n$ even when minimized over all bipartitions of the $2\log_2 n$ bits in the system.

IV. THE n → ∞ LIMIT OF CONTINUOUS VARIABLES
All our previous results are fully general, applying regardless of whether the variables are discrete (such as bits that equal zero or one) or continuous (such as voltages or other variables measured in fMRI, EEG, MEG or electrophysiology studies). We can view the latter as the n → ∞ limit of the former, since a single real number can be represented as an infinite string of bits. In this section, we will focus on the continuous case and see how our previous formulas can be greatly simplified by assuming Gaussianity. We therefore replace $i$, $i'$, $j$ and $j'$ in all our formulas by $x^A_0$, $x^B_0$, $x^A_1$ and $x^B_1$, respectively, and replace all sums by integrals.

A. How Gaussianity gives linearity
To make things tractable, we will make one strong but very useful assumption: that x has a Gaussian distribution. The most general d-dimensional multivariate Gaussian distribution is parametrized by its mean vector $m \equiv \langle x \rangle$ and covariance matrix $T \equiv \langle xx^t \rangle - mm^t$ and takes the form
$$g(x; m, T) \equiv \frac{1}{(2\pi)^{d/2} (\det T)^{1/2}} \exp\left[-\frac{1}{2}(x - m)^t T^{-1} (x - m)\right], \qquad (38)$$
so we are making the assumption that there is some m and T such that p(x) = g(x; m, T). Let us write m and T in block form (equation 39), where $m_i$ and $C_i$ are the mean and covariance of $x_i$, respectively.
Interpreting the sum in the denominator of equation (6) as an integral and evaluating it³ gives equation (43) for the conditional distribution of $x_1$ given $x_0$. This encodes the well-known result that the conditional distribution $x_1|x_0$ for Gaussian variables is Gaussian with mean $m_1 + BC_0^{-1}(x_0 - m_0)$ and covariance matrix $C_1 - B^t C_0^{-1} B$. These equations embody a remarkable simplicity that we can exploit. First of all, the covariance matrix Σ is independent of $x_0$, which allows us to interpret $x_1$ as simply a function of $x_0$ plus a random noise vector n that is independent of $x_0$. Second, this function is affine, involving simply a linear term plus a constant. In other words, we can write $x_1$ as an affine function of $x_0$ plus n (equation 46), where the noise vector n has zero mean and covariance matrix Σ. It is worth reflecting on how remarkable this is, since it is easy to overlook. The future state $x_1$ of a system can depend on the present state $x_0$ in some arbitrarily complicated non-linear way. Moreover, for a generic Markov process, the scatter of $x_1$ around its mean will depend strongly on $x_0$. Yet as long as all probability distributions are Gaussian, which is often a useful approximation for laboratory data, both of these complications vanish and we are left with the simple linear dynamics of equation (46).
3 The following well-known matrix identities are useful in the derivation of this and other matrix results in this paper:
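The step from a joint covariance matrix to linear dynamics plus noise can be made concrete as follows. This is a sketch under an explicit convention that is an assumption of the example: `B` denotes the cross-covariance block $\langle x_0 x_1^t \rangle$ (with zero means), so that the linear map is $B^t C_0^{-1}$ and the noise covariance is $C_1 - B^t C_0^{-1} B$; with the opposite convention for B the transposes swap. The "true" matrices are randomly generated stand-ins, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

# Hypothetical ground truth for x1 = A x0 + n, used only to build a
# consistent joint covariance matrix.
A_true = 0.5 * rng.standard_normal((d, d))
L = rng.standard_normal((d, d))
Sigma_true = L @ L.T + np.eye(d)          # noise covariance (positive definite)
L0 = rng.standard_normal((d, d))
C0 = L0 @ L0.T + np.eye(d)                # covariance of x0

# Blocks of the joint covariance implied by the linear model.
B = C0 @ A_true.T                         # <x0 x1^t> under the assumed convention
C1 = A_true @ C0 @ A_true.T + Sigma_true  # covariance of x1

# Recover the linear map and the noise covariance from the blocks alone.
A = np.linalg.solve(C0, B).T              # B^t C0^{-1}, since C0 is symmetric
Sigma = C1 - B.T @ np.linalg.solve(C0, B) # C1 - B^t C0^{-1} B
print(np.max(np.abs(A - A_true)), np.max(np.abs(Sigma - Sigma_true)))
```

Note that the recovered noise covariance does not depend on $x_0$, which is the property that makes the affine-plus-noise interpretation possible.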

B. Autoregressive processes
Let us now briefly review the formalism of so-called autoregressive processes and how it relates to our problem at hand. A simple special case of the above is where the random process is stationary, i.e., where the statistical properties are independent of time. This implies that $m_i = m$ and $C_i = C$ for some m and C that are independent of i. For a stationary process, it is convenient to define new zero-mean variables $x'_i \equiv x_i - m$. Dropping the prime for simplicity, this allows us to rewrite equation (46) as
$$x_{i+1} = Ax_i + n_i, \qquad (48)$$
where the noise vectors $n_i$ have vanishing mean and vanishing correlations between different times, i.e., $\langle n_i n_j^t \rangle = \delta_{ij}\Sigma$. The covariance matrix between vectors at two subsequent times is therefore given by equation (49). Even if the random process is not stationary initially, it will eventually converge to a stationary state where the covariance is time-independent as long as all eigenvalues of A have magnitude below unity, so that memory of the past gets exponentially damped over time. Once the covariance has become time-independent, equation (49) implies that $C = ACA^t + \Sigma$. This is known as the Lyapunov equation, and is readily solved by special-purpose techniques or, rapidly enough, by simply iterating it to convergence. If we write the covariance matrix $\langle xx^t \rangle$ measured from actual time series data in block form, then equating it with equation (49) lets us compute the matrices A and Σ that we need from the data. These equations hold regardless of whether the probability distributions are Gaussian or not. If the noise n is Gaussian, then all distributions will be Gaussian in the steady state, so this is an alternative way of deriving equations (60) and (45) (without the subscripts).
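The remark that the Lyapunov equation can be solved by simply iterating it to convergence can be sketched in a few lines. The function name `stationary_covariance`, the tolerance, and the randomly generated A-matrix (rescaled so that its largest eigenvalue magnitude is 0.9) are assumptions made for this illustration.

```python
import numpy as np

def stationary_covariance(A, Sigma, tol=1e-12, max_iter=100_000):
    """Iterate C -> A C A^t + Sigma until the relative change is below tol."""
    C = Sigma.copy()
    for _ in range(max_iter):
        C_next = A @ C @ A.T + Sigma
        if np.max(np.abs(C_next - C)) <= tol * np.max(np.abs(C_next)):
            return C_next
        C = C_next
    raise RuntimeError("no convergence; are all |eigenvalues| of A below 1?")

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))   # spectral radius 0.9
Sigma = np.eye(4)

C = stationary_covariance(A, Sigma)
residual = np.max(np.abs(C - (A @ C @ A.T + Sigma)))
print(residual)
```

The iteration converges geometrically when all eigenvalues of A have magnitude below unity, which is the same condition the text gives for the process to reach a stationary state. For large matrices a dedicated solver would likely be preferable.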
In Section II D, we saw how we can equally well interpret our system as a Markov process operating backward in time, where the future causes the past. Repeating the above derivation for this case gives the time-reversed counterparts of A and Σ (equation 54).

C. Optimal factorization
In summary, a Markov process p 1 = Mp can be described much more simply when all probability distributions are Gaussian: instead of keeping track of the infinite-dimensional matrix M or the infinite-dimensional rank-4 tensor p, we merely need to keep track of the finite-dimensional covariance matrix T, from which we can compute and quantify the deterministic and causal parts of the dynamics as the matrices A and Σ, respectively.
Let us now translate the rest of our results from our integration taxonomy into this simpler formalism. To separate out the effects occurring within and between the subsystems A and B, let us name the corresponding blocks of the A-matrix and of the matrix $T \equiv \langle xx^t \rangle$ from equation (39). Analogously to how equation (6) gave us equation (43), equation (27) now gives the optimal factorization. In other words, the "o"-factorization approximates $x_1 = Ax_0 + n$ by equation (62), where the noise vector has zero mean and the covariance matrix of equation (63). We see that tensor factorization in the previous section now corresponds to the matrices A and Σ being block-diagonal.

D. Noising factorization
Equation (48) tells us that
$$x^A_1 = A_A x^A_0 + A_{AB} x^B_0 + n^A, \qquad x^B_1 = A_{BA} x^A_0 + A_B x^B_0 + n^B. \qquad (64)$$
The idea with noising is to take the terms $A_{AB} x^B_0$ and $A_{BA} x^A_0$ and reinterpret them not as signal but as noise, with zero mean and uncorrelated with anything else. The noising option "n" is unfortunately undefined for this continuous-variable case, because it says to use a uniform distribution for these noised versions of $x^A_0$ and $x^B_0$, which has infinite variance and hence gives, e.g., $\langle x^B_0 x^{Bt}_0 \rangle = \infty$ when $x^B_0$ is noised. The mild noising option "m", however, remains well-defined, saying to use the actual distributions for these noised versions of $x^A_0$ and $x^B_0$, whose covariances are finite. Computing the first and second moments of equation (64) therefore tells us that "m"-factorization approximates $x_1 = Ax_0 + n$ by equation (65), where the noise vector has zero mean and the covariance matrix of equation (66). Note that in contrast to the "o"-factorization of equation (62), the "m"-factorization has no tildes on the $A_A$ and $A_B$ matrices in equation (65).

E. Results
We now have all the tools we need to derive the Gaussian versions of the φ-formulas in Table II.Starting with equation (30), interpreting the sum in equation ( 26) as an integral and performing it when p is the Gaussian distribution of equation ( 38) gives the well-known formula for the mutual information between two multivariate Gaussian random variables.This immediately gives the five matrix formulas for φ M , φ B , φ otsk , φ ofsk and φ xfkk in the right column of Table II.
Starting with the KL-divergence definition $d_{KL}(p, q) \equiv \sum_i p_i \log \frac{p_i}{q_i}$ from Table V, we again interpret the sum as an integral and use equation (38). This gives the well-known formula (equation 68) for the KL-divergence between two Gaussian probability distributions $f_p$ and $f_q$ with means $m_i$ and covariance matrices $C_i$ (i = p, q), where $\Delta m \equiv m_p - m_q$. The first term in equation (68) thus represents the mismatch between the means and the remainder (which is also guaranteed to be non-negative) represents the mismatch between the covariances. For φ ofuk, the future distribution $p(x_1)$ with mean zero and covariance matrix C is approximated by the distribution $q(x_1)$ that has mean zero and covariance matrix $\tilde{A} C \tilde{A}^t + \tilde{\Sigma}$, which follows from equations (62) and (63). Substituting these means and covariance matrices into equation (68) gives the matrix formula for φ ofuk in Table II. For φ mask, both means again vanish, but now the future distribution $p(x^A_1)$ has covariance matrix $C_A$ while the approximation $q(x^A_1)$ has the covariance matrix that follows from equations (65) and (66).
For the remaining options in Table II, i.e., φ ofkk, φ oakk, φ opkk, φ makk and φ mpkk, the means do not vanish, since they reflect information about the known state. For φ ofkk, the future distribution $p(x_1)$ with mean $Ax_0$ and covariance matrix Σ is approximated by the distribution $q(x_1)$ that has mean $\tilde{A}x_0$ and covariance matrix $\tilde{\Sigma}$, so equation (68) gives the matrix formula for φ ofkk in the table. For φ oakk, the future distribution $p(x^A_1)$ has mean $A_A x^A_0 + A_B x^B_0$ and covariance matrix Σ, while the approximation $q(x^A_1)$ has mean $A_A x^A_0$ and covariance matrix $\Sigma_A$. Finally, for φ makk, the future distribution $p(x^A_1)$ with mean $A_A x^A_0$ and covariance matrix $\Sigma_A$ is approximated by $q(x^A_1)$ with mean $\bar{A}_A x^A_0$ and covariance matrix $\Sigma_A$. The time-reversed measures φ opkk, φ mpsk and φ mpkk are identical to φ oakk, φ mask and φ makk, but with A and Σ replaced by their time-reversed versions Ã and Σ from equation (54).
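The two Gaussian building blocks used in this section, the mutual information between two blocks of a Gaussian vector and the KL-divergence between two Gaussians, can be sketched as below. The formulas are the standard textbook expressions, written here in bits; the exact arrangement of terms in equation (68) and in Table II is not reproduced in this section, so the function names and argument order are assumptions of this example.

```python
import numpy as np

LN2 = np.log(2.0)

def gaussian_kl_bits(m_p, C_p, m_q, C_q):
    """d_KL(p, q) in bits for Gaussians with means m_* and covariances C_*."""
    d = len(m_p)
    dm = np.asarray(m_p, dtype=float) - np.asarray(m_q, dtype=float)
    mean_term = dm @ np.linalg.solve(C_q, dm)          # mismatch of the means
    _, logdet_p = np.linalg.slogdet(C_p)
    _, logdet_q = np.linalg.slogdet(C_q)
    cov_term = (np.trace(np.linalg.solve(C_q, C_p)) - d
                + logdet_q - logdet_p)                 # mismatch of covariances
    return 0.5 * (mean_term + cov_term) / LN2

def gaussian_mi_bits(C, d_a):
    """Mutual information in bits between the first d_a and remaining variables."""
    _, logdet = np.linalg.slogdet(C)
    _, logdet_a = np.linalg.slogdet(C[:d_a, :d_a])
    _, logdet_b = np.linalg.slogdet(C[d_a:, d_a:])
    return 0.5 * (logdet_a + logdet_b - logdet) / LN2

# Example: two unit-variance variables with correlation 0.6.
C = np.array([[1.0, 0.6], [0.6, 1.0]])
mi = gaussian_mi_bits(C, 1)

# The same number is the KL-divergence from the joint distribution to the
# product of its marginals (a block-diagonal covariance).
C_sep = np.diag(np.diag(C))
kl_to_separable = gaussian_kl_bits(np.zeros(2), C, np.zeros(2), C_sep)
print(mi, kl_to_separable)
```

The final comparison mirrors the discrete result derived earlier: the mutual information is the KL-divergence between a distribution and its best separable approximation.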

V. GRAPH-THEORY APPROXIMATION TO MAKE COMPUTATIONS FEASIBLE
A. The problem
The φ-formulas for discrete variables in the left column of Table II require working with the n × n matrix M, where $n = 2^b$ for a system of b bits. In other words, the time to evaluate φ for a given cut grows exponentially with the system size b, which becomes computationally prohibitive even for modest system sizes such as 100 bits, let alone the set of neurons in the human brain with $b \sim 10^{11}$. Even 300 bits give n greater than the number of particles in our universe.
When the system state is described not by bits but by continuous variables (such as voltages or other variables measured in fMRI, EEG, MEG or electrophysiology studies), things get even worse, since representing even a single variable requires an infinite number of bits. However, [13] pointed out that the Gaussian approximation radically simplifies things, and we saw in Section IV how φ can then be computed dramatically faster. Not only does the infinity problem go away for most measures in Table II, but the formulas in the right column are exponentially faster to evaluate than those in the left column even when each bit is replaced by a separate real number! This is because if there are b real numbers, the n × n matrix T has n = 2b, not $n = 2^b$. This means that φ can now be computed in polynomial time, more specifically $O(b^3)$ time, since the slowest matrix operations in Table II scale as $O(n^3)$.
Unfortunately, even after this exponential speedup in computing φ, computing the upper-case version Φ is still exponentially slow. This is because Φ is the minimum of φ over the exponentially many ways of splitting the system into two parts. Even if we limit ourselves to symmetric bipartitions, the number of ways to split an even number n of elements into two parts of size n/2 is
$$\binom{n}{n/2} = \frac{n!}{[(n/2)!]^2} \approx \frac{2^n}{\sqrt{\pi n/2}},$$
where we have used Stirling's approximation $n! \approx \sqrt{2\pi n}\,(n/e)^n$. In other words, examining all symmetric bipartitions is pretty much as exponentially painful as examining all $2^n$ bipartitions, because most bipartitions are close to symmetric.
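The growth of the number of symmetric bipartitions, and the quality of the Stirling estimate, can be tabulated directly. This is a small sketch using the central binomial coefficient; the function names are hypothetical.

```python
from math import comb, pi, sqrt

def symmetric_bipartitions(n):
    """Number of ways to choose which n/2 of n elements form one part."""
    return comb(n, n // 2)

def stirling_estimate(n):
    """Stirling approximation 2^n / sqrt(pi n / 2) to the count above."""
    return 2.0**n / sqrt(pi * n / 2.0)

for n in (8, 16, 32, 64):
    exact = symmetric_bipartitions(n)
    print(n, exact, stirling_estimate(n) / exact)
```

For n = 16 the exact count is 12,870, the number quoted for the numerical experiment in Section V B, and the Stirling estimate is already accurate to within about two percent.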

B. An approximate solution
Being able to compute Φ approximately is clearly better than not being able to compute it at all. In this spirit, let us explore an approximation that exponentially accelerates the computation of Φ. Starting with the linear dynamics $x_{i+1} = Ax_i + n_i$ from equation (48), let us motivate our approximation by considering the case where the noise n is uncorrelated (where Σ is diagonal) so that it introduces no correlations between the two systems, regardless of the cut. This means that the only source of integration can be the A-matrix transferring information between the two subsystems. Let us visualize this information flow as a directed graph (Figure 3, bottom), where each node represents a variable i and each edge represents a non-zero element $A_{ij}$, i.e., non-zero information flow from element j to element i. If this graph consists of two disconnected parts A and B of equal size, as in the lower right corner of Figure 3, then we clearly have Φ = 0, since there is no information flow and hence no integration between these two parts. In other words, if we permute the elements so that all elements of A precede all elements of B, the matrix A becomes block-diagonal (Figure 3, middle right), for which all integration measures in the right column of Table II will give φ = 0.
Note that before the elements were permuted (Figure 3, top right), this fact that φ = 0 was less obvious.
Moreover, examining all n! permutations (or all $\binom{n}{n/2}$ symmetric bipartitions) would have been an enormously inefficient way of finding that best bipartition for which φ vanishes. In contrast, finding the connected components of a graph is quite simple, as is evident from staring at Figure 3, with complexity between O(n) and $O(n^2)$. This means that if we know that Φ = 0, then we can find the best bipartition ("cruelest cut") easily, in polynomial time.
Let us now define an approximation taking advantage of this idea: replace all unimportant elements $|A_{ij}| < \epsilon$ by zero, and adjust ε so that the largest connected component has size as close as possible to n/2. Letting this largest connected component define our approximation of the best bipartition, we now compute its φ-value and use this as our approximation for Φ.
Note that this approximation can be trivially generalized to asymmetric bipartitions. In practice, we determine ε by using the interval halving method. A final technical point is that we have two separate definitions of graph connectivity to choose between: weak and strong. A graph is strongly connected if you can move between any pair of elements following the directional arrows on the edges. This means that every element can (at least through intermediaries) affect and be affected by every other element, precisely capturing the integration spirit of [10]. Strong connectivity is therefore the logical choice when using our approximation to compute Φ 2.0, Φ 2.0′ and Φ 2.0′′, since it will reflect their property that integration vanishes for afferent and efferent pathways. A graph is weakly connected if you can move between any pair of elements ignoring edge arrows; in other words, if it simply looks connected when drawn. Using weak connectivity is arguably the better approximation for the Φ-measures that do not vanish for afferent/efferent pathways, and numerical experiments confirm this.
FIG. 4: Whereas our approximation is seen to be excellent at finding the best bipartition when not all are comparably good (i.e., when $\Phi_{max}/\Phi_{min} \gg 1$), it is seen to overestimate Φ by up to 15% (the median) when there is no clear winner (left side). From top to bottom, the three curves show the 95th, 50th and 5th percentiles of the overestimation factor. The shaded region delimits the largest overestimation possible, when $\Phi_{approx} = \Phi_{max}$.
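The thresholding procedure of this subsection can be sketched as follows. This is an illustrative implementation under several assumptions that are not taken from the paper: it uses weak connectivity only, it restricts the search for the threshold to the distinct magnitudes of the off-diagonal elements, and the function names `largest_component` and `cruelest_cut_guess` are hypothetical. Interval halving is applied to the sorted candidate thresholds, which works because the size of the largest component cannot grow when the threshold is raised. The resulting bipartition would then be passed to whichever φ-formula is being used.

```python
import numpy as np

def largest_component(mag, eps):
    """Largest weakly connected component when edges with mag < eps are dropped."""
    adj = mag >= eps
    adj = adj | adj.T                       # ignore edge direction (weak connectivity)
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    best = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:                        # depth-first search from `start`
            u = stack.pop()
            comp.append(u)
            for v in np.flatnonzero(adj[u] & ~seen):
                seen[v] = True
                stack.append(int(v))
        if len(comp) > len(best):
            best = sorted(comp)
    return best

def cruelest_cut_guess(A):
    """Return (eps, part_a, part_b) with the largest component closest to n/2."""
    n = A.shape[0]
    mag = np.abs(A).astype(float)
    np.fill_diagonal(mag, 0.0)              # self-couplings do not connect nodes
    cands = np.append(np.unique(mag[mag > 0]), np.inf)
    lo, hi = 0, len(cands) - 1              # at eps = inf every node is isolated
    while lo < hi:                          # interval halving on the threshold
        mid = (lo + hi) // 2
        if len(largest_component(mag, cands[mid])) <= n / 2:
            hi = mid
        else:
            lo = mid + 1
    best = lo
    if best > 0:                            # the next-lower threshold may be closer
        below = len(largest_component(mag, cands[best - 1]))
        here = len(largest_component(mag, cands[best]))
        if abs(below - n / 2) < abs(here - n / 2):
            best -= 1
    part_a = largest_component(mag, cands[best])
    part_b = [i for i in range(n) if i not in part_a]
    return float(cands[best]), part_a, part_b

# Toy example: even and odd elements form two strongly coupled groups that
# are only weakly coupled to each other.
idx = np.arange(8)
A_toy = np.where((idx[:, None] + idx[None, :]) % 2 == 0, 1.0, 0.01)
eps_toy, part_a_toy, part_b_toy = cruelest_cut_guess(A_toy)
print(eps_toy, part_a_toy, part_b_toy)
```

Replacing the component search by a strongly-connected-components algorithm would give the strong-connectivity variant discussed above; that variant is not implemented in this sketch.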
Figure 4 illustrates the accuracy of our approximation. For this example, we randomly⁴ generate 7,000 different 16 × 16 matrices A and compute Φ M both exactly (as the minimum of φ M over all $\binom{16}{8} = 12{,}870$ symmetric bipartitions) and using our approximation.
4 We generate A-matrices by first combining matrices $A_0$, $A_1$ and $A_2$, where $A_0$, $A_1$ and $A_2$ are random matrices (whose elements are independent Gaussian random variables with zero mean), each
For comparison, we also compute the maximum Φ M max over the bipartitions.The ratio Φ max /Φ ≥ 1 (where Φ ≡ Φ min ) quantifies how relatively decomposable a system is, whereas the ratio Φ approx /Φ ≥ 1 quantifies how well our approximation works, with a value of unity signifying that it is perfect and found the optimal bipartition.Figure 4 plots these two quantities against each other, and reveals that they are strongly related.For fairly separable systems, the approximation tends to be excellent: it gives exactly the correct answer 95% of the time when Φ max /Φ > 2 and 99.96% of the time when Φ max /Φ > 3.
When Φ max /Φ ≲ 2, on the other hand, so that there is less of a clear winner among the bipartitions, our approximation is seen to overestimate the true Φ-value by up to 15% in the median.

VI. CONCLUSIONS
Motivated by the growing interest in measuring integrated information Φ in computational and cognitive systems, we have presented a simple taxonomy of Φ-measures where they are each characterized by their choice of factorization method (5 options), choice of probability distributions to compare (3×4 options) and choice of measure for comparing probability distributions (5 options). We classify all the integration measures revealed in this taxonomy by various desirable properties, as summarized in Table I. When requiring the Φ-measures to satisfy a minimum of attractive properties, the hundreds of options reduce to a mere handful, some of which turn out to be identical. All leading contenders are summarized in Table II.
Unfortunately, these most general integration measures are unfeasible to evaluate in practice, with the computational cost growing doubly exponentially with b, the number of bits in the system: they involve a Markov matrix of size $n = 2^b$, and they also involve minimizing over approximately $N = 2^n = 2^{2^b}$ bipartitions. Generalizing the pioneering work of [13], we derive formulas for the Gaussian case that are exponentially faster, involving manipulations of a matrix whose size grows as 2b rather than $2^b$ with the number of variables b. Moreover, we show how the second exponential can also be avoided using an approximation based on graph theory, thus reducing the computational cost from doubly exponential to merely polynomial in the system size b.
4 (continued) normalized to have their largest eigenvalue equal to unity. We then renormalize A so that its largest eigenvalue equals 0.99. The parameter η controls the typical level of integration: η = 0 gives Φ = 0 since A is block-diagonal, whereas η → ∞ gives minimal integration, with no special cut put in by hand; η is randomly chosen to be 0.1, 0.3, 0.5, 0.7, 1, 2 or 10 with equal probability. Once we have generated A, we compute C as the solution to the Lyapunov equation $C = ACA^t + \Sigma$ with Σ = I.
A. Which Φ-measures are best?
As described in detail in Section III F, six Φ-measures stand out from the taxonomy of hundreds of measures as particularly attractive: Φ M, Φ M kk, Φ 3.0, Φ 2.0, Φ 2.0′ and Φ 2.0′′. Φ M retains all the attractive features of the Barrett/Seth measure Φ B and adds further improvements: it is guaranteed to vanish for separable systems and to never be negative. If state-dependence is viewed as desirable, then its cousin Φ M kk adds that feature too. Φ 3.0 is the measure advocated by IIT3.0 and has the many attractive features described in [11]. It has the drawback of being the slowest of all the measures to evaluate numerically: its definition involves a linear programming problem which needs to be solved numerically, and even with the fastest algorithms currently available, the computation for a given bipartition grows faster than quadratically with the number of system states, which in turn grows exponentially with the number of bits and is infinite for continuous variables.
The remaining three top measures, Φ 2.0, Φ 2.0′ and Φ 2.0′′, share with Φ 3.0 the arguably desirable feature of vanishing for afferent and efferent systems, but are much quicker to compute. Φ 2.0 is the measure advocated by IIT2.0 [10] and elegantly depends only on the system's dynamics and present state, not on any assumptions about which states are more probable. Its drawback of being undefined for continuous variables is overcome by its cousin Φ 2.0′.
A potential philosophical objection to both Φ 2.0 and Φ 2.0′ is that they are arguably not measures of integration, but measures of how suboptimal the factorizations "n" and "m" are, since they would both vanish if an optimal factorization were used; the measure Φ 2.0′′ eliminates this concern.

B. Outlook
Although the results in this paper will hopefully prove useful, there is ample worthwhile work left to do on integration measures.
One major open question is how to best handle asymmetric partitions. We deliberately sidestepped this challenge in the present paper, since it is independent of our results, which is why the subtle normalization issue raised by [10], [13] and [11] never entered. The crux is that if we apply any of the measures in our taxonomy with an asymmetric bipartition, the resulting φ-value will tend to get small when either of the two subsystems is very small, so simply defining Φ as the minimum of φ over all bipartitions (symmetric or not) makes no sense. IIT3.0 makes an interesting proposal for how to handle asymmetric partitions, and it is worthwhile exploring whether there are other attractive options as well.
Another foundational question is whether our taxonomy can be placed on a firmer logical footing. Although our classification based on factorization, comparison, conditioning and measure may seem sensible and exhaustive, it is interesting to consider whether one or several Φ-measures can be rigorously derived from a small set of attractive axioms alone, in the same spirit as Claude Shannon derived his famous entropy formula, equation (26).
A more practical question involves exploring ways of generalizing and further improving our graph-theory-based approximation for exponential speedup. One obvious generalization would involve taking advantage of the structure of Σ (which our method ignored) and the effect of x (for those Φ-measures that are state-dependent). Another interesting opportunity is to generalize from continuous Gaussian systems to arbitrary discrete systems. For example, if the system consists of b different bits coupled by a nonlinear network of gates, one can apply a similar graph-theory approach by defining a b × b coupling matrix A ij that in some way quantifies how strongly flipping the jth bit would affect the ith bit at the next timestep.
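As a minimal sketch of one such coupling matrix, the Python snippet below sets A ij to the fraction of the 2^b system states for which flipping the jth bit changes the ith bit at the next timestep. This particular definition, the uniform weighting over states, and the names `coupling_matrix` and `toy_step` are illustrative assumptions, not a definition taken from this paper.

```python
from itertools import product

def coupling_matrix(step, b):
    """Return a b x b matrix A where A[i][j] is the fraction of the 2**b
    states for which flipping bit j changes bit i of the next state.
    (Hypothetical choice of coupling strength, uniform over states.)"""
    counts = [[0] * b for _ in range(b)]
    states = list(product((0, 1), repeat=b))
    for x in states:
        y = step(x)
        for j in range(b):
            flipped = list(x)
            flipped[j] ^= 1                      # flip the j-th bit
            y_flipped = step(tuple(flipped))
            for i in range(b):
                if y[i] != y_flipped[i]:         # bit i of the next state changed
                    counts[i][j] += 1
    return [[c / len(states) for c in row] for row in counts]

def toy_step(x):
    # Hypothetical 3-bit gate network, for illustration only:
    # bit 0 <- x1 AND x2, bit 1 <- x0 XOR x1, bit 2 <- copy of x2
    return (x[1] & x[2], x[0] ^ x[1], x[2])

A = coupling_matrix(toy_step, 3)
```

The cost of this brute-force version grows as 2^b, so it is only meant to make the idea concrete for small b; for larger systems the average over states would presumably have to be sampled.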
Last but not least, a veritable goldmine of data is becoming available in neuroscience and other fields, and it will be fascinating to measure Φ for these emerging data sets. In particular, the exponentially faster Φ-measures we have proposed will hopefully facilitate quantitative tests of theories of consciousness.

FIG. 3: Illustration of our fast Φ-approximation for an n = 16 example. The structure of the A-matrix can be visualized either as a grid (top four examples), where each pixel color shows the value of the corresponding element Aij ranging from the smallest (black) to the largest (white), or as a graph (bottom examples) showing all non-zero matrix elements. Our method zeros all matrix elements whose magnitude |Aij| falls below the threshold that makes the largest connected graph component involve merely half of the elements, which in the matrix picture means that there is a permutation of the elements (rows and columns) rendering the matrix block-diagonal (middle right). Whereas it would take exponentially long to try all matrix permutations, graph connectivity can be determined in polynomial time, thus enabling us to rapidly find a good approximation for the "cruelest cut" bipartition.
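The thresholding step described in this caption can be sketched in a few lines of Python. The sketch below is not the implementation used for the figure: it assumes that couplings are treated as undirected (an edge exists if either |Aij| or |Aji| exceeds the threshold), that diagonal elements are ignored, and that candidate thresholds are simply the distinct coupling magnitudes. The function names are hypothetical, and how the resulting components are assembled into a bipartition is left out.

```python
def component_labels(adj):
    """Label the connected components of an undirected graph given as a
    boolean adjacency matrix, using depth-first search (polynomial time)."""
    n = len(adj)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(n):
                if adj[u][v] and labels[v] == -1:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def threshold_components(A):
    """Raise the threshold on |A_ij| until the largest connected component
    contains at most half of the elements; return (threshold, labels)."""
    n = len(A)
    # Symmetrized coupling magnitudes (an assumption), diagonal ignored.
    W = [[0.0 if i == j else max(abs(A[i][j]), abs(A[j][i]))
          for j in range(n)] for i in range(n)]
    for t in sorted({w for row in W for w in row}):
        adj = [[w > t for w in row] for row in W]   # zero all couplings <= t
        labels = component_labels(adj)
        largest = max(labels.count(c) for c in set(labels))
        if largest <= n // 2:
            return t, labels
    return None

# Toy 4-element example: two strongly coupled pairs, weakly cross-coupled.
A_example = [[0.0, 1.0, 0.1, 0.0],
             [1.0, 0.0, 0.0, 0.1],
             [0.1, 0.0, 0.0, 1.0],
             [0.0, 0.1, 1.0, 0.0]]
t, labels = threshold_components(A_example)
```

In this toy example the weak cross-couplings are removed first, which splits the graph into the two strongly coupled pairs. Each connectivity check costs O(n^2), in contrast to the exponentially many bipartitions an exhaustive search would have to try.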

FIG. 4: How well our fast Φ-approximation works for 7,000 simulations of the n = 16 Φ M -example described in the text. Whereas it is seen to be excellent at finding the best bipartition when not all are comparably good (i.e., when Φmax/Φmin ≫ 1), the median overestimate of Φ is seen to reach up to 15% when there is no clear winner (left side). From top to bottom, the three curves show the 95th, 50th and 5th percentiles of the overestimation factor. The shaded region delimits the largest overestimation possible, when Φapprox = Φmax.

TABLE I: Properties of different integration measures. All but the third are desirable properties; capitalized N/Y (no/yes)
Table IV specifies the options for how p and q are computed and how q