Decomposing past and future: Integrated information decomposition based on shared probability mass exclusions

  • Thomas F. Varley

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Psychological and Brain Sciences, Indiana University Bloomington, Bloomington, IN, United States of America, School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, IN, United States of America


A core feature of complex systems is that the interactions between elements in the present causally constrain their own futures, and the futures of other elements as the system evolves through time. To fully model all of these interactions (between elements, as well as ensembles of elements), it is possible to decompose the total information flowing from past to future into a set of non-overlapping temporal interactions that describe all the different modes by which information can be stored, transferred, or modified. To achieve this, I propose a novel information-theoretic measure of temporal dependency (Iτsx) based on the logic of local probability mass exclusions. This integrated information decomposition can reveal emergent and higher-order interactions within the dynamics of a system, as well as refining existing measures. To demonstrate the utility of this framework, I apply the decomposition to spontaneous spiking activity recorded from dissociated neural cultures of rat cerebral cortex to show how different modes of information processing are distributed over the system. Furthermore, being a localizable analysis, Iτsx can provide insight into the computational structure of single moments. I explore the time-resolved computational structure of neuronal avalanches and find that different types of information atoms have distinct profiles over the course of an avalanche, with the majority of non-trivial information dynamics happening before the first half of the cascade is completed. These analyses allow us to move beyond the historical focus on single measures of dependency such as information transfer or information integration, and explore a panoply of different relationships between elements (and groups of elements) in complex systems.

1 Introduction

What does it mean for a complex system to have “structure,” or even to be a “system” at all? Nature abounds with systems: almost every object, when examined closely enough, is actually a composite structure, composed of many interacting components. The world is a dynamic congeries of complex interactions and relationships. It is those relationships that define the nature and structure of the systems of which they are a part. For a system to have “structure,” its behaviour in the future must be some consequence of its behaviour in the past. When parts of the system interact, the states of individual elements, or ensembles of elements, constrain their own possible futures, the futures of those components they interact with, and ultimately, the future of the system as a whole. For example, a single neuron embedded in a neuronal network might fire at some time t − τ: that firing constrains its own future (albeit transiently) due to subsequent hyper-polarization and the refractory period. It also informs on the possible futures of all those post-synaptic neurons to which it is coupled: the probability that they will fire changes after one of their parents fires, and so on. In particular cases, the firing of a single neuron (or just a few neurons) may radically constrain the future of the entire brain (for example, if it triggers an epileptic seizure).

The entire scientific endeavour is, in some sense, built on uncovering these dependencies and understanding their specifics. For a complex system X, comprised of many interacting parts, it is possible to quantify the total degree to which its future can be predicted based on its past with the excess entropy [1]:

$$E(X) = I(X_{-\infty:t} ; X_{t:\infty}) \tag{1}$$

Where X−∞:t corresponds to the joint state of every element in X, at every time from the first moment up to time t. The second term, Xt:∞, indicates the joint state of every element at every time from t to the infinite future (I adopt the Python-like notation from [2]). Accounting for extended periods of past and future can reveal dependencies of varying durations (e.g. distance-related delays in communication networks). In practice, however, infinite data can never be recorded, so the full excess entropy is typically inaccessible. In the particular case of Markovian systems, the situation is considerably easier, as the excess entropy reduces to the mutual information between a moment and its immediate past (possibly incorporating a lag of −τ moments):

$$E_{\tau}(X) = I(X_{t-\tau} ; X_{t}) \tag{2}$$

For example, consider a two-element system with Markovian dynamics: X = {X1, X2} (following [3] I use superscripts to denote indexes and subscripts to denote time). We can compute the lag-τ excess entropy of X as a whole as:

$$E_{\tau}(X) = I(X^{1}_{t-\tau}, X^{2}_{t-\tau} ; X^{1}_{t}, X^{2}_{t}) \tag{3}$$
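As a rough illustration of the lag-τ excess entropy, the sketch below computes a plug-in (frequency-count) estimate of the mutual information between the joint state at time t − τ and the joint state at time t. The function name `lagged_mutual_information` and the toy series are hypothetical, and the estimator ignores finite-sample bias; it is not the estimation procedure used in the analyses of this paper.

```python
from collections import Counter
from math import log2

def lagged_mutual_information(states, tau=1):
    """Plug-in estimate of I(X_{t-tau}; X_t) in bits.

    `states` is a sequence of hashable joint states, one per time step;
    `tau` must be a positive integer smaller than len(states).
    """
    pairs = list(zip(states[:-tau], states[tau:]))
    n = len(pairs)
    joint = Counter(pairs)                      # counts of (past, future) pairs
    past = Counter(p for p, _ in pairs)         # marginal counts of the past state
    future = Counter(f for _, f in pairs)       # marginal counts of the future state
    return sum(
        (c / n) * log2((c / n) / ((past[p] / n) * (future[f] / n)))
        for (p, f), c in joint.items()
    )

# Hypothetical two-element system whose elements swap states on every step.
series = [(0, 1), (1, 0)] * 50
print(lagged_mutual_information(series, tau=1))
```

Because the toy system is deterministic and visits two joint states almost equally often, the estimate is close to 1 bit; a system frozen in a single state would return 0 bit.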

The excess entropy is an extremely coarse measure, aggregating all of the temporal statistical dependencies, at every scale, within a multivariate system into a single number. For a more “complete” understanding of the dependencies within a system, it would be useful to be able to decompose it into non-overlapping components that describe how particular elements (and ensembles of elements) constrain each other as the system evolves through time: for example, how does the state of X1 at time t − τ constrain its own future? How does it constrain the future of X2? Other, more exotic dependencies are also possible: for example, the joint state of X1 and X2 together may constrain the future of just X1 (a phenomenon sometimes referred to as “downward causation”, which has been the subject of intense philosophical debate [4, 5]). There may be information about the future of X1 that is redundantly disclosed by observing either X1 alone or X2 alone, and so on. How can all of these different dependencies be untangled?

One possible path forward comes from the field of information decomposition. Classically, information decomposition concerns itself with the question of how best to understand how different ensembles of predictor variables collectively disclose information about a single target variable [6, 7]. Since the original introduction of the partial information decomposition (PID) framework by Williams and Beer in 2010, researchers in complex systems science, information theory, and theoretical neuroscience have collectively worked to deepen our understanding of multivariate information and higher-order dependencies. Recently, Mediano, Rosas, and others introduced a multi-target information decomposition, the integrated information decomposition (ΦID) [3, 8, 9], which extends the original PID framework to multiple targets, enabling a full decomposition of the excess entropy into non-overlapping, “atomic” components (which I will refer to as ΦI, or integrated information, atoms). Despite being a considerable leap forward in our understanding of multivariate, temporal information, the ΦID, like the original PID, lacks a crucial element required for applications to real data: an operational definition of multivariate redundancy.

In this work, I propose such a redundancy function, termed Iτsx. Based on a recent single-target measure introduced by Makkeh et al. [10], the proposed measure generalizes the classic Shannon mutual information function to ensembles of multiple interacting elements that may redundantly disclose information about each other. I begin by reviewing the classic, single-target PID, before generalizing to the ΦID. I then introduce the Iτsx measure, and demonstrate its application in three constructed Markovian systems designed to display distinct dynamics, and finally in empirical neuronal spiking data recorded from dissociated cultures of rat hippocampal cortex [11, 12]. I conclude by discussing the strengths and limitations of the measure, and of the ΦID framework itself.

1.1 Partial information decomposition

1.1.1 Intuition & the bivariate case.

Consider the simple case with two predictor variables (X1 and X2) that jointly disclose information about a target variable Y. Basic information theory gives us the tools to assess how each Xi individually informs on Y (the marginal mutual informations, e.g. I(Xi; Y)), and how the joint state of both X1 and X2 together informs on Y: I(X1, X2; Y). The relationship between the marginal and joint mutual informations is not always straightforward, however: the sum of both marginal mutual informations can be greater than or less than the joint mutual information in various contexts. If I(X1; Y) + I(X2; Y) > I(X1, X2; Y), then there must be some information about Y that is redundantly present in both X1 and X2 individually, and so when the two marginal mutual informations are summed, that redundant information is “double counted.” Conversely, if I(X1; Y) + I(X2; Y) < I(X1, X2; Y), then there is information about Y in the joint state of X1 and X2 that is only accessible when the two are considered together and not accessible by looking at any individual X. These comparisons of “wholes” to “parts” are only rough heuristics, however, as redundant and synergistic information can co-exist in a set of predictor variables [6]: the direction of the inequality only indicates whether synergistic or redundant information dominates the interaction.

The seminal contribution of Williams and Beer was to provide a mathematical framework that allowed for a complete decomposition of the joint mutual information into non-overlapping, additive “atoms” of information:

$$I(X^{1}, X^{2}; Y) = \mathrm{Red}(X^{1}, X^{2}; Y) + \mathrm{Unq}(X^{1}; Y/X^{2}) + \mathrm{Unq}(X^{2}; Y/X^{1}) + \mathrm{Syn}(X^{1}, X^{2}; Y) \tag{4}$$

where Red(X1, X2; Y) is the redundant information about Y that could be learned by observing either X1 or X2 individually, Unq(X1; Y/X2) is the information about Y that is uniquely disclosed by X1 (in the context of X2, and vice versa for the other unique atom), and Syn(X1, X2; Y) is the synergistic information about Y that can only be learned by observing X1 and X2 simultaneously. Furthermore, the “marginal mutual informations” can be broken down into the same atomic components:

$$\begin{aligned} I(X^{1}; Y) &= \mathrm{Red}(X^{1}, X^{2}; Y) + \mathrm{Unq}(X^{1}; Y/X^{2}) \\ I(X^{2}; Y) &= \mathrm{Red}(X^{1}, X^{2}; Y) + \mathrm{Unq}(X^{2}; Y/X^{1}) \end{aligned} \tag{5}$$

The result (in the case of two predictor variables) is an under-determined system with three known values (the three mutual information terms) and four unknown values (each of the partial-information atoms). If any one atom can be determined, then the remaining three are resolved “for free.” Classical information theory does not provide any specific functions for any of these terms [13], and consequently their development is an area of active, and on-going, research. It is most common to begin by defining a redundancy function [6], although approaches based on defining unique [14, 15] and synergistic information [16, 17] have also been proposed. Unfortunately, if the number of sources is greater than two, the resulting decompositions of the joint and marginal mutual informations are not so constrained and more advanced mathematical machinery is required to decompose the joint mutual information.
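The “for free” step can be sketched in a few lines: once a value for the redundancy is supplied, the two unique atoms and the synergy follow from the three mutual information terms by subtraction. The function `solve_bivariate_pid` is a hypothetical helper, and the redundancy used in the example (the minimum of the two marginal mutual informations) is chosen purely for illustration; it is not the measure proposed in this paper.

```python
def solve_bivariate_pid(i_1, i_2, i_12, redundancy):
    """Resolve the remaining three atoms of Eqs 4 and 5 once redundancy is fixed.

    i_1, i_2 : marginal mutual informations I(X1; Y) and I(X2; Y), in bits
    i_12     : joint mutual information I(X1, X2; Y), in bits
    """
    unique_1 = i_1 - redundancy                           # from Eq 5
    unique_2 = i_2 - redundancy                           # from Eq 5
    synergy = i_12 - redundancy - unique_1 - unique_2     # from Eq 4
    return {"red": redundancy, "unq_1": unique_1, "unq_2": unique_2, "syn": synergy}

# XOR gate with uniform inputs: I(X1; Y) = I(X2; Y) = 0 bit and I(X1, X2; Y) = 1 bit.
# The redundancy value min(I(X1; Y), I(X2; Y)) is an illustrative assumption.
xor_atoms = solve_bivariate_pid(0.0, 0.0, 1.0, redundancy=min(0.0, 0.0))
print(xor_atoms)
```

For the XOR gate every bit of the joint mutual information lands in the synergy atom, which matches the intuition that neither input alone says anything about the output.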

1.1.2 The partial information lattice & Möbius inversion.

For a collection of N predictor variables X = {X1, …, XN} jointly informing on a single target Y, we are interested in understanding how every Xi ∈ X (and ensembles of Xs joined by the logical conjunction) discloses information about the target. This requires understanding all the ways that the elements of X can redundantly, uniquely, and synergistically share information. Williams and Beer showed that, given a measure of redundant (or shared) information between some collection of sources and the target (denoted I∩(⋅; Y) here), the “atomic” components of the joint mutual information are constrained into a partially ordered set called the partial information lattice. The derivation of the lattice will be briefly described below, but see Gutknecht et al. for a more complete discussion [7].

We begin by defining the set of sources that may disclose information about Y. This is given by the set of all subsets of X (excluding the empty set), denoted as 𝒫₁(X). Every (potentially multivariate) source can be thought of as an aggregated macro-variable, whose state is defined by the logical-AND operator over all of its constituent elements. For example, if our predictor variables are X1, X2 and X3, then the collection of sources is:

$$\mathcal{P}_1(\mathbf{X}) = \{\{X^{1}\}, \{X^{2}\}, \{X^{3}\}, \{X^{1}, X^{2}\}, \{X^{1}, X^{3}\}, \{X^{2}, X^{3}\}, \{X^{1}, X^{2}, X^{3}\}\} \tag{6}$$

For some (potentially overlapping) collection of sources, A1, …, Ak, the redundancy function I∩(A1, …, Ak; Y) quantifies the information about Y that can be learned by observing A1 ∨ … ∨ Ak. The domain of I∩(⋅; Y) is given by the set of all collections of sources such that no source is a subset of any other:

$$\mathcal{A} = \{\alpha \in \mathcal{P}_1(\mathcal{P}_1(\mathbf{X})) : \forall A_i, A_j \in \alpha,\ A_i \not\subset A_j\} \tag{7}$$

This restriction means that 𝒜 is also partially ordered:

$$\forall \alpha, \beta \in \mathcal{A}:\quad \alpha \preceq \beta \iff \forall B \in \beta,\ \exists A \in \alpha \text{ such that } A \subseteq B \tag{8}$$

The resulting lattice provides the scaffolding on which the full PID may be constructed. Every α ∈ 𝒜 corresponds to a vertex on the lattice, and the ordering reveals a structure of increasingly synergistic information-sharing relationships. For a visualization of the partial information lattices for sets of two and three predictor variables, see Fig 1.

Fig 1. Single target partial information lattices.

Examples of partial information lattices for the two simplest possible systems of multiple sources predicting a single target. On the left is the lattice for two predictor variables, and on the right is the lattice for three predictor variables. Following the notation introduced by Williams and Beer [6], sources are denoted just by index: for example {1}{2} is the information redundantly disclosed by X1 or X2, {1}{23} is the information disclosed by X1 or (X2 and X3), etc.
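The construction of the lattice vertices can be sketched by brute force: enumerate every non-empty collection of non-empty sources and keep those in which no source is a proper subset of another. The helper names below (`nonempty_subsets`, `redundancy_lattice`, `precedes`) are hypothetical, sources are represented by element indices only, and the ordering test assumes the standard Williams and Beer ordering (α precedes β when every source in β contains some source in α).

```python
from itertools import combinations

def nonempty_subsets(items):
    """All non-empty subsets of `items`, each returned as a frozenset."""
    items = list(items)
    return [frozenset(c) for r in range(1, len(items) + 1) for c in combinations(items, r)]

def redundancy_lattice(n):
    """Vertices of the partial information lattice for predictors indexed 1..n."""
    sources = nonempty_subsets(range(1, n + 1))
    vertices = []
    for alpha in nonempty_subsets(sources):
        # Keep alpha only if no source is a proper subset of another source.
        if not any(a < b for a in alpha for b in alpha):
            vertices.append(alpha)
    return vertices

def precedes(alpha, beta):
    """True if alpha sits at or below beta on the lattice."""
    return all(any(a <= b for a in alpha) for b in beta)

two = redundancy_lattice(2)
three = redundancy_lattice(3)
print(len(two), len(three))
```

The exhaustive search is only practical for very small n, since the number of candidate collections grows doubly exponentially with the number of predictors.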

With the structure of the partial information lattice set and our as-yet-undefined redundancy function I∩(⋅; Y) in place, it is possible to solve the PID for every collection of sources α on the lattice using a Möbius inversion:

$$\Pi(\alpha; Y) = I_{\cap}(\alpha; Y) - \sum_{\beta \prec \alpha} \Pi(\beta; Y) \tag{9}$$

By recursively defining the value of particular partial information atoms as the difference between the redundant information disclosed by a particular set of sources and the sum of all atoms lower on the lattice, the joint mutual information between an arbitrary number of predictor variables and a single target can be decomposed into non-overlapping components.

$$I(X^{1}, \ldots, X^{N}; Y) = \sum_{\alpha} \Pi(\alpha; Y) \tag{10}$$
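As a worked instance of the recursion, the Möbius inversion on the two-predictor lattice can be written out vertex by vertex, starting from the bottom. The symbol Π for a partial information atom is notation assumed here, and the identification of the redundancy of a single source with its mutual information with the target (so that the redundancy of {1} is I(X1; Y) and that of {12} is I(X1, X2; Y)) is the usual self-redundancy property of Williams and Beer.

```latex
\[
\begin{aligned}
\Pi(\{1\}\{2\}; Y) &= I_{\cap}(\{1\}\{2\}; Y) \\
\Pi(\{1\}; Y) &= I(X^{1}; Y) - \Pi(\{1\}\{2\}; Y) \\
\Pi(\{2\}; Y) &= I(X^{2}; Y) - \Pi(\{1\}\{2\}; Y) \\
\Pi(\{12\}; Y) &= I(X^{1}, X^{2}; Y) - \Pi(\{1\}; Y) - \Pi(\{2\}; Y) - \Pi(\{1\}\{2\}; Y)
\end{aligned}
\]
```

Read against Eqs 4 and 5, the four atoms are the redundant, the two unique, and the synergistic components respectively, so the inversion recovers the bivariate decomposition as a special case.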

1.2 Integrated information decomposition

With the basic PID defined, it is possible to do a partial examination of the excess entropy. For example, Varley and Hoel [18] decomposed the joint mutual information between all elements at time t − τ and the joint state of the whole system at time t: $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X_{t})$. This method provides insights into how the states of particular elements (and ensembles of elements) collectively constrain the future of the whole system, but provides limited insights into how parts of the system constrain each other, as the future state is aggregated into a single “whole.” To partially address this issue, one could imagine doing a PID of the information that every element of the whole system at time t − τ discloses about every set of elements at time t: decomposing $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{1}_{t})$, $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{2}_{t})$, $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{1}_{t}, X^{2}_{t})$, etc. While potentially illuminating, this approach is still limited by the fact that it does not readily allow for notions of “redundancy” and “synergy” between target elements. For example, it might be natural to ask “what information synergistically disclosed by $X^{1}_{t-\tau}$ and $X^{2}_{t-\tau}$ about $X^{1}_{t}$ also applies to $X^{2}_{t}$?” (i.e. is redundantly “copied” over both elements). Similarly, when decomposing $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{1}_{t}, X^{2}_{t})$, one might want to know what information about the joint state of $X^{1}_{t}$ and $X^{2}_{t}$ could not be learned by decomposing $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{1}_{t})$ or $I(X^{1}_{t-\tau}, X^{2}_{t-\tau}; X^{2}_{t})$ alone (i.e. that information that is synergistically present in the joint state of $X^{1}_{t}$ and $X^{2}_{t}$ together). Achieving a complete decomposition of the excess entropy requires a generalization of the PID framework to account for redundancies and synergies in both the past and future of the system under study.

To address this, Mediano et al., [3, 8, 9, 19] recently introduced a generalization of the PID that allows the decomposition of multiple sources onto multiple targets. Called the integrated information decomposition (ΦID), this decomposition allows for a complete decomposition of the excess entropy.

The integrated information decomposition begins by defining a product lattice 𝒜 × 𝒜 (where 𝒜 is the single-target redundancy lattice derived above), for which each vertex in 𝒜 × 𝒜 is defined by an ordered pair α → β, with α, β ∈ 𝒜. In the case of a temporal process, α refers to a particular collection of sources observed at time t − τ that disclose information about β, a collection of sources observed at time t.

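As with the single-target partial information lattice, the product lattice is a partially ordered set, with:

$$\alpha \to \beta \preceq \alpha' \to \beta' \iff \alpha \preceq \alpha' \text{ and } \beta \preceq \beta' \tag{11}$$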

The integrated information lattice can be similarly solved via Möbius inversion, given a suitable temporal redundancy function. For a visualization of the integrated information lattice for the case of two sources and two targets, see Fig 2.

Fig 2. The integrated information lattice.

The integrated information lattice for a system X = {X1, X2}. Every vertex of the lattice corresponds to a specific “conversion of information”: information present in one mode at time t − τ that is transformed into another mode at time t. For example, {1}{2} → {1} corresponds to information that is redundantly disclosed by X1 and X2 at time t − τ that is then only uniquely disclosed by X1 at time t.
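For the two-element system, the product lattice is small enough to enumerate directly. The sketch below builds the sixteen ordered pairs and tests the ordering component-wise, under the assumption that one pair precedes another exactly when both its past and its future atom precede their counterparts on the single-target lattice. The names `ATOMS`, `precedes`, and `product_precedes` are hypothetical.

```python
from itertools import product

# The four vertices of the two-source redundancy lattice, each a list of sources.
ATOMS = {"{1}{2}": [{1}, {2}], "{1}": [{1}], "{2}": [{2}], "{12}": [{1, 2}]}

def precedes(alpha, beta):
    """Single-target ordering: every source in beta contains some source in alpha."""
    return all(any(a <= b for a in ATOMS[alpha]) for b in ATOMS[beta])

def product_precedes(pair, other):
    """Component-wise ordering on (past atom, future atom) pairs."""
    return precedes(pair[0], other[0]) and precedes(pair[1], other[1])

product_lattice = list(product(ATOMS, repeat=2))   # 4 x 4 ordered pairs
bottom = ("{1}{2}", "{1}{2}")                      # double-redundancy atom
top = ("{12}", "{12}")                             # double-synergy atom

for past, future in product_lattice:
    print(f"{past} -> {future}")
```

Under this ordering the double-redundancy atom lies below every other vertex and the double-synergy atom lies above every other vertex, while pairs such as {1} → {2} and {2} → {1} are incomparable.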

The ΦID framework deviates from the PID framework in one key way. In the original formulation by Williams and Beer, the lattice (which motivates the Möbius inversion) is derived from the axiomatic properties of the proposed redundancy measure. While there has never been universal agreement on the specific definition of “redundancy”, any function that satisfies the original axioms can be shown to induce the lattice: it follows from the definition of redundancy. In contrast, in the ΦID framework, the double-redundancy lattice is not derived from the properties of the function, but rather, imposed by the product of the “marginal” PI lattices. To address this, Mediano et al. imposed a compatibility constraint on any double-redundancy function [3, 8]. Given two (potentially, but not necessarily, multivariate) variables X and Y, the compatibility axiom requires that, if one of the variables (X or Y) is univariate, then the double redundancy function reduces to a classic, single-target redundancy function, and the ΦID reduces to the classic PID.

Mediano et al. also impose a partial-ordering criterion: if α → β ⪯ α′ → β′, then the double redundancy assigned to α → β can be no greater than that assigned to α′ → β′. This ensures that the redundancy function induces the same partial ordering on atoms that the construction of the product lattice does, ensuring consistency between the scaffold and the function.

1.2.1 Interpreting ΦID atoms.

The standard PID atoms are reasonably easy to interpret in terms of logical conjunctions and disjunctions of sources. In the case of the ΦID, the left-hand side of the integrated information atom remains the same (collections of sources that redundantly disclose information), but there is no longer a consistent target. Rather, there are again collections of sources that have their own redundant information sharing patterns. What, then, are they disclosing information about? I will discuss the answer in formal detail below; however, one proposed intuition comes from information dynamics. Information dynamics proposes to break the different “modes” of information flow in complex systems down into discrete “types of computation” or “processing” [20]. Mediano et al. [3, 8] proposed the following intuitive taxonomy of integrated information atoms on the two-element lattice:

  1. Information Storage: Information present in a particular configuration at time tτ that remains in the same configuration at time t. In the case of the two-element system, these are: {1}{2} → {1}{2}, {1} → {1}, {2} → {2}, and {12} → {12}.
  2. Causal Decoupling: The double-synergy term {12} → {12} has been given particular focus as a possible formal definition of “emergent dynamics” [9, 19, 21], as it refers to information that is persistently present in the whole, but none of the parts.
  3. Information Transfer: Information present in a single element that “moves” to another single element: {1} → {2} and {2} → {1}. Not to be confused with the transfer entropy [22], which typically involves extended histories and itself conflates unique and synergistic modes of information sharing [23].
  4. Information Erasure: Information that is initially present redundantly over multiple elements that is erased from one of the two: {1}{2} → {1} and {1}{2} → {2}.
  5. Information Copying: Information that is initially present in only a single element that is “duplicated” to be redundantly present in multiple elements: {1} → {1}{2} and {2} → {1}{2}.
  6. “Upward Causation”: A somewhat less well-defined idea: when the state of single elements constrains the future state of the entire ensemble. {1}{2} → {12}, {1} → {12}, and {2} → {12}.
  7. “Downward Causation”: A philosophically controversial concept, downward causation occurs when the synergistic joint state of the “whole” constrains the future of the individual parts. {12} → {1}{2}, {12} → {1}, and {12} → {2}.
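The taxonomy can be read as a rule that sorts each of the sixteen atoms by the mode of its past and future side. The sketch below is one possible encoding, written under two assumptions: that “upward causation” collects the atoms whose future is the synergistic {12} while the past is not, and that the double-synergy atom {12} → {12} is counted under storage (with causal decoupling treated as a special case of it). The function `classify` is hypothetical.

```python
from itertools import product

NODES = ["{1}{2}", "{1}", "{2}", "{12}"]

def classify(past, future):
    """Assign an integrated information atom to one mode of information dynamics."""
    if past == future:
        return "storage"              # includes {12} -> {12}, the causal decoupling atom
    if past == "{12}":
        return "downward causation"   # synergy in the past, parts in the future
    if future == "{12}":
        return "upward causation"     # parts in the past, synergy in the future
    if past == "{1}{2}":
        return "erasure"              # redundant in the past, unique in the future
    if future == "{1}{2}":
        return "copying"              # unique in the past, redundant in the future
    return "transfer"                 # unique in one element, then unique in the other

taxonomy = {}
for past, future in product(NODES, repeat=2):
    taxonomy.setdefault(classify(past, future), []).append(f"{past} -> {future}")

for mode, atoms in taxonomy.items():
    print(mode, atoms)
```

With these rules the six modes partition the lattice: four storage atoms, two each for transfer, erasure, and copying, and three each for upward and downward causation.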

This taxonomy has only begun to be explored (for example see [24–26] for intriguing results related to macro-scale brain dynamics), and a rigorous formal understanding of the relevant mathematics may help deepen our understanding of these various (and in some cases, philosophically significant) phenomena.

2 Shared exclusions & (temporal) redundancy

A peculiar quirk of the PID and its derivatives is that, while it reveals the “structure” of multivariate information, it does not provide a direct means of calculating the specific values: it assumes the existence of a well-behaved redundancy measure and builds from there. Since the initial introduction by Williams and Beer, the number of different redundancy functions has proliferated (see [10, 13, 27–36]), although to date, no measure has achieved universal acceptance or satisfies every desideratum.

Being much newer, double redundancy functions have received less attention: to date, only three have been used, most only once: a temporal minimum mutual information analysis [24, 26, 37], a measure based on the dependency lattice [9], and a generalization of the common change in surprisal measure [3]. While all of these analyses are informative, there is still room for deeper insights into the exact nature of temporal redundancy and how information conversion occurs between ensembles of variables. In this work, I generalize a recent redundancy function, the Isx measure first proposed by Makkeh et al. [10], to account for multiple targets, which I term Iτsx. I selected Isx as my starting point for three reasons. First, it illuminates an elegant connection between multivariate information sharing and formal logic. Second, it does not require arbitrary thresholds (as in the case of Iccs [33]) nor non-differentiable min/max functions (as in Immi [31] and the closely related I± [34]). Third, it is localizable, returning values for every possible configuration, rather than being an expected value over the entire distribution. Below, I introduce the basics of local information theory (a key prerequisite for defining Isx), before defining the redundancy function for single targets, and ultimately generalizing to multi-target information.

2.1 Local information theory

Thus far, I have been using the standard interpretation of mutual information as an average value over some distribution of configurations:

$$I(X; Y) = \sum_{x, y} P(x, y) \log_2 \frac{P(x|y)}{P(x)} \tag{12}$$

For any specific configuration, the local mutual information is defined as:

$$i(x; y) = \log_2 \frac{P(x|y)}{P(x)} \tag{13}$$

Unlike the expected mutual information, the local mutual information can be either positive or negative depending on whether P(x|y) or P(x) is the greater term. While the local mutual information is well-explored and has been previously used extensively to characterize “computation” in complex systems [20], it is only recently that a novel interpretive framework has emerged based on exclusions of probability mass. Finn and Lizier [38] showed that the sign and value of the local mutual information i(x; y) can be understood as a function of the amount of probability mass from P(X, Y) that is “ruled out” upon observing that X = x and Y = y. For a very simple example, consider a system where one player rolls a fair die and another has to guess the value. Initially, the guesser is maximally uncertain, as all six outcomes are equiprobable. However, if they learn that the number rolled was even, then they have gained information proportional to the total probability mass of all excluded possible outcomes. Formally, the local mutual information can be re-written in terms of probability mass exclusions as:

$$i(x; y) = \log_2 \frac{P(y) - P(y \cap \bar{x})}{1 - P(\bar{x})} - \log_2 P(y) \tag{14}$$

where $\bar{x}$ denotes the set of configurations excluded by observing X = x.

In this relationship, if y is comparatively more likely after accounting for x, then i(x; y) > 0, and if it is less likely, then the value is negative.
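The die example can be checked numerically. The sketch below computes the local mutual information between a specific roll and the observation “the roll is even” in two ways: from the conditional probability, and from the probability mass excluded by the observation, assuming the exclusion form log2[(P(y) − P(y ∩ x̄)) / (1 − P(x̄))] − log2 P(y), where x̄ is the set of outcomes ruled out. The function names are hypothetical, and both functions require that the roll is consistent with the observation (otherwise the logarithm of zero is undefined).

```python
from fractions import Fraction
from math import log2

outcomes = range(1, 7)
p = {o: Fraction(1, 6) for o in outcomes}   # fair six-sided die

def local_mi_conditional(y, x_event):
    """i(x; y) = log2[P(y | x) / P(y)], where x is the event 'the roll is in x_event'."""
    p_x = sum(p[o] for o in x_event)
    p_xy = p[y] if y in x_event else Fraction(0)
    return log2((p_xy / p_x) / p[y])

def local_mi_exclusions(y, x_event):
    """The same quantity, written in terms of the probability mass excluded by x."""
    excluded = [o for o in outcomes if o not in x_event]        # outcomes ruled out by x
    p_excluded = sum(p[o] for o in excluded)
    p_y_and_excluded = p[y] if y in excluded else Fraction(0)
    return log2((p[y] - p_y_and_excluded) / (1 - p_excluded)) - log2(p[y])

even = {2, 4, 6}
print(local_mi_conditional(4, even), local_mi_exclusions(4, even))
```

Learning that the roll is even excludes half of the probability mass, none of which belonged to the outcome 4, so both routes give 1 bit.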

2.2 Single-target redundancy based on shared exclusions (Isx)

The classic, local mutual information is bivariate, quantifying the information shared between two variables. To construct a function that accounts for multiple sources redundantly disclosing information about a single target, Makkeh et al. leveraged a link between redundant information and logical implication [7, 10]. Briefly, given some set of logical statements ψ1, …, ψk, the information that is redundantly disclosed by all of them is the information learned if ψ1 = True OR ψ2 = True OR … ψk = True. From this, they define a logical redundancy measure that induces the same lattice as the PID. The function provides a mapping between every atom on the lattice and a logical statement. For example, the atom {1}{2}{3} maps to the information disclosed if ψ1 ∨ ψ2 ∨ ψ3 = True, and the atom {1}{2, 3} maps to the statement ψ1 ∨ (ψ2 ∧ ψ3) = True, and so on. The application to random variables is straightforward: given some set of source variables x1, …, xk informing on a target y, the information about y redundantly disclosed by all the sources is the information that could be learned by observing x1 alone OR x2 alone … OR xk alone. The logic extends to more complicated atoms, such as {1}{2, 3}, which is the information about y that would be learned by observing just x1 alone OR the joint state of x2 AND x3 together.

As in the case of local mutual information, isx defines “disclosing information” in terms of probability mass exclusions. For example, observing X1 = x1 ∨ (X2 = x2 ∧ X3 = x3) changes the probability of observing Y = y. As with the local mutual information, depending on how P(y) changes, the value of isx can be positive or negative.

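More formally, consider a set of (potentially overlapping, potentially multivariate) sources a1, …, ak that collectively disclose information about a target y. The information redundantly shared between them can be defined as a function of the probability mass of P(Y) that would be excluded by observing a1 ∨ … ∨ ak:

$$i_{sx}(a_1, \ldots, a_k; y) = \log_2 \frac{P(y) - P(y \cap (\bar{a}_1 \cap \ldots \cap \bar{a}_k))}{1 - P(\bar{a}_1 \cap \ldots \cap \bar{a}_k)} - \log_2 P(y) \tag{15}$$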

For the special case of only one source, it is clear that isx(a; y) = i(a; y), which is itself just a regular (local) joint mutual information. In this sense, isx can be understood as generalizing the Shannon mutual information to account for ensembles of multiple sources that redundantly share information about y [7].

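Like the standard local mutual information, isx can return both positive and negative values (corresponding to informative and misinformative probability mass exclusions respectively). These two types of exclusion can be quantified by further decomposing isx into two components:

$$i_{sx}(a_1, \ldots, a_k; y) = i^{+}_{sx}(a_1, \ldots, a_k; y) - i^{-}_{sx}(a_1, \ldots, a_k; y) \tag{16}$$

$$i^{+}_{sx}(a_1, \ldots, a_k; y) = \log_2 \frac{1}{P(a_1 \cup \ldots \cup a_k)} \tag{17}$$

$$i^{-}_{sx}(a_1, \ldots, a_k; y) = \log_2 \frac{P(y)}{P(y \cap (a_1 \cup \ldots \cup a_k))} \tag{18}$$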

In the context of a single-target PID, the informative and misinformative components (i+sx and i−sx) are provably non-negative and satisfy the original desiderata proposed by Williams and Beer. The local redundant information measures can be aggregated into expected measures over the distribution of configurations in the same way as mutual information:

$$I_{sx}(A_1, \ldots, A_k; Y) = \sum_{a_1, \ldots, a_k, y} P(a_1, \ldots, a_k, y)\, i_{sx}(a_1, \ldots, a_k; y) \tag{19}$$

and likewise for the informative and misinformative functions.
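A minimal sketch of the single-target measure on a toy system may help make the exclusion logic concrete. It assumes the union-probability form of the measure, in which the informative part is log2[1 / P(a1 ∪ … ∪ ak)] and the misinformative part is log2[P(y) / P(y ∩ (a1 ∪ … ∪ ak))]; the AND-gate distribution, the index convention (sources are tuples of positions in each outcome), and the helper names are all assumptions made for illustration.

```python
from math import log2

# Joint distribution P(x1, x2, y) for an AND gate with uniform, independent inputs.
P = {(0, 0, 0): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 1, 1): 0.25}
TARGET = 2   # position of y in each outcome tuple

def union_mass(sources, realization, require_target=False):
    """P(a_1 OR ... OR a_k), optionally intersected with the event Y = y."""
    total = 0.0
    for outcome, prob in P.items():
        matches_any = any(all(outcome[i] == realization[i] for i in src) for src in sources)
        matches_target = outcome[TARGET] == realization[TARGET]
        if matches_any and (matches_target or not require_target):
            total += prob
    return total

def i_sx(sources, realization):
    """Return (informative, misinformative, total) local redundancy in bits."""
    p_union = union_mass(sources, realization)
    p_union_and_y = union_mass(sources, realization, require_target=True)
    p_y = sum(prob for o, prob in P.items() if o[TARGET] == realization[TARGET])
    informative = log2(1 / p_union)
    misinformative = log2(p_y / p_union_and_y)
    return informative, misinformative, informative - misinformative

redundant = ((0,), (1,))   # the atom {1}{2}: x1 alone OR x2 alone
joint = ((0, 1),)          # the atom {12}: x1 AND x2 together

# Expected redundancy: average the local values over all configurations.
expected_redundancy = sum(prob * i_sx(redundant, o)[2] for o, prob in P.items())
print(i_sx(redundant, (1, 1, 1)), i_sx(joint, (1, 1, 1)), expected_redundancy)
```

With a single joint source the function returns the ordinary local mutual information (2 bits for the configuration in which both inputs and the output are 1), while for the configuration (0, 1, 0) the redundant atom is negative, since observing x1 = 0 OR x2 = 1 makes the true output less likely than before.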

2.3 Multi-target temporal redundancy based on shared exclusions (Iτsx)

We now have all the required machinery to introduce our local measure of temporal information decomposition: Iτsx. In the original isx measure, the mutual information is understood as the relative increase or decrease in the probability P(Y = y) after observing the configuration of some ensemble of sources. In Iτsx, the probability of the single target is replaced with the probability of observing b1 ∨ … ∨ bm.

It is worth considering the intuition behind this change. Suppose x = {x1, x2} and y = {y1, y2}. I am interested in what probability mass exclusions induced by x1 OR x2 are consistent with either y1 OR y2. Said differently, what information that could be learned from either x1 OR x2 (i.e. is redundantly present in both of them) is true about any configuration consistent with y1 OR y2?

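Formally:

$$i_{\tau sx}(a_1, \ldots, a_k; b_1, \ldots, b_m) = \log_2 \frac{P(b_1 \cup \ldots \cup b_m) - P((b_1 \cup \ldots \cup b_m) \cap (\bar{a}_1 \cap \ldots \cap \bar{a}_k))}{1 - P(\bar{a}_1 \cap \ldots \cap \bar{a}_k)} - \log_2 P(b_1 \cup \ldots \cup b_m) \tag{20}$$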

From here forward, I will denote ensembles of sources with α, β, etc, for the purposes of notational compactness.

From the definition of iτsx it is clear that it satisfies the compatibility criterion proposed by Mediano et al. [3, 8]. When decomposing i(x; y) with |y| = 1, Eq 20 is equivalent to Eq 15, as the union of all sources (b1 ∪ … ∪ bm) is equivalent to the single source y, and likewise for the condition where |x| = 1. This shows that iτsx is consistent with the classic PID (inducing the standard single-target redundancy lattice). In the special case of single sources (α = {a}, β = {b}), it is clear that Iτsx(a; b) = i(a; b), so Iτsx completes the generalization of local mutual information begun by isx: Iτsx is a full generalization of the mutual information to multiple sets of redundant sources and multiple sets of redundant targets.
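The reduction to ordinary local mutual information can be checked on a toy system. The sketch below assumes that the double redundancy can be written as log2[P(A ∩ B) / (P(A) P(B))], where A is the union of the past sources and B is the union of the future sources; the two-element “swap” system (each element copies the other's previous state) and all function names are hypothetical.

```python
from math import log2

def union_event(sources, realization, outcome, offset):
    """True if `outcome` matches the realization on at least one source (logical OR)."""
    return any(all(outcome[offset + i] == realization[offset + i] for i in src)
               for src in sources)

def i_tau_sx(P, alpha, beta, realization, n):
    """Local double redundancy in bits for collections of sources alpha -> beta.

    P maps outcomes (past element 0..n-1, then future element 0..n-1) to probabilities;
    alpha and beta are tuples of sources, each source a tuple of element indices.
    """
    p_a = p_b = p_ab = 0.0
    for outcome, prob in P.items():
        in_a = union_event(alpha, realization, outcome, 0)   # consistent with the past union
        in_b = union_event(beta, realization, outcome, n)    # consistent with the future union
        p_a += prob * in_a
        p_b += prob * in_b
        p_ab += prob * (in_a and in_b)
    return log2(p_ab / (p_a * p_b))

# Swap dynamics with a uniform stationary distribution: future = (past[1], past[0]).
P = {(a, b, b, a): 0.25 for a in (0, 1) for b in (0, 1)}
state = (0, 1, 1, 0)

transfer = i_tau_sx(P, ((0,),), ((1,),), state, n=2)              # {1} -> {2}
storage = i_tau_sx(P, ((0,),), ((0,),), state, n=2)               # {1} -> {1}
whole = i_tau_sx(P, ((0, 1),), ((0, 1),), state, n=2)             # {12} -> {12}
double_red = i_tau_sx(P, ((0,), (1,)), ((0,), (1,)), state, n=2)  # {1}{2} -> {1}{2}
print(transfer, storage, whole, double_red)
```

With single sources on each side the function returns the local mutual information between those sources (1 bit from element 1 to element 2, 0 bit from element 1 to itself), and with the whole system on each side it returns the full 2 bit of local excess entropy. These are values of the redundancy function, not yet of the atoms, which require the Möbius inversion over the product lattice.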

2.3.1 Multi-target redundancy & entropy decomposition.

The full iτsx function does not satisfy the partial ordering (monotonicity) criterion. Like isx, it can be negative or positive, depending on the structure of the dependency between the elements. The double-redundancy can be decomposed, however, into three redundant entropy terms that are partially ordered, and consistent with the integrated information lattice. These three redundant entropy terms induce three partial entropy decompositions [39–41]: two marginal decompositions on the classic redundancy lattice, and a joint decomposition on the product lattice.

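The double redundancy function can be re-written in terms of sums and differences of union entropies. I can re-write Eq 20 in an equivalent form:

$$i_{\tau sx}(a_1, \ldots, a_k; b_1, \ldots, b_m) = \log_2 \frac{1}{P(a_1 \cup \ldots \cup a_k)} + \log_2 \frac{1}{P(b_1 \cup \ldots \cup b_m)} - \log_2 \frac{1}{P((a_1 \cup \ldots \cup a_k) \cap (b_1 \cup \ldots \cup b_m))} \tag{21}$$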

For proof of equivalence, see Appendix A in S1 Appendix.

For some multivariate x, recall that i(x; x) = h(x). The joint entropy h(x) can be decomposed by assessing how different combinations of parts (i.e. all xi ∈ x) redundantly and synergistically disclose information about the whole [10, 41]. The single-target isx function can decompose i(x1, …, xk; x) and will find that it is equal to i+(x1, …, xk; x). There is no misinformative component (for proof, see Appendix B in S1 Appendix).

Intuitively, the redundant entropy function can be understood as quantifying how much uncertainty about x is resolved by learning x1 ∨ … ∨ xk. This function was recently explored in detail by Varley et al. [41] and denoted as hsx after [10]:

$$h_{sx}(x_1, \ldots, x_k) = \log_2 \frac{1}{P(x_1 \cup \ldots \cup x_k)} \tag{22}$$

This function has previously been shown to satisfy the relevant Williams and Beer axioms locally [10]. I can then re-write Eq 20 as:

$$i_{\tau sx}(\alpha \to \beta) = h_{sx}(\alpha) + h_{sx}(\beta) - h_{sx}(\alpha \cap \beta) \tag{23}$$
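The algebra connecting the exclusion form to the entropy form can be sketched in a few lines (the full proof is the one given in Appendix A in S1 Appendix). The shorthand A and B for the unions of past and future sources is introduced here for compactness, and the sketch assumes that Eq 20 has the same exclusion form as Eq 15 with the single target event replaced by the union of the future sources.

```latex
\[
\begin{aligned}
&\text{Let } A = a_1 \cup \ldots \cup a_k,\quad B = b_1 \cup \ldots \cup b_m,
\quad\text{so that } \bar{a}_1 \cap \ldots \cap \bar{a}_k = \bar{A}. \\
&\text{Then } P(B) - P(B \cap \bar{A}) = P(A \cap B)
\quad\text{and}\quad 1 - P(\bar{A}) = P(A), \text{ hence} \\
i_{\tau sx}(\alpha \to \beta)
&= \log_2 \frac{P(A \cap B)}{P(A)} - \log_2 P(B) \\
&= \log_2 \frac{1}{P(A)} + \log_2 \frac{1}{P(B)} - \log_2 \frac{1}{P(A \cap B)} \\
&= h_{sx}(\alpha) + h_{sx}(\beta) - h_{sx}(\alpha \cap \beta).
\end{aligned}
\]
```

The last line makes the sign structure explicit: the two marginal terms are always non-negative, and the value becomes negative whenever the joint term exceeds their sum.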

This framing provides a different, but complementary, perspective on iτsx. The first term, hsx(α), quantifies how uncertainty about the global joint state (x, y) is resolved by learning past states a1 OR … OR ak. Similarly, hsx(β) quantifies the uncertainty about (x, y) resolved by learning future states b1 OR … OR bm.

The final term, hsx(α ∩ β), is a little less straightforward: it reflects the structure of the double redundancy lattice and satisfies the required partial ordering imposed by Mediano et al. [8]. Makkeh et al. [10] showed that, if α ⪯ α′ on the marginal (classic) redundancy lattice, then the set of configurations consistent with α′ is a subset of those configurations consistent with α. This ensures that hsx(α) ≤ hsx(α′), and likewise for β. The double redundant entropy term hsx(α ∩ β) quantifies the probability of the intersection of the configurations consistent with α AND β. If α′ ⊆ α and β′ ⊆ β, then α′ ∩ β′ ⊆ α ∩ β and consequently, hsx(α ∩ β) ≤ hsx(α′ ∩ β′). For a worked example, see Appendix C in S1 Appendix. Note that, while hsx(α ∩ β) > 0, it is not necessarily true that the associated partial entropy atoms are non-negative following the Möbius inversion.

From Eq 23 I can also construct an informative and misinformative probability mass exclusion formulation of iτsx, equivalent to the informative and misinformative components of isx. However, unlike in the single-target case, this formulation does not follow the partial ordering criteria that hsx does, and is also not strictly non-negative.

2.3.2 Interpreting Iτsx in the ΦID.

Our analysis of iτsx and the decomposition into hsx(α) + hsx(β) − hsx(α ∩ β) has, thus far, been general, and could apply to any multivariate mutual information i(x; y). There is no temporal dynamic assumed. When decomposing i(xτ; xt), the partial entropy terms can be understood as parcelling out the information contained in the instantaneous structure at time t − τ and time t, and the information in the dynamics: the transition from past to future. Each of the marginal partial entropy decompositions provides the entire, instantaneous structure at that moment. For example, when decomposing h(xτ) it is possible to extract all of the dependencies between the elements of xτ (i.e. the mutual informations, the conditional mutual informations, and so on [39, 41]), and likewise for xt. The sum, then, of hsx(ατ) + hsx(βt) quantifies the total amount of information that could be learned about the transition xτ → xt without making any reference to the temporally extended dynamics of the system (note that if time were reversed, the sum of the entropy terms would be the same). It is the total “static” structure.

The dynamic structure is encoded in the last term, hsx(ατ ∩ βt), which is the only term that incorporates information from the state-transition structure. This can help interpret those cases when iτsx(ατ → βt) < 0 bit. Negativity occurs when hsx(ατ) + hsx(βt) < hsx(ατ ∩ βt). In plain language, this occurs when there is more information in the structure of the transition xτ → xt than there is in the instantaneous structures at times t − τ and t.

3 Results

In this paper, I have proposed a novel function of multi-target redundancy to be used as the foundation of an integrated information decomposition [3, 8]. Based on the logic of information as exclusions of possible configurations [38], our proposed measure, Iτsx, generalizes the single-target redundancy measure first proposed by Makkeh et al., to enable the full decomposition of the excess entropy intrinsic to discrete dynamical processes. To demonstrate the measure in action, in the context of the ΦID, I will now explore some applications: the first three will be constructed systems designed to display markedly different dynamics (disintegrated, integrated, and heterogeneous) to illustrate how different “types” of integration can be revealed by the decomposition. I will then examine spiking data from dissociated cultures made from rat brain tissue to demonstrate the insights that can be gained from both the expected, and local, integrated information decompositions.

3.1 Synthetic systems

Each of the three synthetic systems is comprised of two binary elements that evolve through time according to different Markovian state-transition networks (visualized in Fig 3). Prior work on such simple Boolean networks has shown that the space of even very small systems has a surprisingly rich distribution of redundant, unique, and synergistic effective information atoms [18]. Despite the simplicity of the synthetic systems under study here, they showcase how Iτsx can reveal markedly different dynamic regimes. These systems were designed to show the two extreme behaviours of ΦWMS: the first system is totally dis-integrated and ΦWMS = 0 bit (as the future of the whole can be perfectly predicted from the independent parts). The second system is completely integrated: the sum of the excess entropy of the parts is 0 bit, while the whole has non-zero excess entropy. The third system is a heterogeneous combination of integrated and dis-integrated dynamics. Considering the limiting cases of ΦWMS can help build intuition about how the ΦID framework describes diverse dynamics. I hypothesized that the disintegrated system should, generally, not have much redundant temporal mutual information, as the elements are independent of each other, and there should be little information transfer between individual elements. Similarly, in the case of the integrated system, I expected low redundancy, and a high degree of synergy (as the future of the whole can only be partially predicted by knowing the past of the whole).

Fig 3. Transition probability matrices for simple Boolean network systems.

On the left is a disintegrated system, where E(X) = E(X1) + E(X2) = 2 bit (i.e. the whole is equal to the sum of the parts). In the middle is a highly integrated system, where E(X) = 1 bit and both E(Xi) = 0 bit (i.e. the whole is greater than the sum of its parts). On the right, a random system, combining a heterogeneous mixture of integrated and segregated dynamics.

3.1.1 Disintegrated system.

The first system, SD, is a “disintegrated” system, in that each of the two dynamic elements is disconnected from the other: both predict their own futures with total determinism (the pattern is an oscillation 1 → 0 → 1 → …); however, there is no integration. Consequently, the excess entropy of the whole is E(X) = 2 bit, and both individual excess entropies are each 1 bit: the “whole” is trivially reducible to the sum of its parts, since there’s no actual interaction between elements. For a visualization of the state-transition matrix, see Fig 3, left.
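The excess entropies of this system can be checked numerically. The following Python sketch is illustrative only and is not the code used for the analyses in this paper. It assumes a uniform distribution over past states and treats the lag-1 excess entropy as the mutual information between consecutive states; the helper names (disintegrated_joint, marginalize, temporal_mi) are hypothetical.

```python
from itertools import product
from math import log2

STATES = list(product([0, 1], repeat=2))  # joint states (x1, x2)


def disintegrated_joint():
    """p(past, future) for S_D, ASSUMING uniformly distributed past states."""
    joint = {}
    for past in STATES:
        # Each element flips deterministically, independently of the other.
        future = (1 - past[0], 1 - past[1])
        joint[(past, future)] = 1.0 / len(STATES)
    return joint


def marginalize(joint, idx):
    """Keep only the elements listed in idx, in both past and future."""
    out = {}
    for (past, future), p in joint.items():
        key = (tuple(past[i] for i in idx), tuple(future[i] for i in idx))
        out[key] = out.get(key, 0.0) + p
    return out


def temporal_mi(joint):
    """Mutual information (in bits) between past and future."""
    p_past, p_future = {}, {}
    for (past, future), p in joint.items():
        p_past[past] = p_past.get(past, 0.0) + p
        p_future[future] = p_future.get(future, 0.0) + p
    return sum(p * log2(p / (p_past[a] * p_future[b]))
               for (a, b), p in joint.items() if p > 0)


joint = disintegrated_joint()
e_whole = temporal_mi(joint)
e_parts = [temporal_mi(marginalize(joint, (i,))) for i in range(2)]
print(e_whole, e_parts)
```

Under these assumptions the whole system carries 2 bit and each element carries 1 bit, so the whole is exactly the sum of its parts.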

Decomposing the excess entropy using Iτsx reveals several interesting relationships (for the full decomposition, see Table 1). As expected, the strongest information atoms are the element-wise information “storage” atoms: {1} → {1} and {2} → {2}. This is consistent with our intuition that the future of each element is best predicted by its own immediate past. The fact that the unique information storage atoms are also the largest single atoms is consistent with the idea that the most informative dynamic about the whole system is the behaviour of the individual nodes considered individually. The informative interactions between the unique information and the synergistic information are also consistent with our intuitions about the disintegrated system. For instance, the informative value of the “upward-causation” atom {1} → {12} reflects the fact that knowing the past state of X1 also informs on the state of the whole Xt in the future: knowing that X1 was 0 at time t − τ rules out any configuration of Xt in which X1 = 0 (since the values of each Xi oscillate). Likewise for the “downward-causation” atoms such as {12} → {1}: knowing the state of the whole at time t − τ constrains the individual parts at time t (albeit to a lesser extent than they constrain themselves).

Table 1. Table of integrated information atoms.

For each of the three Boolean systems, I performed the full integrated information decomposition, resulting in sixteen distinct ΦI atoms. The distinct global dynamics are reflected in the varying distributions of informative and misinformative information modes. I note that time-reversible systems (i.e. those where the probability of transitioning from state i to state j is the same as the reverse transition) have a more constrained and symmetrical structure than the heterogeneous system. Whether this is a universal fact about reversible versus irreversible dynamics remains an intriguing topic for future research.

The non-zero value of the double redundancy atom {1}{2} → {1}{2} is unexpected, although not inexplicable. Suppose that, at time t − τ, X1 = 0 OR X2 = 0. Since both Xi oscillate 0 → 1 → 0 → 1…, learning the state of either past variable is enough to rule out one possible joint future: Xt = (0, 0). Consequently, the probability of the union of all possible futures consistent with X1 = 1 OR X2 = 1 at time t increases.
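This exclusion argument can be made concrete with a short Python sketch. It is a hedged illustration rather than the implementation used in this paper: it assumes uniformly distributed past states, and it assumes the local double redundancy takes the form log2[P(α ∩ β) / (P(α) P(β))], which is how I read the hsx(α) + hsx(β) − hsx(α ∩ β) decomposition of Section 2.3.1. The function names are hypothetical.

```python
from itertools import product
from math import log2

STATES = list(product([0, 1], repeat=2))  # joint states (x1, x2)

# p(past, future) for S_D: uniform pasts (ASSUMED), both elements flip.
joint = {(s, (1 - s[0], 1 - s[1])): 0.25 for s in STATES}


def union_event(observed):
    """Joint states consistent with x1 = observed[0] OR x2 = observed[1]."""
    return {s for s in STATES
            if s[0] == observed[0] or s[1] == observed[1]}


def local_double_redundancy(past, future):
    """ASSUMED form: log2 of P(alpha AND beta) / (P(alpha) * P(beta))."""
    alpha = union_event(past)    # pasts not excluded by the observation
    beta = union_event(future)   # futures not excluded by the observation
    p_alpha = sum(p for (a, _), p in joint.items() if a in alpha)
    p_beta = sum(p for (_, b), p in joint.items() if b in beta)
    p_both = sum(p for (a, b), p in joint.items()
                 if a in alpha and b in beta)
    return log2(p_both / (p_alpha * p_beta))


# One local value per realizable transition of the system.
values = [local_double_redundancy(a, b) for (a, b) in joint]
value = values[0]
print(value)
```

In this sketch P(α) = P(β) = 3/4, but every future consistent with α is also consistent with β, so P(α ∩ β) = 3/4 as well and the local value is log2(4/3) for every transition. The positive sign follows from the exclusion logic alone; the magnitude depends on the assumed form and should be compared against Table 1.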

The final set of atoms worth exploring are the negative-valued information transfer terms (such as {1} → {2}). The temporal mutual information between the past of one element and the future of the other is 0 bit, but the partial information atom is less than zero. Why? The answer is that computing the information transfer atom requires subtracting the sum of all the atoms that precede it on the lattice from the temporal mutual information. In this case, the only non-zero preceding atom is the double redundancy atom {1}{2} → {1}{2}, which, when subtracted off, produces a negative value (since the temporal mutual information is 0 bit).

3.1.2 Integrated system.

The second system, SG, is an “integrated” system, in that the whole system has 1 bit of excess entropy, but both elements have individual excess entropies of 0 bit. This is accomplished using a parity check function: at every time step, the parity of the system is preserved, but the individual assignments are done randomly. For example, if the past state has odd parity (e.g. (0, 1)), then the future state could equal (0, 1) or (1, 0) with equal probability (but never (0, 0) or (1, 1)). For a visualization of the state-transition matrix, see Fig 3, center.

The first interesting finding is that, as with SD, the double-redundancy atom is unexpectedly positive. This occurs because, as before, learning the past state of X1 OR X2 is sufficient to inform on the future. Suppose that X1 = 0 OR X2 = 0 at time t − τ. There are two configurations with odd parity consistent with those conditions ((0,1) and (1,0)), and only one configuration with even parity ((0,0)). This means that the future state will also be more likely to have an odd parity than an even one.

Due to the synergistic nature of the parity-check function, it is not possible to directly project from the parity of the joint state back down onto the states of the individual elements, which explains the negative values of the various information copy and erase atoms (such as {1} → {1}{2} and {1}{2} → {1}). The non-negative values of the various unique information storage and transfer atoms ({1} → {1} and {1} → {2}) occur because the mutual informations between any pair of elements are all 0 bit (for example, between the past of X1 and its own future, or between the past of X1 and the future of X2), while the sum of all atoms lower down on the lattice is negative. For example, consider atom {1} → {1}. The excess entropy E(X1) = 0 bit, but {1}{2} → {1}{2} + {1}{2} → {1} + {1} → {1}{2} = −0.152 bit, so the value of {1} → {1} works out to 0 − (−0.152) = 0.152 bit. The result can admittedly be difficult to interpret, although negative atoms in classic, single-target PIDs are also relatively widespread (see [10, 33, 34]). One possible interpretation, explored more in Section 3.1.4, is that these values reflect how the context provided by the dynamics of the whole system can influence our interpretation of the dynamics of the individual parts. For example, when considered alone, there is no predictive information about the future of X1 in its own past; however, X1’s evolution through time is not an autonomous process, but occurs in the context of X2’s dynamics. Consequently, the kinds of inferences one can make about X1 when considering the other elements of the system may be different from the kinds of inferences that could be made if X1 were considered alone.
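The arithmetic for the {1} → {1} atom can be reproduced with a small Python sketch. As before, this is an illustration under stated assumptions and not the paper's implementation: past states are assumed uniform, the redundancy between a set of past sources and a set of future sources is assumed to be the expected value of log2[P(α ∩ β) / (P(α) P(β))], and each atom is obtained by subtracting the atoms below it on the lattice. All function names are hypothetical.

```python
from itertools import product
from math import log2

STATES = list(product([0, 1], repeat=2))  # joint states (x1, x2)

# p(past, future) for S_G: uniform pasts (ASSUMED); the future preserves the
# parity of the past and is otherwise chosen uniformly at random.
joint = {(a, b): 0.25 * 0.5
         for a in STATES for b in STATES
         if sum(a) % 2 == sum(b) % 2}


def event(observed, sources):
    """Joint states matching `observed` on at least one index in `sources`."""
    return {s for s in STATES if any(s[i] == observed[i] for i in sources)}


def local_redundancy(past, future, past_sources, future_sources):
    """ASSUMED form: log2 of P(alpha AND beta) / (P(alpha) * P(beta))."""
    alpha = event(past, past_sources)
    beta = event(future, future_sources)
    p_alpha = sum(p for (a, _), p in joint.items() if a in alpha)
    p_beta = sum(p for (_, b), p in joint.items() if b in beta)
    p_both = sum(p for (a, b), p in joint.items()
                 if a in alpha and b in beta)
    return log2(p_both / (p_alpha * p_beta))


def expected(past_sources, future_sources):
    """Average of the local values over all realizable transitions."""
    return sum(p * local_redundancy(a, b, past_sources, future_sources)
               for (a, b), p in joint.items())


# Atoms via subtraction of everything lower on the lattice.
red_red = expected((0, 1), (0, 1))                    # {1}{2} -> {1}{2}
red_to_one = expected((0, 1), (0,)) - red_red         # {1}{2} -> {1}
one_to_red = expected((0,), (0, 1)) - red_red         # {1} -> {1}{2}
e_x1 = expected((0,), (0,))                           # E(X1)
storage = e_x1 - (red_red + red_to_one + one_to_red)  # {1} -> {1}
print(round(red_red, 3), round(storage, 3))
```

Under these assumptions the double redundancy is log2(10/9) ≈ 0.152 bit, the three atoms below {1} → {1} sum to −0.152 bit, and the storage atom is 0.152 bit, in agreement with the values quoted above.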

The higher-order, synergistic atoms show the same effect: since the redundancy-to-synergy atoms ({1}{2} → {12} and {12} → {1}{2}) are both negative, the upward and downward-causation atoms are positive, despite the fact that the relevant mutual informations are all zero bit.

3.1.3 Heterogeneous system.

The final system was one with heterogeneous transitions, with probabilities drawn from a Gaussian distribution (for details, see Varley & Hoel [18]). In contrast to the prior two systems, this system, SH does not have an a priori fixed “type” of dynamic and was expected to display multiple types of information conversion. From the outset, I anticipated evidence of synergistic dynamics, as the excess entropy of the whole system was 0.422 bit, while each of the two elements had individual temporal mutual informations of 0.017 bit and 0.001 bit respectively, indicating a dynamic where the whole is much more predictive than the sum of its parts. For a visualization of the state-transition matrix, see Fig 3, right.

Consistent with expectations, SH did not have the same regularity of information dynamics displayed by SD and SG: for example, the atom {1}{2} → {1} was negative, indicating a misinformative relationship, while {1}{2} → {2} was positive and had a greater absolute value. Similarly, the conversions from redundant to synergistic information and vice versa have opposite signs, suggesting that this system simultaneously displays informative “downward causation”, but misinformative “upward causation”. In totality, there were more informative integrated information atoms than misinformative ones (a ratio of 11 to 5), showing that, despite the overall strongly synergistic nature of the system, unique information transfer and redundant information dynamics co-existed. This is consistent with previous work that found that these kinds of modified-Gaussian systems can display a wide range of information dynamics, at multiple scales [18].

3.1.4 Interpreting negative local ΦI atoms.

The three systems described above provide concrete toy models which can be used to build intuition about the phenomenon of negative ΦI values. Consider the heterogeneous system SH: specifically, the information that S2 at time t − τ communicates to itself at time t. This value, sometimes called active information storage [20], is an expected mutual information and must be non-negative: in this case, it is 0.001 bit. If an observer were just observing the dynamics of S2 (and ignoring S1 entirely), their uncertainty about the future of S2 would be reduced by observing its past. Despite this, the “stored” partial information atom {2} → {2} is negative. How can this occur?

Our interpretation is that this mismatch is explained by the fact that the evolution of S2 occurs in the context of all other elements in the system. Analysing S2 on its own can be misleading, because its dynamics are informed by the states of the other elements. In this case, the lion’s share of the information is information that is initially present in both elements and then erased from S1: {1}{2} → {2}. So, a significant amount of the information that one observes in the active information storage of S2 is not specific to S2 (at least not at first), but rather emerges from the interaction with S1.

In the case of these toy systems, there is no “mechanism” to be explored; however, this kind of distinction may be of value when analysing real-world systems. For example, a scientist studying the activity of neurons may observe a non-zero active information storage and propose to connect that to biophysical processes such as refractory periods [42]. The ΦID analysis, however, shows that this information may not actually reflect the specific dynamics of a single neuron, but rather some contextual interaction between two, and that the actual memory capacity of the neuron is altogether different (in this case, misinformative rather than informative). This example is, at present, admittedly speculative; considerable work remains to understand how the outputs of the ΦID algorithm map onto complex, real-world dynamical processes.

3.2 Analysis of dissociated neural culture data

To demonstrate how decomposition of the excess entropy using Iτsx might be applied to empirical data, I analysed 31 dissociated cultures of rat hippocampal cortex. These preparations were made by resecting slices of embryonic rat cortex, and then culturing them to produce networks of living neuronal tissue [11]. After preparation of the cultures and a period of maturation, spontaneous spiking activity was then recorded on a 60 electrode array and spike-sorted to produce a time series of spikes for each putative neuron (for details, see the original manuscript presenting these data [12] and the Materials & methods Section).

Dissociated and organotypic cultures have been a highly productive model system for research into information dynamics and “computation” in biological systems: for example, see studies of the relationship between criticality and information-theoretic complexity [12, 43], network structure and synergy [44–46], changes to computational structure during maturation and development [47–49], and the topology of effective networks [50–52]. In many respects, they are a natural fit for these kinds of information theoretic analyses: neuronal activity is naturally discrete (in the form of action potentials which can be represented with binary states), the neuron is a well-defined “unit” (a single cell), and the communication channels between units are well-understood at the mechanistic level (neurons communicate over synapses via the release of neurotransmitters), as are the general causal effects of interaction (neurons can be inhibitory or excitatory, a relationship easily expressible in terms of Bayesian prior and posterior probabilities [53]).

In this study, I demonstrate the utility of Iτsx as both an expected and localizable measure of information-sharing by examining the pairwise relationships between neurons. Our particular focus is on avalanches of high-firing activity, which are typical of neural systems and systems poised near a critical phase transition in general. While the question of criticality in the brain is a complex one (for review, see [54, 55], and for a dissenting view, see [56]), it is an empirical fact that spontaneous activity in cortical networks displays avalanche dynamics of widely varying lengths (typically modeled as following a power-law, or other heavy-tailed distribution [57]). While the existence of such avalanches is extremely well-documented, and their genesis the subject of intensive modeling work, it is still unclear what, if any, role they play in cortical computations. Varley et al. hypothesized that they may play an integrative role after finding that loss of consciousness via the anaesthetic propofol caused pronounced collapse of large-scale avalanche structure [58]; however, such hypotheses remain highly speculative in the absence of a formal framework for understanding localizable computation. I propose that the ΦID framework, coupled with the intrinsically local nature of Iτsx, solves that problem.

3.2.1 Distributions of average ΦID atoms.

For each of the 31 cultures, I calculated the lag-1 excess entropy for every pair of nodes in the network (restricting our analysis to consecutive bins within avalanches, as in [49]). If the expected excess entropy was significant at α = 10⁻⁶ (Bonferroni corrected), I went on to do the full integrated information decomposition. As a result, for every culture, across all pairs of nodes with significant excess entropy, I can compute sixteen distinct pairwise “integrated information matrices” (for visualization see Fig 4). For these expected values, I normalized each one by dividing it by its associated excess entropy to control for the variability in the overall amount of temporal information.

Fig 4. Visualized normalized ΦI matrices for a single culture.

For a single culture (in this case with approximately one hundred individual neurons), one can construct sixteen different pairwise matrices, each one corresponding to a ΦI atom. This contrasts with more well-known measures of functional and effective connectivity, which produce one matrix per system, reflecting a single “kind” of statistical relationship (be it functional connectivity, effective connectivity, etc). Integrated information decomposition, on the other hand, provides multiple “kinds” of relationship at once, allowing a far more complete picture of computational dynamics. Here, the value of each atom is normalized by the total excess entropy.

To explore the overall distribution of normalized information atoms, I aggregated over all cultures to create histograms of the various ΦID components (Fig 5). I found that the element-level information storage atoms ({x} → {x}) had the overall highest average normalized value (0.417 ± 0.422), followed by the element-level information transfer atoms ({x} → {y}, 0.097 ± 0.195). These results are consistent with our initial expectations: individual neurons are known to have a strong individual temporal dependence ([42, 59], likely reflecting the refractory period following an action potential). Similarly, the high element-wise information transfer is consistent with the basic mode of communication between neurons being pairwise synaptic signaling. The other modes of information conversion, however, remain more mysterious: for example, the information copy and information erasure atoms ({x} → {1}{2} and {1}{2} → {x} respectively) both had values of 0.011 ± 0.0325, which is lower than the transfer atoms, but by less than an order of magnitude. Exactly what kind of biological process these modes correspond to is a promising area of future study. While every atom had particular pairs of neurons for which it was negative, at the aggregate level, every atom was, on average, greater than zero, including the higher-order measures, such as the double synergy ({12} → {12}). These results show that spontaneous, on-going avalanche dynamics have a significant element of consistently synergistic activity. For a complete set of correlations between all the atoms, see S1 Fig. I can also see that the information transfer atoms overall generally have the highest absolute values.

Fig 5. Histograms of the normalized ΦI atoms across all cultures.

The distributions of the normalized atoms show marked differences, depending on the particular kinds of information conversion occurring. For example, the element-wise information storage atom has the highest mean value and is considerably biased towards informative, positive relationships, while other measures display a more symmetric balance of informative and misinformative atoms (although all atoms displayed a bias towards informative relationships).

To compare the results of the integrated information decomposition to a more established measure of systemic complexity, I compared the distribution of normalized ΦID atoms to a measure of integrated information first proposed by Balduzzi & Tononi based on the difference between the total excess entropy and the sum of the two marginal excess entropies [60]:

\Phi_{WMS}(X) = E(X) - \left[ E(X_1) + E(X_2) \right] \qquad (24)

Typically referred to as ΦWMS (WMS indicating “whole-minus-sum”), it is a useful measure of non-trivial systemic integration (see [37] for a recent exploration of ΦWMS in a ΦID context). ΦWMS has obvious parallels with the simple toy example of two predictors and a single target introduced in Section 1.1.1, with similar interpretations of the resulting sign (i.e. if ΦWMS > 0, then the system has synergistic dynamics only accessible when considering the whole as opposed to the independent parts). As with the histograms, I aggregated over all significant pairs of neurons in all the cultures, and correlated each one’s ΦWMS against each of the normalized ΦID atoms. For visualization, see Fig 6.
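As a minimal illustration of the whole-minus-sum calculation, the Python sketch below applies it to the excess entropies reported for the three synthetic systems in Section 3.1. The function name phi_wms and the dictionary layout are hypothetical; only the numerical inputs come from the text.

```python
def phi_wms(e_whole, e_parts):
    """Whole-minus-sum integrated information, in bits."""
    return e_whole - sum(e_parts)


# (excess entropy of the whole, excess entropies of the two elements),
# taken from the descriptions of the synthetic systems in Section 3.1.
systems = {
    "S_D": (2.0, [1.0, 1.0]),        # disintegrated
    "S_G": (1.0, [0.0, 0.0]),        # integrated (parity check)
    "S_H": (0.422, [0.017, 0.001]),  # heterogeneous
}

phi = {name: phi_wms(whole, parts) for name, (whole, parts) in systems.items()}
for name, value in phi.items():
    print(name, round(value, 3))
```

The disintegrated system gives ΦWMS = 0 bit, the integrated system gives 1 bit, and the heterogeneous system gives 0.404 bit, so a positive sign marks information that is only accessible when the whole is considered.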

Fig 6. ΦWMS vs. normalized ΦI atoms.

The different normalized ΦI atoms have varying degrees of correlation with the ΦWMS measure of integrated information [60]. While most are generally positively correlated, the element-level information-storage atom is a dramatic outlier, with a highly significant negative correlation of -0.8. I believe this occurs because a high degree of information storage in single elements means that the future of the whole is mostly predictable from the individual parts. The more individual elements disclose about their own future, the less “integrated” information in the system.

Spearman correlation found that there was a very strong, negative correlation between ΦWMS and the normalized information storage atoms ({x} → {x}, ρ = −0.8, p < 10⁻⁶, Bonferroni corrected). This is unsurprising, as information storage contributes to the marginal, within-element predictive information and contributes nothing to the higher-order interactions that comprise “integrated” information (consider the “disintegrated” toy model described above in Section 3.1). All other normalized ΦI atoms were positively correlated with ΦWMS. The highest correlation was with the element-wise information transfer atoms ({x} → {y}, ρ = 0.57, p < 10⁻⁶, Bonferroni corrected). Since inter-element information transfer is a core element of systemic “integration”, and considering the overall high prevalence of bivariate transfer in the data (see Fig 5), this result is unsurprising. As expected, the ΦI atoms containing higher-order synergies were all positively correlated with ΦWMS, with the double-synergy term having one of the highest overall correlations (ρ = 0.41, p < 10⁻⁶, Bonferroni corrected). This is consistent with the interpretation that ΦWMS is an overall measure of total systemic integration.

3.2.2 Local ΦID analysis.

In addition to the average values of the integrated information atoms, the Iτsx measure is localizable, allowing us to do a full, sixteen-atom decomposition for every moment in time, for every pair of neurons with significant excess entropy. I can leverage this property to perform a detailed analysis of the avalanches as temporally-extended objects qua themselves (rather than treating them as single units sampled from some heavy-tailed distribution). Across all pairs of neurons in all 31 cultures, I aggregated all avalanches of length k > 4, and if I observed at least 50 instances of avalanches of length k, I averaged them to create an “average profile.” Prior work with dissociated culture data has shown that avalanche profiles tend to be scaled versions of one another [12] (and references therein), showing a characteristic growth and then collapse of activity over the duration (for a visualization of the average avalanche profiles, see Fig 7, Upper Left). For every moment in the avalanches, I computed the local excess entropy, and then performed the ΦID using the local iτsx to explore how the computational dynamics vary over the course of the avalanche. For a visualization of the profiles of the avalanches, the excess entropy, and all ΦI atoms, see Fig 7. Local ΦI atoms were not normalized, as the local excess entropy is a signed value, complicating the interpretation of a normalized value.

Fig 7. Average avalanche profile plots for spiking activity.

Each curve is the average profile for avalanches of duration k > 4 bins, if at least fifty avalanches of that duration were observed across all thirty-one cultures. On the upper-leftmost square, one can see the average profile for raw spiking activity (copper colormap). In the uppermost center plot, one can see the average profile for the local excess entropy (blue-green colormap), and for the rest of the plots, the remaining ΦI atoms (violet-orange colormap). I can observe that different atoms have distinct characteristic profiles, some of which resemble the excess entropy more than others.

Upon visual inspection, it is clear that the various ΦI atoms have distinct profiles: for example, the profiles of the element-wise information storage and transfer atoms are characteristically similar to the excess entropy profiles, with rapid increases to a peak followed by a heavy tail. In contrast, the double-synergy profile has a noisier shape, appearing to drop towards misinformation at the end of the avalanche. To explore these profile differences in more detail, I directly compared the spiking activity profiles to their associated informational profiles. I began by computing the cumulative profile for each avalanche: in the cumulative avalanche, every moment is given as the sum of all previous moments, including the current one (analogous to a cumulative probability distribution). I then scaled each distribution by dividing it by the final, cumulative value, forcing all cumulative avalanches to terminate at 1. Finally, I filtered outlying cumulative avalanches that had unusually extreme deviations, under the assumption that they were contaminated by noise. By plotting the cumulative information atom avalanche distributions against the cumulative spiking avalanche distributions, it is possible to assess how the growth and collapse of information atoms differs from the change in spiking dynamics (see Fig 8). If the information atoms track the spiking activity perfectly, then the resulting curves will fall on the y = x line. Deviations from the line of symmetry indicate a faster or slower accumulation of information than would be expected if it were perfectly correlated with spiking activity.
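The cumulative-profile construction described above can be sketched in a few lines of Python. This is an illustration and not the analysis code: the two example profiles are made-up numbers chosen to show a front-loaded information profile, the outlier-filtering step is omitted, and the function name normalized_cumulative is hypothetical.

```python
from itertools import accumulate


def normalized_cumulative(profile):
    """Running sum of a profile, scaled so that the final value is 1."""
    running = list(accumulate(profile))
    total = running[-1]
    if total == 0:
        # Local information values are signed, so the total can vanish.
        raise ValueError("cumulative total is zero; profile cannot be scaled")
    return [value / total for value in running]


# HYPOTHETICAL average profiles for one avalanche duration (6 bins).
spikes = [2, 6, 9, 5, 2, 1]                  # spike counts per bin
info = [0.30, 0.45, 0.15, 0.06, 0.03, 0.01]  # local information per bin (bit)

cum_spikes = normalized_cumulative(spikes)
cum_info = normalized_cumulative(info)

# Positive gaps mean information accumulates faster than spikes,
# i.e. the curve lies above the y = x line of symmetry.
gap = [i - s for i, s in zip(cum_info, cum_spikes)]
print([round(g, 2) for g in gap])
```

Plotting cum_info against cum_spikes gives one curve of the kind shown in Fig 8; a curve on the y = x line would correspond to a gap of zero in every bin.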

Fig 8. Cumulative information avalanche profiles plotted against cumulative spiking avalanche profiles.

These plots allow us to assess how the density of information atoms varies over the duration of the avalanche relative to the spiking activity that defines the avalanche. The black dotted line indicates the y = x line of symmetry: if the information density of an atom hugs that line, then the profiles of both the information and the spiking activity are the same. In many cases, the information profiles appear to dramatically diverge from the line of symmetry, indicating that avalanches are “informationally front-loaded”, at least with respect to certain types of information integration.

Visual inspection of the excess entropy cumulative profile reveals that avalanches are, broadly speaking, informationally “front-heavy”: the local excess entropy climbs much faster than spikes accumulate (as seen by the curve climbing above the y = x line), and has almost entirely “saturated” before halfway through the avalanche. When considering avalanches of differing lengths, this front-heaviness appears to become more pronounced for larger avalanches (for small avalanches of length between 4 and 10, the normalized cumulative distribution curves hug the line of symmetry much more closely). This suggests that, while all spiking avalanche profiles may be roughly scaled versions of each other, that scaling is not universal when it comes to information content: larger avalanches have different information profiles than smaller ones.

The pattern displayed by the cumulative excess entropy profile is broadly mirrored by the individual ΦI atoms, although there is considerable variation between them. For example, the synergy-to-redundancy atom {12} → {1}{2} (and its mirror {1}{2} → {12}) both hug the line of symmetry much more closely. In contrast, the cumulative double redundancy profiles and the cumulative information storage profiles track the cumulative excess entropy much more closely. Interestingly, the cumulative information copy and erasure profiles ({x} → {1}{2} and {1}{2} → {x}) both achieve a maximum value before the end of the avalanche and then drop down, indicating a transition from informative to misinformative dynamics towards the end of the activity period. The cumulative double-synergy profile shows one of the most intriguing patterns: for large avalanches, it appears to have an S-shaped profile, initially climbing rapidly during the avalanche, before dropping across the line of symmetry. The significance of such a dynamic is unclear, and this is a finding well worth revisiting and replicating in a future data set.

Another interesting type of variability between atoms is how the profile changes with avalanche duration. In the case of cumulative excess entropy, cumulative double-redundancy, and cumulative information storage, small avalanches reliably hug the line of symmetry and it is the larger avalanches that display interesting deviations. However, this is not the only pattern: for example, the “downward causation” atom ({12} → {x}) and the information erasure atoms both appear to display a kind of biphasic pattern: smaller avalanches (indicated by violet in Fig 8) run reliably below the line of symmetry, while large avalanches (indicated in orange) run above it.

From these results, it is clear that the ΦID framework, coupled with a localizable measure such as Iτsx can provide a rich, novel approach to understanding ongoing neural activity and reveal patterns never before observed. For the purposes of this paper, I restricted myself largely to qualitative analysis of local integrated information dynamics: the results presented here will require ample replication and much deeper study to determine their significance.

4 Discussion

In this work, I have presented a novel information-theoretic measure, Iτsx, a generalization of the classic Shannon mutual information that quantifies the redundant information shared between multiple sources and multiple targets. Iτsx is motivated by the recently proposed Integrated Information Decomposition [3, 8], which generalizes the classic single-target Partial Information Decomposition [6, 7] to sets of multiple interacting sources and targets. Like all information decompositions, the ΦID is peculiar in that, while it reveals the structure of multivariate information, it lacks a crucial piece required to calculate numerical values from data. This is solved by providing Iτsx as a redundancy function, with which the double redundancy lattice can be solved.

Here, the ΦID framework is used to decompose the excess entropy [1], which quantifies the total amount of statistical dependencies that constrain a system’s evolution from past to future. Prior work [18] on using PID to decompose the excess entropy could reveal how the past states of individual components (and ensembles of components) constrain the future of the whole system, but provided no finer detail. Using the ΦID, it is possible to understand how elements constrain their own futures, the futures of other elements, groups of elements, or the whole system in much finer resolution. To demonstrate the utility of the Iτsx measure, I first examined three small, completely specified toy models (each with its own enforced type of dynamic: integrated, disintegrated, or a mixture of the two) before moving on to the empirical data recorded from dissociated cultures of rat cortex. I showed that both the average and local versions of Iτsx revealed rich information-dynamic structures in the data, including how different kinds of “neural computation” rise and fall as part of the bursty dynamics intrinsic to the nervous system.

A significant benefit of the ΦID framework is that it allows us to generalize different “kinds” of integration in a complex system such as the brain. Historically, information-theoretic approaches to integration have focused on single measures, such as integrated information theory’s eponymous measure [60]. The information decomposition framework, however, reveals a multitude of different ways that groups of neurons compute their next state. Recent, promising work using fMRI data has started to relate various ΦI atoms (particularly the synergistic atoms) to macro-scale brain dynamics [24, 26], as well as to the subcritical, critical, and supercritical dynamical regimes of various dynamical systems [37]. Given the wealth of data produced by modern neural recording methods, I am optimistic that there is a very wide world of possible applications of this framework.

While I have focused on the ΦID framework as a means of decomposing the excess entropy of ongoing, spontaneous neural dynamics in dissociated cultures, in principle the framework could apply to any data set with multiple, interacting predictor and predicted variables: the temporal dimension is not required. This opens up a wider range of data-analysis applications than is accessible to the classic PID. For example, Varley & Kaminski recently used the PID to assess how varying social identities (such as race and sex) jointly disclose information on single outcomes (such as income or health status) [61]. However, outcomes themselves are not independent and may contain interesting higher-order correlations among themselves. For example, how do the identities race and sex disclose information about income and health outcomes collectively? Generalizing to a ΦID framework may reveal many meaningful dependencies within social data, as well as in many other fields where complex systems are studied.

4.1 Limitations

As currently formulated, the Iτsx function is only well-defined for discrete random variables, a feature that it inherits from the original Isx measure [10]. Continuous generalization of Isx remains an area of active research [36], and it is assumed that a successful algorithm for Isx will also work for Iτsx. As it stands, the restriction to discrete random variables limits applicability. Prior work applying PID and ΦID to naturally continuous data such as fMRI or cardiac rhythms has been done using measures of redundancy that are well-defined for Gaussian distributions [24, 26, 37], although these measures have their own limitations, such as lacking an intuitive interpretation, being non-localizable, or requiring arbitrary thresholds or optimizations.

Even in the event that a successful generalization of Iτsx is achieved, the PID and ΦID frameworks struggle to scale gracefully for all but the smallest systems. In the case of the PID, the number of atoms in the lattice grows with the sequence of Dedekind numbers [7]: for a system with k elements, the associated lattice has D(k) − 2 atoms. Given how fast the Dedekind sequence grows, a complete decomposition of almost any interesting natural system (which can have thousands or millions of components) is impossible. The ΦID framework fares even worse, since there is one temporal atom for every ordered pair of partial information atoms in the associated PID lattice. The size of the ΦID lattice then grows with the mind-boggling square of the PID lattice size, (D(k) − 2)²: with D(5) = 7,581, a five-element system has a ΦID lattice with 7,579² = 57,441,241 atoms. Approximate heuristics such as the ΦWMS measure or, more recently, the O-information [62–64] have been proposed as efficient, if imprecise, tools for recognizing the presence of higher-order dependencies in dynamical data; however, there is still room for refinement. Another possibility might be to explore temporal PID (rather than ΦID) using a redundancy function equipped with a target chain rule [13]. The target chain rule generalizes the chain rule of mutual information to the PID: a redundancy function i follows the rule if i(α; y1, y2) = i(α; y1) + i(α; y2|y1). Conveniently, the isx measure satisfies the target chain rule, so a future avenue of research is to compare the results of iτsx with the chained isx.
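To make the scaling argument concrete, the following minimal Python sketch tabulates the lattice sizes D(k) − 2 and (D(k) − 2)² from the known Dedekind numbers. The function names are hypothetical and introduced only for illustration; the brute-force counter (which enumerates monotone Boolean functions, one standard characterization of D(k)) is included only as a sanity check for very small k and is not part of the analysis pipeline described in this paper.

```python
# Known Dedekind numbers D(0)..D(6), hard-coded because they cannot be
# computed by naive enumeration beyond very small k.
DEDEKIND = [2, 3, 6, 20, 168, 7581, 7828354]


def pid_atoms(k):
    """Number of atoms in the PID lattice for k sources: D(k) - 2."""
    return DEDEKIND[k] - 2


def phiid_atoms(k):
    """Number of atoms in the PhiID lattice: one per ordered pair of PID atoms."""
    return pid_atoms(k) ** 2


def dedekind_brute_force(k):
    """Count monotone Boolean functions on k inputs (sanity check, k <= 4 only)."""
    n_inputs = 1 << k
    count = 0
    for f in range(1 << n_inputs):  # f encodes a truth table as a bitmask
        monotone = True
        for x in range(n_inputs):
            if not (f >> x) & 1:
                continue
            # f(x) = 1 must imply f(y) = 1 whenever y switches on one more input
            for i in range(k):
                y = x | (1 << i)
                if not (f >> y) & 1:
                    monotone = False
                    break
            if not monotone:
                break
        count += monotone
    return count


if __name__ == "__main__":
    for k in range(2, 7):
        print(f"k={k}: PID atoms={pid_atoms(k)}, PhiID atoms={phiid_atoms(k)}")
```

For two elements this gives 4 PID atoms and 16 ΦI atoms, which is why the two-neuron analyses here remain tractable while anything much larger does not.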

The final limitation is that the structure of the ΦID lattice, which allows single sources to appear multiple times (e.g. {12} → {x} and {12} → {12} both incorporate the {12} source), complicates the overall behavior of the redundancy functions. For example, the original Isx function has certain provable properties (such as the global non-negativity of its informative and misinformative components) that Iτsx cannot adopt, since the structure of the lattice is different. This strongly suggests that a return to the mathematical foundations of integrated information decomposition may be in order, and that new desiderata, which may diverge from the single-target case, need to be agreed on.

5 Conclusions

In this work, I provide a redundancy function, Iτsx, that can be used to decompose the total information that flows from the past to the future through the “channel” of a multi-element, dynamic system. This framework, when applied to neural data, reveals a rich repertoire of complex computational dynamics that can be temporally localized to the scale of individual moments in time. Based on the fundamental logic of information as exclusions of probability mass, Iτsx generalizes the classic Shannon entropy, and I anticipate that the work presented here will open new doors both in the specific field of neuroscience and in complex systems science more generally.

6 Materials & methods

6.1 Dissociated culture preparation & recording

The details of the general process for the preparation of dissociated cultures can be found in [11]. Here I summarize the specific methodologies detailed in [12], which first introduced this dataset. Pregnant Sprague-Dawley rats (Harlan Laboratories) on Day 18 of gestation were euthanized via CO2 and the embryos removed. Embryonic hippocampal tissue was resected and dissociated en masse before being plated on Multichannel Systems 60-electrode arrays (8 × 8, 200 μm electrode spacing, 30 μm electrode diameter). Spontaneous activity was recorded at 20,000 Hz for approximately 1 hour (for this analysis, all recordings longer than 60 minutes were truncated at that point). The resulting spikes were sorted with the wave_Clus algorithm [65] to infer individual neurons. Following spike sorting, the data were rebinned to 3 ms bins (approximating the average inter-spike interval for the set of all 31 recordings).

6.2 Mutual information calculation & significance testing

For every pair of neurons in a given culture, I calculated the mutual information between those two nodes at time t and the same two nodes at time t + 1: (25)

where H(⋅) is the classical Shannon entropy function. I tested each pair for significance against the analytic null distribution for discrete random variables with finite alphabets [66, 67], with α = 10⁻⁶, followed by Bonferroni correction. The analytic null estimator allows for very efficient estimation of p-values, requiring minimal compute time (and reducing the carbon costs associated with time-intensive high-performance computing). I used the implementation provided by JIDT [68], accessed via the IDTxl package [69] for its efficient Python interface.
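As an illustration of the quantity being tested, the following sketch computes a plug-in estimate of the time-lagged mutual information between the joint state of two binary spike trains at time t and at time t + 1, using one standard entropy identity, I(past; future) = H(past) + H(future) − H(past, future). It is a plain-Python stand-in written under stated assumptions, not the JIDT/IDTxl implementation used for the results reported here, and all function names are hypothetical. The helper `g_statistic` reflects the classical result that, under independence, 2N ln(2) times the mutual information in bits is asymptotically χ²-distributed with (|A_past| − 1)(|A_future| − 1) degrees of freedom; whether this matches the exact form of the analytic null in [66, 67] is an assumption on my part.

```python
from collections import Counter
from math import log, log2


def entropy(symbols):
    """Plug-in Shannon entropy (in bits) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def lagged_pair_mi(x1, x2, lag=1):
    """I(X1_t, X2_t ; X1_{t+lag}, X2_{t+lag}) in bits for two discrete time series."""
    past = list(zip(x1[:-lag], x2[:-lag]))      # joint state at time t
    future = list(zip(x1[lag:], x2[lag:]))      # joint state at time t + lag
    joint = list(zip(past, future))
    return entropy(past) + entropy(future) - entropy(joint)


def g_statistic(mi_bits, n_samples):
    """Test statistic 2 * N * ln(2) * I, to be compared against a chi-square null."""
    return 2.0 * n_samples * log(2) * mi_bits


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Toy example: neuron 1 alternates deterministically, neuron 2 is silent.
    x1 = [0, 1] * 50 + [0]
    x2 = [0] * 101
    mi = lagged_pair_mi(x1, x2)
    print(mi, g_statistic(mi, len(x1) - 1), bonferroni_alpha(1e-6, 10))
```

For two binary neurons the past and future alphabets each have four joint states, so the corresponding χ² null would have (4 − 1)(4 − 1) = 9 degrees of freedom under the assumption above.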

6.3 Constructing toy Boolean networks

For the integrated and disintegrated example systems, the transition probabilities were worked out by hand from first principles. The heterogeneous system was constructed based on the details provided in [18]. Briefly, a 4 × 4 transition probability matrix was initialized, and every entry Mij was drawn from a normal distribution with unit mean and unit variance. The absolute value was taken, and the out-going probabilities were normalized to define a discrete probability distribution.
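The construction of the heterogeneous system can be sketched as follows. This is a minimal illustration rather than the code used to generate the system analyzed here: the function name, the seed, and the convention that row i holds the out-going probabilities from state i (so that each row is normalized) are all assumptions made for the example.

```python
import random


def random_tpm(n_states=4, seed=0):
    """Build a random row-stochastic transition probability matrix.

    Each entry is drawn from a normal distribution with mean 1 and variance 1,
    its absolute value is taken, and each row (assumed here to hold the
    out-going probabilities of one state) is normalized to sum to 1.
    """
    rng = random.Random(seed)  # seeded only so that the example is reproducible
    tpm = []
    for _ in range(n_states):
        weights = [abs(rng.gauss(1.0, 1.0)) for _ in range(n_states)]
        total = sum(weights)
        tpm.append([w / total for w in weights])
    return tpm


if __name__ == "__main__":
    for row in random_tpm():
        print([round(p, 3) for p in row])
```

With two binary elements, the four rows and columns correspond to the joint states 00, 01, 10, and 11.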

6.4 Excluding noisy cumulative avalanche profiles

To remove information avalanche profiles excessively contaminated by noise, I excluded any cumulative avalanche profiles that had an excursion of more than 1 bit away from the y = x line or a total length greater than 2 bits. With these thresholds, I excluded on average 7.5 ± 8.18 avalanches for each ΦI atom. The full set of unfiltered cumulative avalanche plots is shown in S2 Fig.

Supporting information

S1 Fig. All pairwise correlations between normalized ΦI atoms.

Represented as two-dimensional log-probability density hexagonal histograms. The middle diagonal replicates the histograms seen in Fig 5. The correlations between various atoms are complex and not always trivial or linear.


S2 Fig. All cumulative avalanche plots without the filters.

Visual comparison with Fig 8 shows that the overall pattern can still be discerned despite the very noisy avalanches.


S1 Appendix. Proofs & worked examples.

Various proofs and a worked example of computing Iτsx.



I would like to thank Dr. John Beggs & Dr. Olaf Sporns for mentorship and feedback on this project, and Ms. Maria Pope for feedback as well. I would also like to thank Dr. Abolfazl Alipour and Mr. Leandro Fosque for providing the dissociated culture data.


  1. Crutchfield JP, Feldman DP. Regularities unseen, randomness observed: Levels of entropy convergence. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2003;13(1):25–54. pmid:12675408
  2. James RG, Ellison CJ, Crutchfield JP. Anatomy of a bit: Information in a time series observation. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2011;21(3):037109.
  3. Mediano PAM, Rosas FE, Luppi AI, Carhart-Harris RL, Bor D, Seth AK, et al. Towards an extended taxonomy of information dynamics via Integrated Information Decomposition. arXiv:2109.13186 [physics, q-bio]. 2021.
  4. Bedau MA. Downward Causation and the Autonomy of Weak Emergence. Principia. 2002;6(1):5–50.
  5. Galaaen AS. The Disturbing Matter of Downward Causation [PhD Thesis]. University of Oslo; 2006.
  6. Williams PL, Beer RD. Nonnegative Decomposition of Multivariate Information. arXiv:1004.2515 [math-ph, physics:physics, q-bio]. 2010.
  7. Gutknecht AJ, Wibral M, Makkeh A. Bits and pieces: understanding information decomposition from part-whole relationships and formal logic. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2021;477(2251):20210110. pmid:35197799
  8. Mediano PAM, Rosas F, Carhart-Harris RL, Seth AK, Barrett AB. Beyond integrated information: A taxonomy of information dynamics phenomena. arXiv:1909.02297 [physics, q-bio]. 2019.
  9. Rosas FE, Mediano PAM, Jensen HJ, Seth AK, Barrett AB, Carhart-Harris RL, et al. Reconciling emergences: An information-theoretic approach to identify causal emergence in multivariate data. PLOS Computational Biology. 2020;16(12):e1008289. pmid:33347467
  10. Makkeh A, Gutknecht AJ, Wibral M. Introducing a differentiable measure of pointwise shared information. Physical Review E. 2021;103(3):032149. pmid:33862718
  11. Hales CM, Rolston JD, Potter SM. How to culture, record and stimulate neuronal networks on micro-electrode arrays (MEAs). Journal of Visualized Experiments: JoVE. 2010;(39):2056. pmid:20517199
  12. Timme NM, Marshall NJ, Bennett N, Ripp M, Lautzenhiser E, Beggs JM. Criticality Maximizes Complexity in Neural Tissue. Frontiers in Physiology. 2016;7. pmid:27729870
  13. Bertschinger N, Rauh J, Olbrich E, Jost J. Shared Information—New Insights and Problems in Decomposing Information in Complex Systems. arXiv:1210.5902 [cs, math]. 2013; p. 251–269.
  14. Bertschinger N, Rauh J, Olbrich E, Jost J, Ay N. Quantifying Unique Information. Entropy. 2014;16(4):2161–2183.
  15. James RG, Emenheiser J, Crutchfield JP. Unique Information and Secret Key Agreement. Entropy. 2019;21(1):12.
  16. Quax R, Har-Shemesh O, Sloot PMA. Quantifying Synergistic Information Using Intermediate Stochastic Variables. Entropy. 2017;19(2):85.
  17. Rosas FE, Mediano PAM, Rassouli B, Barrett AB. An operational information decomposition via synergistic disclosure. Journal of Physics A: Mathematical and Theoretical. 2020;53(48):485001.
  18. Varley TF, Hoel E. Emergence as the conversion of information: a unifying theory. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2022;380(2227):20210150. pmid:35599561
  19. Mediano PAM, Rosas FE, Luppi AI, Jensen HJ, Seth AK, Barrett AB, et al. Greater than the parts: a review of the information decomposition approach to causal emergence. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2022;380(2227):20210246.
  20. Lizier JT. The Local Information Dynamics of Distributed Computation in Complex Systems. Springer Theses. Berlin, Heidelberg: Springer Berlin Heidelberg; 2013. Available from:
  21. Varley TF. Flickering Emergences: The Question of Locality in Information-Theoretic Approaches to Emergence. Entropy. 2023;25(1):54.
  22. Bossomaier T, Barnett L, Harré M, Lizier JT. An Introduction to Transfer Entropy: Information Flow in Complex Systems. Springer; 2016.
  23. Williams PL, Beer RD. Generalized Measures of Information Transfer. arXiv:1102.1507 [physics]. 2011.
  24. Luppi AI, Mediano PAM, Rosas FE, Allanson J, Pickard JD, Carhart-Harris RL, et al. A Synergistic Workspace for Human Consciousness Revealed by Integrated Information Decomposition. bioRxiv. 2020; p. 2020.11.25.398081.
  25. Luppi AI, Mediano PAM, Rosas FE, Harrison DJ, Carhart-Harris RL, Bor D, et al. What it is like to be a bit: an integrated information decomposition account of emergent mental phenomena. Neuroscience of Consciousness. 2021;2021(2). pmid:34804593
  26. Luppi AI, Mediano PAM, Rosas FE, Holland N, Fryer TD, O’Brien JT, et al. A synergistic core for human brain evolution and cognition. Nature Neuroscience. 2022; p. 1–12. pmid:35618951
  27. Harder M, Salge C, Polani D. Bivariate measure of redundant information. Physical Review E, Statistical, Nonlinear, and Soft Matter Physics. 2013;87(1):012130. pmid:23410306
  28. Griffith V, Chong EKP, James RG, Ellison CJ, Crutchfield JP. Intersection Information Based on Common Randomness. Entropy. 2014;16(4):1985–2000.
  29. Griffith V, Koch C. Quantifying synergistic mutual information. arXiv:1205.4265 [cs, math, q-bio]. 2014.
  30. Olbrich E, Bertschinger N, Rauh J. Information Decomposition and Synergy. Entropy. 2015;17(5):3501–3517.
  31. Barrett AB. Exploration of synergistic and redundant information sharing in static and dynamical Gaussian systems. Physical Review E. 2015;91(5):052802. pmid:26066207
  32. Goodwell AE, Kumar P. Temporal information partitioning: Characterizing synergy, uniqueness, and redundancy in interacting environmental variables. Water Resources Research. 2017;53(7):5920–5942.
  33. Ince RAA. Measuring Multivariate Redundant Information with Pointwise Common Change in Surprisal. Entropy. 2017;19(7):318.
  34. Finn C, Lizier JT. Pointwise Partial Information Decomposition Using the Specificity and Ambiguity Lattices. Entropy. 2018;20(4):297. pmid:33265388
  35. Ay N, Polani D, Virgo N. Information Decomposition based on Cooperative Game Theory. arXiv:1910.05979 [cs, math]. 2019.
  36. Schick-Poland K, Makkeh A, Gutknecht AJ, Wollstadt P, Sturm A, Wibral M. A partial information decomposition for discrete and continuous variables. arXiv:2106.12393 [cs, math]. 2021.
  37. Mediano PAM, Rosas FE, Farah JC, Shanahan M, Bor D, Barrett AB. Integrated information as a common signature of dynamical and information-processing complexity. Chaos: An Interdisciplinary Journal of Nonlinear Science. 2022;32(1):013115. pmid:35105139
  38. Finn C, Lizier JT. Probability Mass Exclusions and the Directed Components of Mutual Information. Entropy. 2018;20(11):826. pmid:33266550
  39. Ince RAA. The Partial Entropy Decomposition: Decomposing multivariate entropy and mutual information via pointwise common surprisal. arXiv:1702.01591 [cs, math, q-bio, stat]. 2017.
  40. Finn C, Lizier JT. Generalised Measures of Multivariate Information Content. Entropy. 2020;22(2):216. pmid:33285991
  41. Varley TF, Pope M, Puxeddu MG, Faskowitz J, Sporns O. Partial entropy decomposition reveals higher-order structures in human brain activity. Available from:
  42. Varley TF, Sporns O, Schaffelhofer S, Scherberger H, Dann B. Information-processing dynamics in neural networks of macaque cerebral cortex reflect cognitive state and behavior. Proceedings of the National Academy of Sciences. 2023;120(2):e2207677120. pmid:36603032
  43. Shew WL, Yang H, Yu S, Roy R, Plenz D. Information Capacity and Transmission Are Maximized in Balanced Cortical Networks with Neuronal Avalanches. Journal of Neuroscience. 2011;31(1):55–63. pmid:21209189
  44. Faber SP, Timme NM, Beggs JM, Newman EL. Computation is concentrated in rich clubs of local cortical networks. Network Neuroscience. 2018;3(2):1–21.
  45. Sherrill SP, Timme NM, Beggs JM, Newman EL. Correlated activity favors synergistic processing in local cortical networks in vitro at synaptically relevant timescales. Network Neuroscience. 2020;4(3):678–697. pmid:32885121
  46. Sherrill SP, Timme NM, Beggs JM, Newman EL. Partial information decomposition reveals that synergistic neural integration is greater downstream of recurrent information flow in organotypic cortical cultures. PLOS Computational Biology. 2021;17(7):e1009196. pmid:34252081
  47. Wibral M, Finn C, Wollstadt P, Lizier JT, Priesemann V. Quantifying Information Modification in Developing Neural Networks via Partial Information Decomposition. Entropy. 2017;19(9):494.
  48. Antonello PC, Varley TF, Beggs J, Porcionatto M, Sporns O, Faber J. Self-organization of in vitro neuronal assemblies drives to complex network topology. bioRxiv; 2021. Available from:
  49. Shorten DP, Priesemann V, Wibral M, Lizier JT. Early lock-in of structured and specialised information flows during neural development. eLife. 2022;11:e74651. pmid:35286256
  50. Ito S, Yeh FC, Hiolski E, Rydygier P, Gunning DE, Hottowy P, et al. Large-Scale, High-Resolution Multielectrode-Array Recording Depicts Functional Network Differences of Cortical and Hippocampal Cultures. PLOS ONE. 2014;9(8):e105324. pmid:25126851
  51. Nigam S, Shimono M, Ito S, Yeh FC, Timme N, Myroshnychenko M, et al. Rich-Club Organization in Effective Connectivity among Cortical Neurons. Journal of Neuroscience. 2016;36(3):670–684. pmid:26791200
  52. Timme N, Ito S, Myroshnychenko M, Yeh FC, Hiolski E, Hottowy P, et al. Multiplex Networks of Cortical and Hippocampal Neurons Revealed at Different Timescales. PLOS ONE. 2014;9(12):e115764. pmid:25536059
  53. Goetze F, Lai PY. Reconstructing positive and negative couplings in Ising spin networks by sorted local transfer entropy. Physical Review E. 2019;100(1):012121. pmid:31499780
  54. Beggs JM, Timme N. Being Critical of Criticality in the Brain. Frontiers in Physiology. 2012;3. pmid:22701101
  55. Beggs JM. The Critically Tuned Cortex. Neuron. 2019;104(4):623–624. pmid:31751539
  56. Destexhe A, Touboul JD. Is There Sufficient Evidence for Criticality in Cortical Systems? eNeuro. 2021;8(2). pmid:33811087
  57. Beggs JM, Plenz D. Neuronal Avalanches in Neocortical Circuits. Journal of Neuroscience. 2003;23(35):11167–11177. pmid:14657176
  58. Varley T, Sporns O, Puce A, Beggs J. Differential effects of propofol and ketamine on critical brain dynamics. PLOS Computational Biology. 2020;16(12):e1008418. pmid:33347455
  59. Wibral M, Lizier J, Vogler S, Priesemann V, Galuske R. Local active information storage as a tool to understand distributed neural information processing. Frontiers in Neuroinformatics. 2014;8. pmid:24501593
  60. Balduzzi D, Tononi G. Integrated information in discrete dynamical systems: motivation and theoretical framework. PLOS Computational Biology. 2008;4(6):e1000091. pmid:18551165
  61. Varley TF, Kaminski P. Untangling Synergistic Effects of Intersecting Social Identities with Partial Information Decomposition. Entropy. 2022;24(10):1387.
  62. Rosas F, Mediano PAM, Gastpar M, Jensen HJ. Quantifying High-order Interdependencies via Multivariate Extensions of the Mutual Information. Physical Review E. 2019;100(3):032305. pmid:31640038
  63. Stramaglia S, Scagliarini T, Daniels BC, Marinazzo D. Quantifying Dynamical High-Order Interdependencies From the O-Information: An Application to Neural Spiking Dynamics. Frontiers in Physiology. 2021;11. pmid:33519503
  64. Varley TF, Pope M, Faskowitz J, Sporns O. Multivariate Information Theory Uncovers Synergistic Subsystems of the Human Cerebral Cortex; 2022. Available from:
  65. Quiroga RQ, Nadasdy Z, Ben-Shaul Y. Unsupervised Spike Detection and Sorting with Wavelets and Superparamagnetic Clustering. Neural Computation. 2004;16(8):1661–1687. pmid:15228749
  66. Brillinger DR. Some data analyses using mutual information. Brazilian Journal of Probability and Statistics. 2004;18(2):163–182.
  67. Cheng PE, Liou JW, Liou M, Aston JA. Data information in contingency tables: a fallacy of hierarchical loglinear models. Journal of Data Science. 2006;4(4):387–398.
  68. Lizier JT. JIDT: An Information-Theoretic Toolkit for Studying the Dynamics of Complex Systems. Frontiers in Robotics and AI. 2014;1.
  69. Wollstadt P, Lizier JT, Vicente R, Finn C, Martinez-Zarzuela M, Mediano P, et al. IDTxl: The Information Dynamics Toolkit xl: a Python package for the efficient analysis of multivariate information dynamics in networks. Journal of Open Source Software. 2019;4(34):1081.