Practical Measures of Integrated Information for Time-Series Data

A recent measure of ‘integrated information’, ΦDM, quantifies the extent to which a system generates more information than the sum of its parts as it transitions between states, possibly reflecting levels of consciousness generated by neural systems. However, ΦDM is defined only for discrete Markov systems, which are unusual in biology; as a result, ΦDM can rarely be measured in practice. Here, we describe two new measures, ΦE and ΦAR, that overcome these limitations and are easy to apply to time-series data. We use simulations to demonstrate the in-practice applicability of our measures, and to explore their properties. Our results provide new opportunities for examining information integration in real and model systems and carry implications for relations between integrated information, consciousness, and other neurocognitive processes. However, our findings pose challenges for theories that ascribe physical meaning to the measured quantities.


Introduction
How can the complex dynamics exhibited by networks of interconnected elements best be measured? Answering this question promises to shed substantial new light on many complex systems, biological and non-biological. Neural systems in particular are characterized by richly interconnected elements exhibiting complex dynamics at multiple spatiotemporal scales [1], which have been associated with a variety of behavioral, cognitive, and phenomenal properties [2,3,4]. Characterizing dynamical complexity for such systems therefore presents a key challenge for developing new theoretical accounts [5] and for designing and evaluating new experiments. A common and attractive intuition is that dynamical complexity consists in the coexistence of differentiation (subsets of a system are dynamically distinct) and integration (the system as a whole exhibits coherence) in a system's dynamics. Applied to neural systems, this intuition may underpin notions of cognitive and behavioral flexibility. A system that is able to respond specifically and selectively to a broad range of stimuli, in an integrated way, may require conjoined functional integration and differentiation [6,7]. More ambitiously, the intuition may also characterize basic aspects of conscious experience [8]. At the phenomenal level, each conscious scene is composed of many different parts and is different from every other conscious scene ever experienced (differentiation), yet each conscious scene is experienced as a coherent whole (integration). Therefore, dynamical complexity in neural systems may actually account for (and not merely correlate with) fundamental aspects of consciousness [9].
Several measures now exist which operationalize the above intuition under different assumptions and with varying practical applicability [5]. In this paper, we critically evaluate 'integrated information' (Φ) [10,11], a candidate measure that has received significant recent attention, especially in the domain of consciousness science [12,13,14,15]. We present new versions of this measure that are both theoretically well-grounded and, in contrast to previous versions, practically applicable given time-series data. Φ has been proposed as a measure of the amount of information that is integrated by a system, where 'information' reflects the differentiated states of a system and 'integration' their global cohesion. According to the 'integrated information theory of consciousness' (IITC), this quantity is identical to the quantity of consciousness generated by the system; in other words, on the IITC, consciousness is integrated information [12,14]. This dramatic claim invites a close examination of the in-principle and in-practice properties of Φ.
A first version of Φ (which we call ΦC, 'Φ-capacity') was conceived as a measure of the capacity of a system to integrate information, and did not take into account time or changing dynamics [10,12]. Also, measuring ΦC requires flexible, repeated, and reversible perturbation of arbitrary system subsets, which is infeasible for non-trivial systems (except in simulation). We do not discuss this measure any further. Recently, a new version of Φ has been introduced in the context of the IITC, which we call ΦDM, 'Φ-discrete/Markov' [11]. In contrast to ΦC, ΦDM is defined for systems of discrete elements that evolve through time with Markovian transitions. Specifically, ΦDM measures the information generated when a system transitions to one particular state out of a repertoire of possible states, but only to the extent that this information is generated by the whole system, over and above the information generated independently by the parts [11]. Importantly, ΦDM measures information as reduction in entropy from a prior maximum entropy distribution, which is taken to represent the repertoire of possible states.
It has been shown, using simulations, that ΦDM behaves consistently with several intuitions about dynamical complexity [11]. In particular, high values of ΦDM are generated by networks that exhibit both differentiation and integration in their dynamics. However, ΦDM is defined only for idealized discrete Markovian systems (a Markovian system is one for which the future depends only on the present, and not on the past). This in-principle restriction severely limits its in-practice applicability because complex biological systems are typically continuous (or are measured as continuous) and non-Markovian. This limitation in turn imposes a serious obstacle for developing and evaluating theories, such as the IITC, which depend on quantifying integrated information.
In this paper we introduce an alternative measure of integrated information, ΦE ('Φ-empirical'), which is applicable to time-series data, and to continuous or discrete stochastic systems, Markovian or otherwise (and without perturbation of the studied system). These key features arise because ΦE is based on the reduction in Shannon entropy from the empirical, as opposed to the maximum entropy, distribution. Our basic formulation of ΦE therefore addresses the in-principle restrictions of ΦDM mentioned above. ΦE is best suited for application to stationary systems, for which it provides a single value for a given stationary epoch. However, its in-practice applicability still faces the difficulty of accurately estimating entropies from limited data. This is a problem that scales poorly as the number of elements (variables) increases, especially for continuous systems [16]. Confronting this problem, we show that when states are Gaussian distributed, ΦE can be computed directly from empirical covariance matrices, rendering it extremely easy to apply in practice for these systems. Meanwhile, for non-Gaussian systems, we introduce a second measure, ΦAR ('auto-regressive Φ'), which is based on auto-regressive prediction error. ΦAR can be understood as measuring how well the present state of a system predicts some previous state, but only to the extent that predictions based on the whole outstrip predictions based on the parts considered independently. ΦAR and ΦE are constructed analogously, and indeed for Gaussian systems we are able to show, using a connection between linear regression and information theory [17,18], that they are precisely equivalent. Recognizing this equivalence allows us to interpret ΦE in the same way as ΦAR, i.e., in terms of predictive ability. Importantly, although for non-Gaussian systems ΦAR and ΦE may differ, the former remains easy to measure in practice from empirical covariance matrices.
The difference between ΦE/ΦAR and ΦDM is not only a matter of practical applicability. Using the empirical distribution as opposed to the maximum entropy distribution substantially changes possible interpretations of the measure. According to ΦE, integrated information is a measure of a process, since the empirical distribution is a characterization of the actual behavior of the system. According to ΦDM, integrated information is to some extent a measure of capacity [14], since the maximum entropy distribution is maximally agnostic about the behavior of the system, representing instead its potential or capacity.
The above distinction carries implications for theories, such as the IITC, that ascribe physical meaning to measures of integrated information. Under the IITC, consciousness is explicitly characterized in terms of the capacity of a system [14], and not, following William James [19], as a process. Our new measures imply a Jamesian modification of the IITC by considering consciousness as a process; they also challenge the identity relation between consciousness and integrated information assumed in the IITC. More generally, many other brain-based phenomena are best considered in terms of process rather than capacity, and may admit useful interpretations in terms of integrated information. For example, multi-modal binding and perceptual categorization [20] could involve integrated information in the perceptual domain, and action selection (decision making) [21] may require the integration of sensory, cognitive and motor processes, while retaining differentiation among competing alternatives. In these and other cases, having a measure of integrated information that is framed in terms of process and is practically applicable to time-series data will permit the formulation of testable hypotheses and synthetic models relating information integration to cognitive and neural operations.

Results
The 'Results' section is organized as follows. In the 'Notation, conventions and preliminaries' section we lay out our notation and introduce some necessary mathematical concepts. In the section 'The previous measure, ΦDM' we review ΦDM using our current notation, noting its limitations especially with respect to discrete Markovian systems. The section 'The new measure, ΦE' describes the new measure ΦE and provides practical recipes for its computation either numerically from time-series or analytically, given a generative model of the system, both under Gaussian assumptions. We note that for non-Gaussian systems ΦE remains well-defined even if it is more challenging to calculate. The section 'ΦE for Markovian Gaussian systems' presents the results of various simulations, designed to illustrate the in-practice applicability of ΦE and to explore its properties. We compute ΦE for some canonical networks, optimize connectivity under simple dynamics, and examine the numerical stability of the measure. We also compare ΦE with a version of ΦDM modified to apply to continuous systems, showing quantitative congruence in most cases. The section 'Extension to multiple lags and to MVAR(p) processes' describes some additional simulation results, showing how ΦE can measure integrated information over arbitrary timesteps (lags). In the section 'Auto-regressive Φ (ΦAR)' we describe ΦAR and explain its derivation in terms of relations among conditional entropy, covariance, and linear regression prediction error. We demonstrate the utility of ΦAR by calculating integrated information for representative systems animated by exponentially distributed (i.e., non-Gaussian) dynamics.

Author Summary
A key feature of the human brain is its ability to represent a vast amount of information, and to integrate this information in order to produce specific and selective behaviour, as well as a stream of unified conscious scenes. Attempts have been made to quantify so-called 'integrated information' by formalizing in mathematics the extent to which a system as a whole generates more information than the sum of its parts. However, so far, the resulting measures have turned out to be inapplicable to real neural systems. In this paper we introduce two new measures that can be applied to both realistic neural models and to time-series data garnered from a broad range of neuroimaging and electrophysiological methods. Our work provides new opportunities for examining the role of integrated information in cognition and consciousness, and indeed in the function of any complex biological system. However, our results also pose challenges for theories that ascribe a direct physical meaning to any version of integrated information so far described.

Notation, conventions and preliminaries
We use bold upper-case letters to denote multivariate random variables, and corresponding bold lower-case letters to denote actualizations of random variables. Matrices are denoted by upper-case letters. The n-dimensional identity matrix is denoted by $I_n$ and the n-dimensional square matrix of zeros by $O_n$. The transpose operator is denoted by a superscript 'T', and the determinant by 'det'. Our convention for logarithms is to take them to the natural base e, and to denote them by 'log'.
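Let $X = (X^1, \ldots, X^n)^T$ be a random variable that takes values in the space $\Omega_X$. Then we denote the probability density function by $P_X$, the mean by $\bar{x}$ and the $n \times n$ matrix of covariances, $\mathrm{cov}(X^i, X^j)$, by $\Sigma(X)$. Let $Y = (Y^1, Y^2, \ldots, Y^m)^T$ be a second random variable. Then we denote the $n \times m$ matrix of cross-covariances, $\mathrm{cov}(X^i, Y^j)$, by $\Sigma(X,Y)$. The following quantity will be useful:
$$\Sigma(X|Y) \equiv \Sigma(X) - \Sigma(X,Y)\,\Sigma(Y)^{-1}\,\Sigma(X,Y)^T. \tag{0.1}$$
We call this the partial covariance of $X$ given $Y$, and it is well-defined when $\Sigma(Y)$ is invertible. If $X$ and $Y$ are both multivariate Gaussian variables then the partial covariance $\Sigma(X|Y)$ is precisely the covariance matrix of the conditional variable $X|Y=y$, for any $y$:
$$\Sigma(X|Y) = \Sigma(X|Y=y). \tag{0.2}$$
Entropy $H$ characterizes uncertainty, and is given by
$$H(Y) \equiv -\sum_{y} P_Y(y) \log P_Y(y) \tag{0.3}$$
if $Y$ is discrete; for continuous $Y$ replace the summation by integration:
$$H(Y) \equiv -\int P_Y(y) \log P_Y(y)\, d^m y. \tag{0.4}$$
The mutual information $I(X;Y)$ between $X$ and $Y$ is the average information, or reduction in uncertainty (entropy), about $X$, knowing the outcome of $Y$:
$$I(X;Y) \equiv H(X) - H(X|Y), \tag{0.6}$$
where $H(X|Y) = H(X,Y) - H(Y)$ is the conditional entropy of $X$ given $Y$. Mutual information can also be written in the useful form
$$I(X;Y) = H(X) + H(Y) - H(X,Y), \tag{0.7}$$
from which it follows that mutual information is symmetric in $X$ and $Y$ [16]. If $X$ and $Y$ are both Gaussian,
$$H(X) = \tfrac{1}{2} \log\det\Sigma(X) + \tfrac{1}{2}\, n \log(2\pi e), \tag{0.8}$$
$$H(X|Y) = \tfrac{1}{2} \log\det\Sigma(X|Y) + \tfrac{1}{2}\, n \log(2\pi e), \tag{0.9}$$
$$I(X;Y) = \tfrac{1}{2} \log\frac{\det\Sigma(X)}{\det\Sigma(X|Y)}. \tag{0.10}$$
All these quantities are straightforward to compute empirically from the empirical covariance matrices $\Sigma(X)$, $\Sigma(Y)$ and $\Sigma(X,Y)$, and the expression (0.1).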
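As a quick sanity check on the partial covariance and the Gaussian mutual information, it may help to work through the scalar case. The following is an illustration added here, not part of the original derivation: it assumes two jointly Gaussian scalar variables with unit variance and correlation $\rho$ (with $|\rho| < 1$).

```latex
\begin{align*}
\Sigma(X) &= \Sigma(Y) = 1, \qquad \Sigma(X,Y) = \rho, \\
\Sigma(X|Y) &= \Sigma(X) - \Sigma(X,Y)\,\Sigma(Y)^{-1}\,\Sigma(X,Y)^T = 1 - \rho^2, \\
I(X;Y) &= \tfrac{1}{2}\log\frac{\det\Sigma(X)}{\det\Sigma(X|Y)}
        = -\tfrac{1}{2}\log\left(1-\rho^2\right).
\end{align*}
```

The mutual information vanishes for $\rho = 0$ and grows without bound as $|\rho| \to 1$, as expected for a measure of reduction in uncertainty.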
The Kullback-Leibler (KL) divergence $D_{KL}(P_X \| P_Y)$ is a (non-symmetric) measure of the difference between two probability distributions $P_X$ and $P_Y$ (well-defined when the variables take values in the same space, $\Omega_X = \Omega_Y$). It is given by
$$D_{KL}(P_X \| P_Y) \equiv \sum_{x} P_X(x) \log\frac{P_X(x)}{P_Y(x)} \tag{0.11}$$
if the variables are discrete, or
$$D_{KL}(P_X \| P_Y) \equiv \int P_X(x) \log\frac{P_X(x)}{P_Y(x)}\, d^n x \tag{0.12}$$
if the variables are continuous.
We examine integrated information generated by systems of interconnected dynamical elements. We use the letter $\mathcal{X}$ to denote such a system, and the number of elements in the system is denoted by $|\mathcal{X}|$. A partition $\mathcal{P} = \{M^1, \ldots, M^r\}$ divides the elements of $\mathcal{X}$ into non-overlapping, non-trivial sub-systems, $\mathcal{X} = M^1 \cup M^2 \cup \cdots \cup M^r$. The state of $\mathcal{X}$ at time $t$ is a $|\mathcal{X}|$-dimensional random vector denoted by $X_t$, with entries corresponding to states of individual elements of $\mathcal{X}$. Time is discretized, so $t$ takes integer values. We denote the set of possible states of $\mathcal{X}$ by $S_X$, and the size of this set by $|S_X|$. Analogous notation is used for the states of sub-systems of $\mathcal{X}$.
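To make the notion of a partition concrete, the sketch below enumerates the two-part partitions (bipartitions) of a small system. It is an illustration only: the function name `bipartitions` and the representation of elements as integer labels are assumptions made here, and sub-systems are taken to be any non-empty subsets.

```python
from itertools import combinations

def bipartitions(elements):
    """Yield every bipartition (M1, M2) of `elements` exactly once.

    Both parts are non-empty. The first element is pinned to M1 so that
    (M1, M2) and (M2, M1) are not both produced.
    """
    elements = list(elements)
    first, rest = elements[0], elements[1:]
    # M1 = {first} plus `size` further elements; size < len(rest) keeps M2 non-empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            m1 = (first,) + extra
            m2 = tuple(e for e in rest if e not in extra)
            yield m1, m2

# A system of n elements has 2**(n - 1) - 1 bipartitions.
three = list(bipartitions([1, 2, 3]))
eight = list(bipartitions(range(8)))
```

For three elements this produces the three bipartitions {1},{2,3}; {1,2},{3}; and {1,3},{2}. The count grows exponentially with system size, which is why exhaustive search over partitions becomes expensive for large systems.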
A stationary system is one for which the probability density function for $X_t$ does not change with time $t$. For such systems $\Sigma(X)$ denotes the stationary covariance matrix, and $\Gamma_\tau(X)$ the auto-covariance matrix with time-lag $\tau$:
$$\Gamma_\tau(X) \equiv \Sigma(X_{t-\tau}, X_t). \tag{0.13}$$

The previous measure, ΦDM
In this section we review, following Ref. [11], the most recent version of Φ within integrated information theory, using our current notation. This measure, which we call ΦDM ('Φ-discrete/Markovian'), was defined for discrete, Markovian systems, i.e. systems with (i) a discrete set of possible states, and (ii) dynamics for which the current state depends only on the state at the previous time-step. After laying out the formal description of ΦDM, we briefly discuss these limitations, which motivate our new measures ΦE and ΦAR.
Let $\mathcal{X}$ be a discrete, Markovian system. ΦDM compares the information generated by the whole system to information generated by its parts, when the system transitions to a particular state $X_1 = x$ from a preceding state $X_0$ characterized by the maximum entropy distribution for the system. This is performed by use of KL divergence to compare (i) the conditional probability distribution for the preceding state of the whole given the current state, with (ii) the joint distribution for the preceding states of the parts given their respective current states, formed as the product of the parts' conditional distributions.
The effective information, $\varphi_{DM}[\mathcal{X}; x, \mathcal{P}]$, generated by $\mathcal{X}$ being in state $x$, with respect to the partition $\mathcal{P} = \{M^1, \ldots, M^r\}$, is given by
$$\varphi_{DM}[\mathcal{X}; x, \mathcal{P}] \equiv D_{KL}\Big( P_{X_0|X_1=x} \,\Big\|\, \prod_{k=1}^{r} P_{M^k_0|M^k_1=m^k} \Big). \tag{0.14}$$
Here $m^k$ is the state of the $k$th sub-system of the partition when $\mathcal{X}$ has state $x$.
To specify the probability distributions in (0.14), one must use Bayes' rule. For the distribution of the whole system the formula is
$$P_{X_0|X_1=x}(x') = \frac{P_{X_1|X_0=x'}(x)\, P_{X_0}(x')}{P_{X_1}(x)}. \tag{0.15}$$
Here $P_{X_0}(x')$ is the maximum entropy distribution, so
$$P_{X_0}(x') = \frac{1}{|S_X|} \tag{0.16}$$
for all possible initial states $x' \in S_X$. $P_{X_1|X_0=x'}$ is the conditional probability density for the state at time $t=1$ given that the state at time $t=0$ is $x'$. Given a generative model of the system, this distribution can be derived analytically by examining the transitions allowed by the model. In the absence of a generative model the distribution can be obtained by empirical measurement of the equivalent distribution $P_{X_t|X_{t-1}=x'}$. Note that in neither case is perturbation of the system required, although in the latter case the system must visit all possible states multiple times to allow reasonable estimation of $P_{X_t|X_{t-1}=x'}$. Finally, the denominator $P_{X_1}(x)$ is computed from
$$P_{X_1}(x) = \sum_{x' \in S_X} P_{X_1|X_0=x'}(x)\, P_{X_0}(x'). \tag{0.17}$$
For a part $M$ the analogous Bayes' rule formula is
$$P_{M_0|M_1=m}(m') = \frac{P_{M_1|M_0=m'}(m)\, P_{M_0}(m')}{P_{M_1}(m)}. \tag{0.18}$$
Here $P_{M_0}$ is the maximum entropy distribution on $S_M$. To compute the conditional probability distribution $P_{M_1|M_0=m'}$ for the state at time 1 given the state at time 0 it is necessary to average over states external to $M$. Let $N$ denote the complement of $M$ within $\mathcal{X}$, so $X_t = (M_t, N_t)^T$. Then we have
$$P_{M_1|M_0=m'}(m) = \frac{1}{|S_N|} \sum_{n' \in S_N} \sum_{n \in S_N} P_{X_1|X_0=(m',n')^T}\big((m,n)^T\big). \tag{0.19}$$
(Note that in Ref. [11] $\varphi_{DM}$ is instead computed using a perturbed version of the sub-system $M$, for which the joint distribution of the noise in all the afferent connections ('wires') to $M$ is taken to be maximum entropy. Here we instead assign the maximum entropy distribution to states external to the sub-system. By doing so, we eliminate the step of perturbing sub-systems, and need only perturb the whole system once, namely to impose the maximum entropy distribution as the initial state of the whole system. This choice enables simpler notation and description and does not affect the qualitative behavior of the measure [11].) Finally, $P_{M_1}(m)$ is given by
$$P_{M_1}(m) = \sum_{m' \in S_M} P_{M_1|M_0=m'}(m)\, P_{M_0}(m'). \tag{0.20}$$
Given the probability distributions $P_{X_0|X_1=x}$ and $P_{M^k_0|M^k_1=m^k}$, $k = 1, \ldots, r$, the effective information is computed using the formula (0.11) for the KL divergence.
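The recipe above (Bayes' rule with a uniform prior on the past, averaging over external states for each part, then a KL divergence) can be sketched for a toy system. This is a minimal illustration under stated assumptions, not the implementation of Ref. [11]: the function names, the dictionary representation of transition probabilities, and the example system (two binary elements that deterministically copy each other's previous state) are all hypothetical choices made here.

```python
from itertools import product
from math import log

def posterior_over_past(trans, states, current):
    """Bayes' rule with a uniform (maximum entropy) prior on the past state.

    trans[(past, cur)] holds P(X_1 = cur | X_0 = past).
    """
    prior = 1.0 / len(states)
    joint = {past: trans[(past, current)] * prior for past in states}
    evidence = sum(joint.values())  # P_{X_1}(current)
    return {past: p / evidence for past, p in joint.items()}

def part_transitions(trans, states, idx):
    """Transition probabilities for the part with element indices `idx`:
    average uniformly over past external states, sum over current external states."""
    part_states = sorted({tuple(s[i] for i in idx) for s in states})
    n_ext = len(states) // len(part_states)  # number of external states
    out = {(a, b): 0.0 for a in part_states for b in part_states}
    for past, cur in product(states, repeat=2):
        a = tuple(past[i] for i in idx)
        b = tuple(cur[i] for i in idx)
        out[(a, b)] += trans[(past, cur)] / n_ext
    return out, part_states

def kl_divergence(p, q):
    """Discrete KL divergence in nats; terms with p = 0 contribute nothing."""
    return sum(pi * log(pi / q[k]) for k, pi in p.items() if pi > 0)

def effective_information_dm(trans, states, current, partition):
    """KL divergence between the whole-system posterior over past states and
    the product of the parts' posteriors."""
    whole = posterior_over_past(trans, states, current)
    posts = []
    for idx in partition:
        t_part, s_part = part_transitions(trans, states, idx)
        m = tuple(current[i] for i in idx)
        posts.append((idx, posterior_over_past(t_part, s_part, m)))
    product_dist = {}
    for past in states:
        p = 1.0
        for idx, post in posts:
            p *= post[tuple(past[i] for i in idx)]
        product_dist[past] = p
    return kl_divergence(whole, product_dist)

# Hypothetical example: each element copies the other's previous state.
states = list(product((0, 1), repeat=2))
trans = {(past, cur): 1.0 if cur == (past[1], past[0]) else 0.0
         for past in states for cur in states}
phi_dm = effective_information_dm(trans, states, (0, 1), [(0,), (1,)])
```

In this example the whole system determines its past state exactly, whereas each element on its own carries no information about its own past, so the effective information equals log 4 nats (2 bits).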
The integrated information is defined as the effective information with respect to the minimum information partition (MIP). The MIP, $\mathcal{P}^{MIP}(x)$, is defined as the partition that minimizes the effective information when it is normalized by
$$K(\mathcal{P}) \equiv (r-1)\, \min_k \{ H(M^k_0) \}. \tag{0.21}$$
Normalization is necessary because sub-systems that are almost as large as the whole system typically generate almost as much information as the whole system. Therefore, without normalization, most systems would have a highly imbalanced MIP (e.g., one element versus the remainder of the system) and a trivially small value for integrated information. The normalization $K(\mathcal{P})$ ensures that integrated information is specified using a partition defined using a weighted minimization of the effective information, with a bias towards partitions into sub-systems of roughly equal size. We will discuss the importance of normalization further in the section 'ΦE for Markovian Gaussian systems'. Thus, $\mathcal{P}^{MIP}(x)$ is given by
$$\mathcal{P}^{MIP}(x) \equiv \arg\min_{\mathcal{P}} \left\{ \frac{\varphi_{DM}[\mathcal{X}; x, \mathcal{P}]}{K(\mathcal{P})} \right\}. \tag{0.22}$$
Given the MIP, the integrated information $\Phi_{DM}[\mathcal{X}; x]$ generated by the system $\mathcal{X}$ entering state $x$ is simply the non-normalized effective information with respect to the MIP,
$$\Phi_{DM}[\mathcal{X}; x] \equiv \varphi_{DM}\big[\mathcal{X}; x, \mathcal{P}^{MIP}(x)\big]. \tag{0.23}$$
Importantly, the value of $\Phi_{DM}[\mathcal{X}; x]$ is furnished by the non-normalized effective information because it is supposed to represent a physically meaningful property of the system in the corresponding 'integrated information theory' [14].
For a state-independent alternative to ΦDM, one can replace the effective information with its expectation with respect to the current state $x$, and define the expected integrated information, $\bar{\Phi}_{DM}$, as the expected effective information across the partition that minimizes the normalized expected effective information [11]. The expected effective information, $\bar{\varphi}_{DM}$, is given by [11]
$$\bar{\varphi}_{DM}[\mathcal{X}; \mathcal{P}] \equiv \sum_{x \in S_X} P_{X_1}(x)\, \varphi_{DM}[\mathcal{X}; x, \mathcal{P}] \tag{0.24}$$
$$= \sum_{k=1}^{r} H(M^k_0 | M^k_1) - H(X_0 | X_1) \tag{0.25}$$
$$= I(X_0; X_1) - \sum_{k=1}^{r} I(M^k_0; M^k_1). \tag{0.26}$$
Note that the second expression (0.26), but not the first (0.25), requires that $X_0$ have the maximum entropy distribution [11]. To derive (0.26) from (0.25), one uses that the maximum entropy distribution is uniform, so that
$$H(X_0) = \sum_{k=1}^{r} H(M^k_0). \tag{0.27}$$
This ensures that one can add $H(X_0)$ to the second term on the RHS of (0.25) and subtract $\sum_k H(M^k_0)$ from the first term, and then use Eq. (0.6) to obtain the expression (0.26).
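The step from the conditional-entropy form to the mutual-information form can be written out explicitly. The lines below are a sketch of that argument, assuming the conditional-entropy form is $\sum_k H(M^k_0|M^k_1) - H(X_0|X_1)$ and that the maximum entropy (uniform) prior gives $H(X_0) = \sum_k H(M^k_0)$.

```latex
\begin{align*}
\sum_{k} H(M^k_0 | M^k_1) - H(X_0 | X_1)
  &= \Big[ \sum_{k} H(M^k_0 | M^k_1) - \sum_{k} H(M^k_0) \Big]
     + \Big[ H(X_0) - H(X_0 | X_1) \Big] \\
  &= -\sum_{k} \big[ H(M^k_0) - H(M^k_0 | M^k_1) \big]
     + \big[ H(X_0) - H(X_0 | X_1) \big] \\
  &= I(X_0; X_1) - \sum_{k} I(M^k_0; M^k_1).
\end{align*}
```

The first line adds and subtracts equal quantities, which is legitimate only because of the uniform prior; for a general distribution of $X_0$ the two expressions differ by $H(X_0) - \sum_k H(M^k_0)$.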
We emphasize that ΦDM was defined only for systems that are both discrete and Markovian. The measure cannot be applied to continuous systems (except those with a compact, i.e. closed and bounded, set of states) because there is no uniquely defined maximum entropy distribution for a continuous random variable defined on the real number line [16]. (In fact, the measure is also not applicable to discrete systems with an infinite set of states.) ΦDM can only be applied to Markovian systems because for a non-Markovian system it is not clear how to impose the maximum entropy distribution as an initial condition, implying that the conditional probability distribution $P_{X_1|X_0=x'}$ cannot be uniquely specified by any generative model. For instance, three alternatives are (i) to make all past states independent and maximum entropy; (ii) to set all past states to zero except the most recent; (iii) to set just one past state to maximum entropy and obtain the distribution for other past states from the generative model. There is no immediately apparent way to choose among these alternatives. Taken together, these limitations are important because complex (e.g. neural) systems are typically non-Markovian, and neural signals are often recorded as continuous variables. In 'Methods' we describe an extension to ΦDM that renders it well-defined for stationary continuous, but still Markovian, systems by choosing a maximum entropy distribution based on the stationary variances of the states of individual elements. This enables us to compare ΦDM with our new measure ΦE for some example cases.

The new measure, ΦE
The general case. In this section we introduce a new measure of integrated information, ΦE, constructed analogously to ΦDM, but with modifications to broaden its applicability, both in theory and in practice. ΦE is designed for stochastic stationary systems, for which it provides a single time- and state-independent value (given a timescale of measurement, discussed below). The measure is particularly easy to apply to stationary Gaussian systems, either from time-series data or from a generative model.
The key modification is that rather than measuring information generated by transitions from a hypothetical maximum entropy past state, ΦE instead utilizes the actual distribution of the past state; hence the name ΦE, 'Φ-empirical'. This ensures that the measure does not suffer from the in-principle restrictions that pertain to ΦDM, and can be applied to both discrete and continuous systems with either Markovian or non-Markovian dynamics. (More specifically, ΦE will be well-defined as long as the states $X_t$ of the system are either discrete or have continuous probability densities with respect to a Lebesgue measure $d^n x$.) A second difference is that, in order to be state-independent, ΦE is based on the average information generated by the current state about the past state, as opposed to information generated by a particular current state. Finally, ΦE is defined so as to enable a choice of timescale (indicated by $\tau$) over which integrated information is measured. Thus $\Phi_E[\mathcal{X}; \tau]$ is the integrated information generated by the current state of the system about the state $\tau$ time-steps in the past.
We now define ΦE for a stochastic system with stationary dynamics. As for ΦDM, ΦE is defined via 'effective information'. For the new measure we define the effective information generated by the current state $X_t$ about the state $\tau$ time-steps ago, with respect to bipartition $B = \{M^1, M^2\}$, to be the mutual information generated by the whole system minus the sum of the mutual information generated by the parts within the bipartition. Thus
$$\varphi[\mathcal{X}; \tau, B] \equiv I(X_{t-\tau}; X_t) - \sum_{k=1}^{2} I(M^k_{t-\tau}; M^k_t). \tag{0.28}$$
The integrated information $\Phi_E[\mathcal{X}; \tau]$ is then the non-normalized effective information with respect to the minimum information bipartition (MIB),
$$\Phi_E[\mathcal{X}; \tau] \equiv \varphi\big[\mathcal{X}; \tau, B^{MIB}(\tau)\big], \tag{0.29}$$
where the MIB minimizes the normalized effective information,
$$B^{MIB}(\tau) \equiv \arg\min_{B} \left\{ \frac{\varphi[\mathcal{X}; \tau, B]}{K(B)} \right\}, \tag{0.30}$$
with normalization factor
$$K(B) \equiv \min\{ H(M^1), H(M^2) \}. \tag{0.31}$$
ΦE can either be computed analytically from a generative model, or estimated numerically from time-series data. In either case, one must first obtain estimates of the probability distributions for the states $X_{t-\tau}$ and $X_t$, and their joint distribution $P_{(X_{t-\tau}, X_t)^T}$, as well as the corresponding distributions for all sub-systems. Then, given these distributions, the corresponding entropies can be computed using Eq. (0.3), for a system with discrete states, or Eq. (0.4) for a system with continuous states. Having obtained these entropies, Eq. (0.7) can be used to obtain the mutual information $I(X_{t-\tau}; X_t)$ between the past and current state of the system, and likewise for all sub-systems. Given these quantities, ΦE can then be obtained directly from Eqs. (0.28)-(0.31).
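For a system with a small number of discrete states, the procedure just described can be sketched with plug-in (histogram) entropy estimates. This is a minimal illustration rather than a recommended estimator: the function names are hypothetical, states are assumed to be tuples of element values, and plug-in estimates are known to be biased for short time series.

```python
import random
from collections import Counter
from math import log

def entropy(samples):
    """Plug-in (empirical) Shannon entropy, in nats, of a sequence of hashable states."""
    counts = Counter(samples)
    n = len(samples)
    return -sum(c / n * log(c / n) for c in counts.values())

def lagged_mutual_information(series, tau):
    """Estimate I(X_{t-tau}; X_t) as H(past) + H(present) - H(past, present)."""
    past, present = series[:-tau], series[tau:]
    return entropy(past) + entropy(present) - entropy(list(zip(past, present)))

def effective_information(series, tau, bipartition):
    """Lagged mutual information of the whole minus the sum over the two parts.

    series: list of tuples, one tuple of element states per time step.
    bipartition: pair of index tuples, e.g. ((0,), (1,)).
    """
    whole = lagged_mutual_information(series, tau)
    parts = 0.0
    for idx in bipartition:
        sub = [tuple(state[i] for i in idx) for state in series]
        parts += lagged_mutual_information(sub, tau)
    return whole - parts

# Hypothetical example: element 0 copies the previous state of element 1,
# while element 1 is a fresh fair coin flip at every step.
rng = random.Random(0)
bits = [rng.randint(0, 1) for _ in range(5001)]
series = [(bits[t - 1], bits[t]) for t in range(1, 5001)]
phi = effective_information(series, tau=1, bipartition=((0,), (1,)))
```

In this example neither element predicts its own future, but the whole system retains one bit of its past, so the estimate should lie close to log 2 ≈ 0.693 nats. With only two elements there is a single bipartition, so no search for the MIB is needed.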
For numerical computation, the required probability distributions can in principle be obtained directly from data, although in practice it may be difficult to obtain sufficient data to enable accurate estimation of all the relevant entropies. As we explain in the section 'Computing ΦE empirically under Gaussian assumptions', this difficulty can be readily overcome if states are Gaussian distributed.
For analytic computation of ΦE given a generative model, we note that the probability distributions for $X_{t-\tau}$ and $X_t$ individually are both simply equal to the stationary distribution for the state of the system. Obtaining the joint distribution for $X_{t-\tau}$ and $X_t$ together will depend on the details of the generative model. Once again the situation is much easier in practice for Gaussian systems, in which case only the covariance matrix of each probability distribution is needed (see equation (0.8)). As we show in the section 'Computing ΦE analytically for a Gaussian system', these matrices can be derived easily from a generative model expressed as a generalized connectivity matrix, assuming Gaussian dynamics.
A few further remarks about ΦE are worth making. First, ΦE remains well-defined as a time-dependent quantity for non-stationary stochastic systems; we focus on the stationary case for simplicity, and because of our interest in empirical measurement of ΦE via sampling from time-series data. Second, unlike ΦDM, ΦE is not defined for deterministic systems. This is because it does not incorporate a perturbation through which to introduce probabilities into a deterministic system. Third, we restrict attention to bipartitions for computational efficiency. This is standard practice for computing ΦDM [11,14]. Extension to general partitions is trivial, albeit computationally expensive. Finally, since mutual information is symmetric in its two arguments (0.7), effective information as given by (0.28) can alternatively be read in terms of information generated by the past state $X_{t-\tau}$ about the current state $X_t$.
Our definition (0.28) for the effective information, $\varphi$, is based on the expression (0.26) for the expected effective information, $\bar{\varphi}_{DM}$, in the construction of ΦDM. A viable alternative would be to instead use
$$\tilde{\varphi}[\mathcal{X}; \tau, B] \equiv \sum_{k=1}^{2} H(M^k_{t-\tau} | M^k_t) - H(X_{t-\tau} | X_t), \tag{0.32}$$
the analogue of (0.25). This quantity has previously been defined in Ref. [22] as 'stochastic interaction'. It is the average KL divergence between (i) the past of the whole given the present of the whole, and (ii) the product of this for parts [11]. Replacing $\varphi$ with $\tilde{\varphi}$ in the definition of ΦE leads to a second measure, $\tilde{\Phi}_E$. In general, $\tilde{\varphi}$ will not be exactly equal to $\varphi$. (Equality of their analogues for ΦDM relies on the past state being maximum entropy, see section 'The previous measure, ΦDM'.) However, we show in Table 1 that $\tilde{\Phi}_E$ behaves very similarly to ΦE for the examples we consider in this paper. We choose to focus on ΦE because it explicitly operationalizes the concept of 'information generated by the whole minus the sum of information generated by the parts' (0.28).
In summary, we have defined a new measure of integrated information, ΦE, that is broadly well-defined, and which is easy to measure under Gaussian dynamics, either from time-series data or given a generative model (see below). In contrast, the previous measure ΦDM is only defined for discrete, Markovian systems. As a consequence, ΦE but not ΦDM is applicable to realistic continuous non-Markovian stochastic models of neural systems.
Computing ΦE empirically under Gaussian assumptions. Under Gaussian assumptions, equation (0.10) furnishes an expression for ΦE simply in terms of covariance matrices, enabling straightforward empirical computation. The effective information is given by
$$\varphi[\mathcal{X}; \tau, B] = \frac{1}{2} \log\frac{\det\Sigma(X)}{\det\Sigma(X_{t-\tau} | X_t)} - \sum_{k=1}^{2} \frac{1}{2} \log\frac{\det\Sigma(M^k)}{\det\Sigma(M^k_{t-\tau} | M^k_t)}, \tag{0.33}$$
and the normalization factor $K$ by
$$K(B) = \frac{1}{2} \log \min_k \left\{ (2\pi e)^{|M^k|} \det\Sigma(M^k) \right\}. \tag{0.34}$$
In practice, the procedure for computing ΦE is as follows. First one obtains empirically the covariance matrices $\Sigma(X)$, $\Sigma(X_{t-\tau}, X_t)$ and analogues for all sub-systems. Then one uses Eq. (0.1) to obtain the partial covariance $\Sigma(X_{t-\tau} | X_t)$ and its sub-system analogues. Given these quantities, equations (0.33) and (0.34) furnish estimates for the effective information and normalized effective information with respect to any given bipartition. These estimates allow identification of the MIB and ΦE, via equations (0.29) and (0.30).
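The empirical procedure can be sketched in a few lines of NumPy. This is an illustrative sketch under Gaussian assumptions, not the authors' code: the function names, the `(T, n)` data layout, and the two-element example system are assumptions made here, and the exhaustive search over bipartitions is only practical for small systems.

```python
import numpy as np
from itertools import combinations

def logdet(m):
    """Log-determinant of a positive-definite matrix."""
    return np.linalg.slogdet(m)[1]

def gaussian_entropy(x):
    """Gaussian differential entropy (nats) of the columns of a (T, k) array."""
    cov = np.atleast_2d(np.cov(x.T))
    return 0.5 * (x.shape[1] * np.log(2 * np.pi * np.e) + logdet(cov))

def gaussian_lagged_mi(x, tau):
    """0.5 * log(det Sigma(past) / det Sigma(past | present)) for a (T, k) array."""
    past, present = x[:-tau], x[tau:]
    k = x.shape[1]
    joint = np.cov(np.hstack([past, present]).T)       # (2k, 2k)
    s_past, s_pres, s_cross = joint[:k, :k], joint[k:, k:], joint[:k, k:]
    partial = s_past - s_cross @ np.linalg.solve(s_pres, s_cross.T)
    return 0.5 * (logdet(s_past) - logdet(partial))

def phi_e_gaussian(data, tau=1):
    """Return (Phi_E estimate, minimum information bipartition)."""
    n = data.shape[1]
    whole = gaussian_lagged_mi(data, tau)
    best = None
    for size in range(1, n // 2 + 1):
        for m1 in combinations(range(n), size):
            if 2 * size == n and 0 not in m1:
                continue  # skip mirror images of equal-sized bipartitions
            m2 = tuple(i for i in range(n) if i not in m1)
            parts = [data[:, list(m)] for m in (m1, m2)]
            phi = whole - sum(gaussian_lagged_mi(p, tau) for p in parts)
            k_norm = min(gaussian_entropy(p) for p in parts)  # assumed positive
            if best is None or phi / k_norm < best[0]:
                best = (phi / k_norm, phi, (m1, m2))
    return best[1], best[2]

# Hypothetical example: two elements, each driven by the other's previous state.
rng = np.random.default_rng(0)
A = np.array([[0.0, 0.5], [0.5, 0.0]])
x = np.zeros(2)
data = np.empty((20000, 2))
for t in range(20000):
    x = A @ x + rng.standard_normal(2)
    data[t] = x
phi, mib = phi_e_gaussian(data)
```

For this example the stationary covariance is (4/3) times the identity and neither element predicts its own future, so the estimate should lie near log(4/3) ≈ 0.288 nats. Note that the sketch assumes the entropy used for normalization is positive, which holds here because each element has variance of at least 1; differential entropies can be negative in general.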
Computing ΦE analytically for a Gaussian system. In this section we describe analytical computation of ΦE for Gaussian systems, assuming that the generative model is known. We first recognize that a generative model for a Gaussian stationary system is always equivalent to an MVAR(p) (multivariate auto-regressive) process [18]
$$X_t = \sum_{i=1}^{p} A_i\, X_{t-i} + E_t, \tag{0.35}$$
where the $A_i$, $i = 1, \ldots, p$, can be understood as generalized connectivity matrices acting at different time-lags, and $E_t$ is a stationary multivariate Gaussian 'white noise' source with zero mean and vanishing auto-covariance function, $\Gamma_\tau(E) = 0$, $\tau \neq 0$. (Technically, there also exists the case $p = \infty$, but we do not consider this here, because in practical application there will always be an optimal range of finite $p$ for model fitting.) Below, we show how to calculate ΦE for an MVAR(1) system at timescale $\tau = 1$. Extension to the general $p$, general $\tau$ case is given in the 'Methods' section. Consider the generative model
$$X_t = A\, X_{t-1} + E_t. \tag{0.36}$$
Taking the covariance of both sides of (0.36) gives
$$\Sigma(X) = A\, \Sigma(X)\, A^T + \Sigma(E). \tag{0.37}$$
Noticing that this equation is the discrete-time Lyapunov equation, $\Sigma(X)$ can be computed numerically, given $A$, for example, in Matlab via use of the 'dlyap' command. To compute the partial covariance $\Sigma(X_{t-1} | X_t)$ we need the single time-step auto-covariance matrix
$$\Gamma_1(X) = \Sigma(X_{t-1}, X_t) = \Sigma(X)\, A^T. \tag{0.38}$$
We can then use equation (0.1) to obtain the partial covariance as
$$\Sigma(X_{t-1} | X_t) = \Sigma(X) - \Sigma(X)\, A^T\, \Sigma(X)^{-1} A\, \Sigma(X). \tag{0.39}$$
Having values for $\Sigma(X)$ and $\Sigma(X_{t-1} | X_t)$ allows calculation of the first term in the RHS of (0.33). Calculation of the second term, and of the normalization factor, requires consideration of sub-systems. For a sub-system $M$, we consider the bipartition $\{M, N\}$, and the block decomposition of vectors and matrices according to $X_t = (M_t, N_t)^T$. The matrices $\Sigma(X)$ and $\Gamma_1(X)$ can then be written in the form
$$\Sigma(X) = \begin{pmatrix} \Sigma(M) & \Sigma(M,N) \\ \Sigma(M,N)^T & \Sigma(N) \end{pmatrix}, \qquad \Gamma_1(X) = \begin{pmatrix} \Gamma_1(X)_{mm} & \Gamma_1(X)_{mn} \\ \Gamma_1(X)_{nm} & \Gamma_1(X)_{nn} \end{pmatrix}, \tag{0.40}$$
and we can use that
$$\Sigma(M_{t-1}, M_t) = \Gamma_1(X)_{mm}. \tag{0.41}$$
Then, again from (0.1), the partial covariance is given by
$$\Sigma(M_{t-1} | M_t) = \Sigma(M) - \Gamma_1(X)_{mm}\, \Sigma(M)^{-1}\, \Gamma_1(X)_{mm}^T. \tag{0.42}$$
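The analytic route can also be sketched numerically. The following is a minimal illustration, not the authors' implementation: in place of Matlab's 'dlyap' it solves the discrete-time Lyapunov equation by vectorization with NumPy, which assumes the spectral radius of $A$ is below 1, and the function names and the two-element example are assumptions made here.

```python
import numpy as np

def stationary_cov(A, S_e):
    """Solve Sigma = A Sigma A^T + S_e for Sigma.

    Uses row-major vectorization: vec(A Sigma A^T) = kron(A, A) vec(Sigma).
    Requires the spectral radius of A to be below 1 (a stable process).
    """
    n = A.shape[0]
    vec = np.linalg.solve(np.eye(n * n) - np.kron(A, A), S_e.reshape(-1))
    return vec.reshape(n, n)

def lagged_mi_mvar1(A, S_e, idx):
    """Analytic I(M_{t-1}; M_t) for the sub-system with element indices `idx`
    of the MVAR(1) process X_t = A X_{t-1} + E_t, with noise covariance S_e."""
    S = stationary_cov(A, S_e)
    G1 = S @ A.T                       # Sigma(X_{t-1}, X_t)
    block = np.ix_(idx, idx)
    S_m, G_m = S[block], G1[block]     # sub-system blocks
    partial = S_m - G_m @ np.linalg.solve(S_m, G_m.T)
    return 0.5 * (np.linalg.slogdet(S_m)[1] - np.linalg.slogdet(partial)[1])

# Hypothetical example: two elements, each driven by the other's previous state.
A = np.array([[0.0, 0.5], [0.5, 0.0]])
S_e = np.eye(2)
whole = lagged_mi_mvar1(A, S_e, [0, 1])
parts = lagged_mi_mvar1(A, S_e, [0]) + lagged_mi_mvar1(A, S_e, [1])
phi = whole - parts   # effective information for the only bipartition
```

Here the stationary covariance is (4/3) times the identity, each element's lag-1 auto-covariance is zero, and the effective information equals log(4/3) nats. Passing the whole index set reproduces the whole-system term, so the same function serves for both terms of the effective information.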

ΦE for Markovian Gaussian systems
Canonical examples. We present results from computing ΦE, for timescale $\tau = 1$, for some example Markovian Gaussian systems. Results are given for analytical computation given the generative model, and for numerical computation given simulated time-series data. The example systems are characterized by the MVAR(1) dynamics
$$X_t = A\, X_{t-1} + E_t, \tag{0.43}$$
where $X_t$ contains 8 variables, $A$ is the connectivity matrix, and each component of $E_t$ is an independent Gaussian random variable of mean 0 and variance 1. We considered seven systems, with connectivity as shown in Fig. 1(a)-(g); we refer to these systems as '1(a)', '1(b)', and so on. The corresponding values of ΦE are given in Fig. 1(h) and Table 1. For analytic computation, we performed the procedure described in the section 'Computing ΦE analytically for a Gaussian system'. For simulated measurements, we first obtained time-series data from equation (0.43), and then computed ΦE using the recipe described in the section 'Computing ΦE empirically under Gaussian assumptions'. To examine numerical stability of simulation measurements, we performed 10 trials for each network with 3,000 post-equilibrium data points and a separate set of 10 trials with 10,000 post-equilibrium data points. For all systems, except 1(g) (which we discuss below), the analytically derived (true) value of ΦE lay within approximately 1 standard deviation of the mean value obtained via the simulations, both for 3,000 and 10,000 data points (see Fig. 1(h) and Table 1). This correspondence confirms the consistency of the numerical and analytical approaches described above.
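Generating the simulated time series amounts to iterating the MVAR(1) dynamics and discarding an initial transient. The sketch below shows one way to do this; the function name, the burn-in length, and in particular the connection weight of 0.25 on a unidirectional ring are assumptions made for illustration, since the connection strengths used for the networks of Fig. 1 are not stated in this section.

```python
import numpy as np

def simulate_mvar1(A, n_steps, burn_in=1000, seed=0):
    """Simulate X_t = A X_{t-1} + E_t with independent unit-variance Gaussian
    noise, returning `n_steps` post-equilibrium samples as a (n_steps, n) array."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = np.zeros(n)
    out = np.empty((n_steps, n))
    for t in range(burn_in + n_steps):
        x = A @ x + rng.standard_normal(n)
        if t >= burn_in:               # keep only samples after the transient
            out[t - burn_in] = x
    return out

# Hypothetical connectivity: a unidirectional ring of 8 elements,
# element i receiving from element i - 1 with an assumed weight of 0.25.
n, w = 8, 0.25
A_ring = np.zeros((n, n))
for i in range(n):
    A_ring[i, (i - 1) % n] = w

data = simulate_mvar1(A_ring, n_steps=3000)
```

For this ring, symmetry gives each element a stationary variance of 1/(1 - w²) ≈ 1.067, which offers a simple check that the transient has been discarded and the simulation is behaving as expected. Repeating the call with different seeds yields the independent trials needed to assess numerical stability.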
The values of integrated information mostly correspond with expectations. For example, a ring of reciprocal connections (1(c)) integrates approximately twice as much information as a ring of unidirectional connections (1(b)), which itself integrates approximately twice as much information as a (non-closed) chain of unidirectional connections (1(a)). Also as expected, the homogeneous system 1(d) has a low ΦE value. Perhaps in contrast to expectations, adding sparse long-range 'short-cut' connections to a reciprocal ring (1(e)-1(g)), in the style of a so-called 'small world' network [23,24,25], does not increase ΦE (compare with network 1(c)).
For values of ΦE to be meaningful it is essential that they are stable with respect to numerical computation. To assess numerical stability, we calculated the coefficient of variation (the standard deviation divided by the mean) across each set of 10 trials. For all networks other than 1(g), and for trial sets of both 3,000 and 10,000 data points, the coefficient of variation was less than 0.11, confirming that empirical calculation of ΦE from time-series data is stable for these networks.
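As a small illustration of the stability check, the coefficient of variation across a set of trials can be computed as below. The ten estimates are made-up placeholder numbers, not values from Table 1, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def coefficient_of_variation(estimates):
    """Standard deviation divided by the mean across a set of trials."""
    values = np.asarray(estimates, dtype=float)
    return values.std(ddof=1) / values.mean()

# Hypothetical estimates from 10 trials (illustrative numbers only)
trial_estimates = [0.120, 0.126, 0.131, 0.118, 0.127,
                   0.124, 0.129, 0.122, 0.125, 0.128]
cv = coefficient_of_variation(trial_estimates)
is_stable = cv < 0.11   # threshold quoted in the text
```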
Network 1(g) exhibited instability when measuring ΦE from simulation. As shown in Fig. 1(h), the corresponding values of ΦE fell close to one of two values, one of which was the true (analytically derived) value. For simulations of 3,000 data points, 6/10 trials produced ΦE estimates close to the true value; for 10,000 data points, 4/10 trials provided such estimates. This instability arises from the use of normalized effective information (φ) in identifying the MIB, but non-normalized φ in specifying the corresponding value of ΦE. Given finite data, estimates of φ cannot be guaranteed to be accurate. As a result, inter-trial variation in measuring ΦE from data can arise when (i) there are two (or more) partitions with similar values of normalized φ close to the true minimum (used to identify the MIB), and (ii) these partitions have substantially different values for non-normalized φ. The latter condition will typically hold when partitions with similar normalized φ have significantly different sub-system sizes (see the section 'The previous measure, ΦDM'). Network 1(g) illustrates this difficulty. For this network, the true MIB is the bipartition {{1,6,7,8}, {2,3,4,5}}, for which the normalized φ is 0.0213. However, there is an uneven bipartition, {{1,2,3,4,5}, {6,7,8}}, for which the normalized φ is 0.0218, i.e., very similar to the value for the true MIB. The non-normalized φ for the MIB (i.e. ΦE) is 0.1266, whereas the value for the uneven bipartition is 0.0966. Fig. 1(h) and Table 1 show that empirical measurements of ΦE cluster around these two values.
One may consider that this problem of instability could be avoided by using non-normalized φ to identify the MIB. However, as discussed in the section 'The previous measure, ΦDM', in this case ΦE would always be trivially small because, for any non-trivial system X, a bipartition of the form {{1}, {2,3,...,|X|}} would generate almost as much information as the whole system. A second solution would be to specify ΦE in terms of normalized φ. However, in this case the meaning of ΦE would be substantially altered inasmuch as it could no longer be considered a measure of the quantity of information generated (or integrated) by a system.
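The mechanism behind the instability for network 1(g) can be made concrete with the values quoted above. In the sketch below the 'true' entries are the normalized and non-normalized φ values given in the text for the two competing bipartitions; the 'noisy' entries are hypothetical finite-sample estimates, invented here to show how an error of a few parts in ten thousand in normalized φ switches the selected MIB and hence the reported ΦE.

```python
def phi_from_partitions(candidates):
    """Select the MIB by minimum normalized phi, report its non-normalized phi.

    candidates maps a bipartition to (normalized phi, non-normalized phi).
    """
    mib = min(candidates, key=lambda b: candidates[b][0])
    return mib, candidates[mib][1]

even = ((1, 6, 7, 8), (2, 3, 4, 5))
uneven = ((1, 2, 3, 4, 5), (6, 7, 8))

# Values quoted in the text for network 1(g)
true_values = {even: (0.0213, 0.1266), uneven: (0.0218, 0.0966)}
mib_true, phi_true = phi_from_partitions(true_values)

# Hypothetical estimates from finite data (small errors in normalized phi)
noisy_values = {even: (0.0219, 0.1270), uneven: (0.0215, 0.0960)}
mib_noisy, phi_noisy = phi_from_partitions(noisy_values)
```

With the true values the even bipartition is selected and ΦE is 0.1266; with the perturbed values the uneven bipartition is selected and the reported value drops to about 0.096, mirroring the two clusters seen in the simulations.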
Optimization of networks for generating high ΦE. To examine whether network structures other than reciprocally connected rings could generate high levels of ΦE, we performed numerical optimizations using a genetic algorithm (GA). Specifically, we used ΦE (τ = 1) as an objective function for evolving populations of networks with dynamics governed by MVAR(1) processes (see Eq. (0.43)). We performed two sets of optimizations under different constraints on the connectivity matrix A. In the first set, all connection strengths were fixed ('fixed' condition; two afferents per element, each with strength 0.25). In the second set, connection strengths were allowed to vary ('vary' condition; total afferent to each element equal to 0.5, all afferents to a given element equal and positive). Each condition consisted of 20 separate GAs, each with 30 randomly initialized networks in the population (in the 'vary' condition, networks were initialized with elements having on average 2 afferent connections). Each GA ran for 200 generations, allowing fitness to asymptote. Within each generation, the fitness of each network was determined by analytical computation of ΦE; networks were then ranked by fitness and a new population was formed by rank-based selection and mutation. In the 'fixed' condition, each network was mutated by rearranging 2 connections; in the 'vary' condition each network was mutated by (with equal probability) adding, removing, or swapping 2 connections, followed by renormalization of total afference to each element to 0.5.
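One generation of the 'fixed' condition might be sketched as follows. Several details are assumptions, because the text does not specify them: self-connections are excluded, 'rearranging a connection' is taken to mean re-routing one afferent of a randomly chosen element to a new source, and rank-based selection is implemented with selection probability proportional to rank. The fitness function here is a stand-in (a count of reciprocal connections) used only to make the sketch runnable; in the study, fitness was the analytically computed ΦE.

```python
import numpy as np

STRENGTH, N_AFFERENTS = 0.25, 2

def random_fixed_network(n, rng):
    """Each element (row) receives two afferents of strength 0.25."""
    A = np.zeros((n, n))
    for i in range(n):
        candidates = [j for j in range(n) if j != i]   # assumed: no self-loops
        sources = rng.choice(candidates, size=N_AFFERENTS, replace=False)
        A[i, sources] = STRENGTH
    return A

def mutate_fixed(A, rng, n_moves=2):
    """Rearrange n_moves connections while keeping two afferents per element."""
    A = A.copy()
    n = A.shape[0]
    for _ in range(n_moves):
        i = rng.integers(n)
        old = rng.choice(np.flatnonzero(A[i]))
        free = [j for j in range(n) if j != i and A[i, j] == 0]
        new = rng.choice(free)
        A[i, old], A[i, new] = 0.0, STRENGTH
    return A

def next_generation(population, fitness_fn, rng):
    """Rank-based selection followed by mutation (assumed linear ranking)."""
    ranked = sorted(population, key=fitness_fn)        # worst first, best last
    weights = np.arange(1, len(ranked) + 1, dtype=float)
    parents = rng.choice(len(ranked), size=len(ranked), p=weights / weights.sum())
    return [mutate_fixed(ranked[k], rng) for k in parents]

def stand_in_fitness(A):
    """Placeholder objective: number of reciprocally connected pairs."""
    return float(np.sum((A > 0) & (A.T > 0)) / 2)

rng = np.random.default_rng(0)
population = [random_fixed_network(8, rng) for _ in range(30)]
new_population = next_generation(population, stand_in_fitness, rng)
```

Iterating `next_generation` for 200 generations, with the stand-in replaced by a ΦE computation, would correspond to one GA run as described in the text.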
The results of the optimizations are shown in Fig. 2 and Table 1. Network 2(a) is the fittest (highest ΦE) across all 20 GAs in the 'fixed' condition; this network topology was discovered by 6 out of the 20 GAs in this condition. The network has ΦE = 0.2502, approximately twice the value of the reciprocal ring networks shown in Fig. 1. Network 2(b) is the fittest across all 20 GAs in the 'vary' condition, exhibiting ΦE = 0.2965, i.e., substantially higher again. This particular topology was discovered by only 2/20 GAs, perhaps due to the larger search-space in this condition. It is noteworthy that both of these 'fittest' networks show highly heterogeneous connectivity patterns, consistent with the intuition that integrated information is characterized by the coexistence of differentiated and integrated dynamics.
The observation that the fittest network found in each condition was only reached by a minority of GAs suggests that the ΦE landscape across MVAR(1) systems has local maxima and may exhibit ruggedness and discontinuities. To characterize this landscape, we first plotted the distribution of fitness values across all networks in the final populations from GAs that yielded the (fittest) networks 2(a) and 2(b). Figs. 3(a,b) show that in both cases the modal value of ΦE was substantially less than the maximum value, indicating a lack of convergence suggestive of local maxima and/or ruggedness [26]. We next examined the sensitivity of ΦE to single mutations. Figs. 3(c) and 3(d) show the percentage decrease in ΦE following 200 separate mutations of networks 2(a) and 2(b) respectively (the corresponding mutation type was used in each case, i.e., 'fixed' for 2(a) and 'vary' for 2(b)). For network 2(a), post-mutation fitness decreases cluster in the range 10-20%, with a few instances of approximately 60%. For network 2(b), more than 20% of mutations resulted in a fitness decrease of 50% or more. Together, these observations show that the value of ΦE generated by a network is highly sensitive to small changes in topology and connection strength, further pointing to the ruggedness of the ΦE landscape.
The instability arising from using normalized effective information to find the MIB (see 'Canonical examples') suggests that there may be discontinuities, as well as ruggedness, in the ΦE landscape. We were able to confirm the existence of such discontinuities by incrementally perturbing a specific connection in the example network 2(a). The MIB for this network is the bipartition {{1,4,5,6}, {2,3,7,8}}, for which the normalized effective information is 0.0421. However, there is an uneven bipartition, {{1,2,3,5,7,8}, {4,6}}, with the very similar normalized effective information of 0.0424. We incrementally weakened the connection between the two sub-systems in this uneven bipartition, finding that there is a discontinuous change in ΦE at the point at which the uneven bipartition becomes the MIB (see Fig. 3(e)).
Comparison with $\tilde{\Phi}_{DM}$, $\tilde{\Phi}_E$, and full table of MVAR(1) results. It is instructive to compare results obtained using ΦE with those obtained from $\tilde{\Phi}_{DM}$, the version of ΦDM extended to apply to stationary continuous (but still Markovian) systems (see sections 'The previous measure, ΦDM' and 'Methods'). Table 1 shows (extended) $\tilde{\Phi}_{DM}$ values for the various networks discussed above, as well as the corresponding ΦE values. For networks 1(a) and 1(b) the two measures are exactly equivalent, which is explained by the stationary and maximum entropy distributions coinciding. For the remaining networks (except network 2(a), discussed below), the two measures remain very similar, confirming ΦE as a valid and useful measure of integrated information.
Network 2(a) has a value for ΦE that is approximately double that of the corresponding $\tilde{\Phi}_{DM}$. This discrepancy can also be attributed to the instability arising from normalization. Specifically, the difference between the stationary and maximum entropy distributions in this case is sufficient to lead to two different MIBs, with constituent sub-systems of different sizes. In fact, use of $\tilde{\Phi}_{DM}$ leads to the MIB {{1,2,3,5,7,8}, {4,6}} of the perturbed version of this network discussed in 'Optimization of networks for generating high ΦE'.
We also compared results obtained using ΦE with those obtained using $\tilde{\Phi}_E$, the measure constructed using the alternative expression (0.32) for the effective information (Table 1). We found that the two measures behave in qualitatively the same way across all examples.

Extension to multiple lags and to MVAR(p) processes
The analyses in the previous section were concerned with integrated information measured across a single time-step for MVAR(1) processes. However, ΦE is well-defined for general MVAR(p) processes and can measure integrated information over any number of time-steps (lags). Here we illustrate this property using three simple examples in which ΦE was computed analytically, via the method outlined in 'Computing ΦE analytically for a Gaussian system' and 'Methods'. Fig. 4(a) shows ΦE measured for various values of τ (where τ specifies the lag) for the network 1(c). Fig. 4(b) shows the same analysis conducted for network 2(b). Note that both of these networks are animated by MVAR(1) processes, which explains why ΦE peaks at τ = 1 in both cases. In other words, for these networks, most of the integrated information generated about past states by the current state is generated about the most recent past state (i.e. τ = 1). Fig. 4(c) shows ΦE as a function of τ for the MVAR(3) process

$$X_t = A_1 \cdot X_{t-1} + A_2 \cdot X_{t-2} + A_3 \cdot X_{t-3} + \varepsilon_t, \tag{0.44}$$

where $A_1$, $A_2$ and $A_3$ are respectively the connectivity matrices of networks 1(c), 2(b) and 2(a), each divided by 2. Note that this generalized connectivity matrix was chosen purely to provide an example of an MVAR(3) process. For this system, ΦE peaks at τ = 2, indicating that most information is integrated about the state two time-steps previous to the current state. These examples verify that ΦE can be applied at arbitrary lags to MVAR(p) processes, and that it does detect integrated information at time-scales corresponding to a system's underlying generative mechanism.
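The lagged covariances needed for general p and general τ can be obtained in several ways; the 'Methods' section is not reproduced here, so the sketch below uses one standard route that may differ from the authors' procedure. It stacks the MVAR(p) process into an equivalent first-order (companion) form, solves the Lyapunov equation for the stacked state, and reads off $\Gamma_\tau(X) = \mathrm{cov}(X_t, X_{t-\tau})$ from the top-left block. The function names and the 2-element MVAR(2) example are hypothetical.

```python
import numpy as np

def companion(coeffs):
    """Stack A_1..A_p into the (n p) x (n p) companion matrix."""
    p, n = len(coeffs), coeffs[0].shape[0]
    F = np.zeros((n * p, n * p))
    F[:n, :] = np.hstack(coeffs)
    F[n:, :-n] = np.eye(n * (p - 1))
    return F

def lagged_covariances(coeffs, noise_cov, max_lag):
    """Return [Gamma_0, ..., Gamma_max_lag], Gamma_tau = cov(X_t, X_{t-tau})."""
    n = coeffs[0].shape[0]
    F = companion(coeffs)
    m = F.shape[0]
    Q = np.zeros((m, m))
    Q[:n, :n] = noise_cov                       # noise enters the first block only
    # Stationary covariance of the stacked state: S = F S F^T + Q
    S = np.linalg.solve(np.eye(m * m) - np.kron(F, F), Q.reshape(-1)).reshape(m, m)
    gammas, power = [], np.eye(m)
    for _ in range(max_lag + 1):
        gammas.append((power @ S)[:n, :n])      # top-left block of F^tau S
        power = F @ power
    return gammas

def lagged_mutual_information(gamma0, gamma_tau):
    """Gaussian I(X_{t-tau}; X_t) in nats from Sigma(X) and Gamma_tau(X)."""
    partial = gamma0 - gamma_tau.T @ np.linalg.solve(gamma0, gamma_tau)
    return 0.5 * (np.linalg.slogdet(gamma0)[1] - np.linalg.slogdet(partial)[1])

# Hypothetical MVAR(2) example with two elements
A1 = 0.3 * np.array([[0.0, 1.0], [1.0, 0.0]])
A2 = 0.2 * np.eye(2)
g = lagged_covariances([A1, A2], np.eye(2), max_lag=3)
info_by_lag = [lagged_mutual_information(g[0], g[tau]) for tau in (1, 2, 3)]
```

Only the whole-system term is shown; the sub-system terms would use the corresponding diagonal blocks of each lagged covariance matrix.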

Auto-regressive Φ (ΦAR)
We have presented a measure of integrated information, ΦE, that is practical to measure from time-series data under Gaussian assumptions. However, in the case of stationary, non-Gaussian distributed time-series, ΦE can no longer be obtained directly from empirical covariance matrices, and the required entropies must be obtained via estimation of the corresponding probability distributions. For non-trivial systems, accurate entropy estimation may typically require the collection of more data than is practical.
We now describe how, even for the non-Gaussian case, the recipe used to calculate ΦE under Gaussian assumptions can nonetheless lead to a meaningful quantity reflecting integrated information. We call this quantity ΦAR ('auto-regressive Φ'). By construction, ΦAR is equivalent to ΦE for Gaussian systems; for non-Gaussian systems, however, it may differ. In all cases, because it is based on empirical covariance matrices, it remains easy to measure in practice. The motivation for considering ΦAR as a useful measure of integrated information rests on relations between conditional entropy, partial covariance and linear regression prediction error, explained below [17].
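First we rehearse the concept of linear regression. Let X and Y be two multivariate random variables. Then the linear regression of X on Y is the expression

$$X = a + A \cdot Y + \varepsilon,$$

where A is termed the regression matrix, a is a vector of constants, and ε is the prediction error (or 'residual') [27,28,29,17]. The residual is a random vector uncorrelated with Y. This representation is unique given the distributions of X and Y, with A and a given by

$$A = \Sigma(X,Y)\,\Sigma(Y)^{-1}, \qquad a = \langle X \rangle - A \cdot \langle Y \rangle.$$

The covariance of the residual is then the partial covariance of X given Y, $\Sigma(\varepsilon) = \Sigma(X \mid Y)$, so that for Gaussian variables the conditional entropy can be written in terms of the prediction error as

$$H(X \mid Y) = \tfrac{1}{2} \log \det \Sigma(\varepsilon) + \tfrac{1}{2}\, n \log(2\pi e),$$

where n is the dimension of X. This relation between conditional entropy and linear regression prediction error implies that, for Gaussian systems, ΦE can be re-expressed in terms of linear regression prediction errors. Consider the regressions of the past on the present for each sub-system of a bipartition $\{M^1, M^2\}$ and for the whole system, referred to as (0.51) and (0.52):

$$M^k_{t-\tau} = a^{M^k} + A^{M^k} \cdot M^k_t + \varepsilon^{M^k}, \qquad X_{t-\tau} = a^X + A^X \cdot X_t + \varepsilon^X.$$

The formula (0.33) for effective information can then be re-written as

$$\varphi_{AR}[X;\tau,\{M^1,M^2\}] = \frac{1}{2}\log\frac{\det \Sigma(X)}{\det \Sigma(\varepsilon^X)} - \sum_{k=1}^{2}\frac{1}{2}\log\frac{\det \Sigma(M^k)}{\det \Sigma(\varepsilon^{M^k})}, \tag{0.53}$$

where $\varepsilon^{M^k}$, k = 1,2, and $\varepsilon^X$ are the residuals in the regressions (0.51) and (0.52). ΦAR is then simply φAR for the bipartition that minimizes φAR divided by the normalization factor (0.54). Thus, ΦAR is meaningful as a measure of integrated information because of its formulation in terms of linear regression prediction error. ΦAR compares the whole system to the sum of its parts in terms of the log-ratio of the variance of the past state to the variance of the residual of a linear regression of the past on the present. In other words, ΦAR can be understood as a measure of the extent to which the present global state of the system predicts the past global state of the system, as compared to predictions based on the most informative decomposition of the system into its component parts. When Gaussian conditions are satisfied, the interpretation of ΦAR in terms of (backwards) prediction becomes exactly equivalent to the interpretation of ΦE in terms of Shannon information. Note that in fact, by the symmetry of mutual information (0.7), (0.28), ΦAR could also be expressed in terms of entirely analogous linear regressions in which the present is used to predict the future.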
Understood this way, ΦAR provides an interesting complement to complexity measures based on Granger causality, such as causal density [5], which are also based on linear regression models [30,5,18] (see 'Comparison with causal density and neural complexity').
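The link between regression residuals and partial covariance that underlies ΦAR can be checked numerically. The sketch below, which uses hypothetical non-Gaussian (exponential) data and an arbitrary mixing matrix, fits an ordinary least-squares regression of X on Y and compares the sample covariance of the residuals with the partial covariance computed from sample covariance matrices. The two agree regardless of the distribution, which is why the covariance-based recipe remains well defined outside the Gaussian case.

```python
import numpy as np

rng = np.random.default_rng(1)
n_samples = 5000

# Hypothetical non-Gaussian data: Y is exponential, X depends linearly on Y
Y = rng.exponential(1.0, size=(n_samples, 3))
mixing = np.array([[0.5, 0.2, 0.0], [0.0, 0.3, 0.4]])   # arbitrary example
X = Y @ mixing.T + rng.exponential(1.0, size=(n_samples, 2))

# Ordinary least-squares regression of X on Y, with an intercept
design = np.hstack([np.ones((n_samples, 1)), Y])
coef, *_ = np.linalg.lstsq(design, X, rcond=None)
residual = X - design @ coef
sigma_resid = np.cov(residual, rowvar=False)

# Partial covariance Sigma(X|Y) from the joint sample covariance
joint = np.cov(np.hstack([X, Y]), rowvar=False)
s_xx, s_xy, s_yy = joint[:2, :2], joint[:2, 2:], joint[2:, 2:]
sigma_partial = s_xx - s_xy @ np.linalg.solve(s_yy, s_xy.T)

# Log-ratio of generalized variances: one term of the regression-based formula
log_ratio = 0.5 * (np.linalg.slogdet(s_xx)[1] - np.linalg.slogdet(sigma_resid)[1])
```

Here `coef[1:].T` is the estimated regression matrix, which coincides with the sample version of $\Sigma(X,Y)\,\Sigma(Y)^{-1}$.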
To demonstrate the use of ΦAR as distinct from ΦE, we reanimated the networks 1(a)-1(g), 2(a) and 2(b) with non-Gaussian dynamics. Specifically, we replaced the Gaussian noise sources $\varepsilon_t$ in Eq. (0.43) with independent random variables drawn from exponential distributions with mean (and variance) 1. This selection was motivated by the observation that aggregate assemblies of Poissonian spiking neurons typically follow an exponential distribution [31]. Fig. 5 shows representative examples of single-element empirical stationary distributions resulting from this modified dynamics; all show a large deviation from the Gaussian. For each network we computed ΦAR empirically from 10 trials of 3,000 data points each. The results, shown in Table 1, suggest that in each case ΦAR for the non-Gaussian dynamics is approximately equal to ΦE (= ΦAR) for the Gaussian dynamics. This finding provides support for ΦAR as a useful alternative to ΦE, applicable to non-Gaussian dynamics.
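The non-Gaussian reanimation can be sketched as follows, again for a hypothetical 3-element unidirectional ring rather than one of the networks in Fig. 1. The sketch replaces the Gaussian noise with exponential noise of mean and variance 1 and compares the sample covariance with the stationary covariance of the Gaussian system. For a linear process the stationary covariance depends only on A and the noise covariance, so the two should agree even though the marginal distributions are strongly skewed. This is offered as one way to see why a covariance-based quantity need not change much under this substitution, not as the authors' own explanation.

```python
import numpy as np

def simulate_exponential_mvar1(A, n_samples, burn_in=1000, seed=0):
    """Simulate X_t = A X_{t-1} + noise with exponential(1) noise components."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = np.zeros(n)
    samples = np.empty((n_samples, n))
    for t in range(burn_in + n_samples):
        x = A @ x + rng.exponential(1.0, size=n)   # mean 1, variance 1
        if t >= burn_in:
            samples[t - burn_in] = x
    return samples

A = 0.5 * np.array([[0, 0, 1], [1, 0, 0], [0, 1, 0]], dtype=float)
samples = simulate_exponential_mvar1(A, n_samples=20_000)

# Sample covariance; the nonzero noise mean shifts the mean of X only
sigma_hat = np.cov(samples, rowvar=False)

# Stationary covariance of the Gaussian system with the same A (unit noise)
sigma_gauss = np.linalg.solve(np.eye(9) - np.kron(A, A), np.eye(3).reshape(-1)).reshape(3, 3)

# Skewness of the first element, as a simple index of non-Gaussianity
z = (samples[:, 0] - samples[:, 0].mean()) / samples[:, 0].std()
skewness = float(np.mean(z ** 3))
```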

Discussion
In this paper we have presented two new measures of integrated information, ΦE and ΦAR. As with a previous measure, ΦDM, our measures quantify the information generated by a system over and above that which can be accounted for by its parts acting independently [11]. However, whereas ΦDM is defined only for discrete Markovian systems, and is therefore difficult to measure in practice, our quantities are well defined much more generally, and are easily applicable to stationary time-series data. Our key innovations are (i) to treat information in terms of reduction in uncertainty from the empirical as opposed to the maximum entropy distribution (ΦE), and (ii) to interpret integrated information in terms of predictive ability of the present of a system with respect to its past (ΦAR). Simulations showed that our measures conform to intuitions regarding conjoined dynamical integration and segregation; where comparisons could be made, in most cases our measures quantitatively aligned with ΦDM. By showing how to measure integrated information from time-series data and for non-trivial non-Markovian systems, our results provide new opportunities for examining the role of integrated information in complex biological systems of all kinds, and carry implications for integrated information theories of consciousness. In the following discussion, we use the symbol Φ to refer to integrated information independently of its method of measurement.

Empirical and maximum entropy distributions
As mentioned, many of the restrictions in applicability of ΦDM arise from the use of the maximum entropy distribution to measure information. The maximum entropy distribution is maximally agnostic with respect to the behavior of a system, and represents, in some sense, its potential, or 'capacity' (see 'Integrated information as a measure of consciousness' and 'Comparison with causal density and neural complexity'). However, since the maximum entropy distribution typically does not arise spontaneously, it must be introduced as the distribution of a hypothetical initial state [11]. To compute ΦDM one therefore has to characterize evolution from all possible initial states of the system. However, for most practical purposes, especially in biology, it is only possible to experimentally examine systems in the context of their ongoing evolution as a sequence of states. Unless the system is Markovian, evolution from a state with history is not the same as evolution from a hypothetical initial state, implying that ΦDM cannot be applied to non-Markovian systems (with the exception of idealized simulated systems for which a separate generative model can be written down for evolution from the initial state). Equally important, but easier to appreciate, is that it is not possible to apply ΦDM to continuous systems (except those with a compact, i.e. closed and bounded, set of states) because there is no uniquely defined maximum entropy distribution for a continuous random variable defined on the real number line [16].
Our new measure ΦE eliminates the need to consider the maximum entropy distribution by being based instead on the information generated by the current state of the system about the actual state of the system some number of time-steps in the past. This approach lifts the conditions that the system be discrete and Markovian. (Note however that ΦDM but not ΦE is applicable to deterministic systems, by virtue of introducing probabilities via the maximum entropy initial state.) In principle, use of the empirical distribution de-emphasizes the notion of 'capacity' because the generation of information is measured with respect to what the system has done rather than what it could do. However, over large samples and for ergodic systems, this distinction becomes increasingly blurred. In practice, computing ΦE via sampling from time-series requires the data to be stationary. We recognize that not all complex biological systems generate stationary dynamics (see, e.g., Ref. [32]). However, stationarity is a common pre-requisite for statistical analysis of time-series data [33], and neural data can often be brought into this form, for example by detrending, taking first-differences and/or binning observations into short time windows [34]. Furthermore, neural dynamics are often characterized as a series of 'metastable' states [35,36,37], each of which may be locally stationary. Stationarity can also depend on the spatiotemporal granularity of observation. Dynamics that appear non-stationary at one time scale may exhibit stationarity when sampled over different time scales, underlining the principle that data acquisition should be guided by the constraints of subsequent analysis methods.
Use of the empirical, rather than maximum entropy, distribution also changes the means by which Φ is computed. To compute ΦDM, one requires the conditional probability distributions for the past state given the present state, but with an a priori maximum entropy distribution on the past state. Because of the maximum entropy condition (which represents 'perturbation' of the system), these distributions cannot be obtained empirically, but they can be obtained by applying Bayes' rule given a forward dynamical model estimated from the data (i.e. conditional probability distributions for the present state, given the past state). By contrast, computation of ΦE does not require Bayes' rule because, in the absence of (maximum entropy) perturbation, one can obtain the full joint distribution for the past and present directly from the data.

Practical applicability and Gaussian dynamics
ΦE is particularly easy to apply to data under Gaussian assumptions. This is because the relevant entropies can be estimated directly from empirical covariance matrices. It is also possible to compute ΦE analytically from a generative model for a Gaussian system (i.e., to any desired level of accuracy, without explicitly simulating or observing its dynamics); in that case, one obtains the necessary covariance matrices analytically. This means that ΦE can be evaluated in practice for a broad range of biological systems.
While Gaussian dynamics are common in biology (and the assumption of Gaussianity even more so), many systems depart from this assumption. For example, the spiking activity of populations of neurons typically exhibits exponentially distributed dynamics. For the non-Gaussian case, one can still in principle calculate ΦE by obtaining the necessary entropies directly from data. However, in practice, accurately obtaining all of the underlying probability distributions may typically require the collection of more data than is practical. To overcome this, we introduced the second measure ΦAR. This is constructed analogously to ΦE, but with information replaced by the reduction in the generalized covariance of the past state under prediction via linear regression on the current state. ΦAR is interpreted as measuring how well the present state of a system predicts some previous state, but only to the extent that predictions based on the whole outstrip predictions based on parts independently. ΦAR and ΦE are equivalent for Gaussian systems, but otherwise differ (recall, however, that ΦAR can be obtained for any system by using the recipe for computing ΦE for a Gaussian system). In our examples, ΦAR was in fact insensitive to a change from Gaussian noise to exponentially distributed noise, supporting its use as an alternative to ΦE.

Normalization and instability
All versions of Φ require a normalization step. Specifically, Φ is determined by the non-normalized effective information (φ) across a minimum information bipartition (MIB), which is specified as the bipartition that minimizes the normalized φ (the informational 'weakest link'). Normalization enforces a bias towards bipartitions consisting of sub-systems of roughly equal size. Without normalization, MIBs would typically divide systems into single elements versus the remainder of the system, leading to trivially small values of Φ. On the other hand, it remains important to determine the value of Φ using the non-normalized φ in order to allow Φ to be interpreted as a quantity of information.
The use of normalization, as just described, leads to instabilities. Our simulations have shown that ΦE can be (i) discontinuous under a continuous perturbation of dynamics, and (ii) highly sensitive to the accuracy of entropy estimation from finite data. In our examples, these instabilities arose precisely when there were multiple partitions with similar values of normalized φ close to the true minimum and these partitions had substantially different values of non-normalized φ. This instability does not arise for all systems, and indeed for most of our examples ΦE is numerically stable. Nonetheless, the embedding of normalization within the definition of Φ challenges ascription of physical meaning to any measured value of Φ. This is because the value of Φ is in all cases dependent to some arbitrary degree on the normalization process involved in determining the MIB.
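The two-stage procedure (select the MIB with normalized φ, report non-normalized φ) can be sketched end to end for a Gaussian MVAR(1) system at τ = 1. The sketch rests on assumptions that go beyond this section: formula (0.33) is taken to be the whole-system mutual information between past and present minus the sum of the same quantity over the two parts, and the normalization factor is taken to be the smaller of the two parts' Gaussian differential entropies, which must be positive for the ratio to be meaningful. Natural logarithms are used, so values are in nats and are not claimed to match Table 1. All function names are illustrative.

```python
import itertools
import numpy as np

def stationary_covariance(A, noise_cov):
    """Solve Sigma = A Sigma A^T + noise_cov by vectorization."""
    n = A.shape[0]
    lhs = np.eye(n * n) - np.kron(A, A)
    return np.linalg.solve(lhs, noise_cov.reshape(-1)).reshape(n, n)

def mutual_information(sigma, gamma):
    """Gaussian I(past; present) = 1/2 log(det Sigma / det Sigma(past|present))."""
    partial = sigma - gamma.T @ np.linalg.solve(sigma, gamma)
    return 0.5 * (np.linalg.slogdet(sigma)[1] - np.linalg.slogdet(partial)[1])

def gaussian_entropy(sigma):
    """Differential entropy of a Gaussian with covariance sigma (nats)."""
    n = sigma.shape[0]
    return 0.5 * (np.linalg.slogdet(sigma)[1] + n * np.log(2 * np.pi * np.e))

def bipartitions(n):
    """Yield each bipartition of range(n) once, smaller part first."""
    for size in range(1, n // 2 + 1):
        for part in itertools.combinations(range(n), size):
            if 2 * size == n and 0 not in part:
                continue                      # skip mirror image of an even split
            rest = tuple(e for e in range(n) if e not in part)
            yield part, rest

def phi_e(sigma, gamma):
    """Return (non-normalized phi at the MIB, the MIB itself)."""
    whole = mutual_information(sigma, gamma)
    best = None
    for parts in bipartitions(sigma.shape[0]):
        blocks = [np.ix_(m, m) for m in parts]
        phi = whole - sum(mutual_information(sigma[b], gamma[b]) for b in blocks)
        norm = min(gaussian_entropy(sigma[b]) for b in blocks)  # assumed form
        if best is None or phi / norm < best[0]:
            best = (phi / norm, phi, parts)
    return best[1], best[2]

# Two reciprocally connected elements
A_pair = np.array([[0.0, 0.5], [0.5, 0.0]])
sigma_pair = stationary_covariance(A_pair, np.eye(2))
phi_pair, mib_pair = phi_e(sigma_pair, A_pair @ sigma_pair)

# Three mutually independent elements: nothing is integrated
A_indep = 0.5 * np.eye(3)
sigma_indep = stationary_covariance(A_indep, np.eye(3))
phi_indep, mib_indep = phi_e(sigma_indep, A_indep @ sigma_indep)
```

Exhaustive enumeration of bipartitions is feasible for the 8-element networks considered here (127 bipartitions) but grows exponentially with system size.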

Integrated information as a measure of consciousness
Previous measures of integrated information (ΦC and ΦDM) were formulated in the context of a theory of consciousness, the 'integrated information theory of consciousness' (IITC). According to the IITC, consciousness is integrated information, and has the status of a fundamental property of the universe, equivalent to mass, charge, and the like [14]. On this theory a low value of integrated information would correspond to a low conscious 'level' (e.g., coma, general anesthesia, deep dreamless sleep) and a high value to normal conscious wakefulness. If one subscribes to the theory using ΦDM, then one must interpret consciousness (integrated information) as a function of state transitions [11]; accordingly, one cannot ask about the conscious level of a system per se. By contrast, if one applies ΦE or ΦAR to a stationary system then they are state-independent, and so subscribing to the IITC with these measures involves viewing integrated information as a property of the system's dynamics. This in turn would imply that (i) conscious level is constant during each stationary epoch in brain activity, and (ii) conscious level changes when functional connectivity changes, modifying the stationary statistics. This view recalls William James' notion of consciousness as a process [19] and is consistent with a large amount of empirical evidence showing correlations between conscious level and plausibly stationary epochs of brain activity. For example, normal conscious wakefulness is characterized by low-amplitude high-frequency oscillations in the cortical EEG [38], whereas epileptic absence seizures are characterized instead by increased synchrony in thalamocortical systems [39]. As mentioned in the section 'Empirical and maximum entropy distributions', neural dynamics may be metastable [35,36,37], with locally stationary periods corresponding to a conscious state with a particular level and content.
Our results now make it possible to measure the integrated information corresponding to these various states and to compare these values with other indices of consciousness, both subjective (e.g., verbal reports, confidence ratings, etc.) and objective (e.g., EEG synchrony, widespread brain activity, etc.) [40]. Importantly, it is now possible to quantitatively compare integrated information with other measures of neural dynamics that operationalize in different ways the notion that consciousness conjoins dynamical integration and differentiation, such as 'causal density' [41] and 'neural complexity' [8] (see 'Comparison with causal density and neural complexity').
An important feature of the IITC as previously expressed is that consciousness qua Φ is best considered as a capacity (equivalently a potential, or disposition), and not as an 'object' or a process [14]. The original ΦC operationalized the notion of capacity by subjecting a system to all possible perturbations and examining its responses. The recent ΦDM measures information as a reduction in entropy from the maximum entropy distribution, which can be taken to correspond to the capacity of a system. However, because ΦDM is specified by state transitions it is not a 'pure' measure of capacity; rather, it is a measure of capacity modulated by a system's dynamics. By measuring Φ with reference to the stationary distribution, our measures depart from the notion of consciousness as a capacity. The stationary distribution characterizes the capacity of a system only to the extent that it is realized in the system's behaviour. ΦE and ΦAR can therefore be construed as measures of a process modulated by capacity, aligning more closely with the Jamesian intuition.
The notion that Φ exists as a 'fundamental property' deserves comment. As described in the section 'Normalization and instability', our results challenge the ascription of physical meaning to Φ, in virtue of its exquisite sensitivity to the normalization process involved in specifying the MIB: this challenge pertains equally to the notion of Φ as a 'fundamental quantity'. A further challenge to the ascription of physical meaning to Φ is the fact that it is not invariant under a change of coordinates, since this leads to a different set of sub-systems over which to minimize the effective information. An interesting question for future work is to examine whether, under certain conditions, the set of coordinates that maximizes Φ could be taken to define 'natural' coordinates, or macroscopic variables, for the system. In any case, it does not seem necessary to consider Φ as a strict physical quantity in order to measure the integrated information corresponding to a system's state transitions or stationary dynamics, nor to relate these measurements to conscious level and content. In other words, one can depart from the IITC by interpreting Φ as accounting for particular aspects of consciousness without the further step of claiming identity [9].

Integrated information in other neurocognitive processes
Although Φ was originally developed in the context of a theory of consciousness, it is plausible that integrated information, and (more generally) conjoined functional integration and differentiation, play key roles in other cognitive and neural processes. Previous formulations (ΦDM, ΦC) are poorly suited to investigating these roles, not only because of practical inapplicability, but also because they characterize integrated information in terms of capacity rather than process. Whereas consciousness under some theories may be considered as a capacity (see above), neurocognitive properties in general are best considered as processes. Having a measure of Φ that is framed in terms of process, and that is easy to apply in practice, therefore permits the framing of testable hypotheses, and the specification of synthetic models, aimed at examining the role of integrated information in neurocognitive processes broadly construed. For example, multimodal binding and perceptual categorization [20], and action selection (decision making) [21] plausibly involve integrated information and could be profitably analyzed using our methods. Already, related measures of dynamical complexity (neural complexity and causal density, see below) have been correlated with the ability of simulated agents to deploy flexible behavior, suggesting a role for such dynamics in sensorimotor coordination in rich environments [6,41]. Our results now allow integrated information to be applied in similar situations, facilitating comparative analyses.

Comparison with causal density and neural complexity
Φ is one among a family of recent measures that aim to characterize, in different ways, the coexistence of integration and differentiation in a system's dynamics. Two alternative measures are 'causal density' [41] and 'neural complexity' [42]. Here, we briefly summarize the similarities and differences among these measures, in order to set Φ into a broader context.
Causal density, like ΦAR and ΦE (but in contrast to ΦDM and ΦC), is a measure of process rather than capacity. In virtue of being based on 'Granger causality', it also shares with Φ a sensitivity to causal interactions within a system. A key difference, however, is that causal density is based on all causal interactions, and not just those across a particular partition; thus causal density avoids the normalization problems described above ('Normalization and instability'). Briefly, Granger causality is a statistical measure of causal influence which asserts that a variable $X_1$ 'Granger causes' another variable $X_2$ if information in the past of $X_1$ helps predict the future of $X_2$, above and beyond information already in the past of $X_2$ (and, optionally, in the past of a set of conditioning variables $X_{3 \dots N}$) [30,43]. Causal density is then the (weighted) fraction of causal interactions among all elements that are statistically significant. High causal density indicates that elements within a system are both globally coordinated in their activity (to be useful for predicting each other's activity) and at the same time dynamically distinct (so that different elements contribute in different ways to these predictions). Granger causality (and causal density) is typically calculated using linear auto-regressive models, which brings about an interesting comparison with ΦAR. In a loose sense, integrated information, as measured by ΦAR or ΦE, can be thought of as a variety of 'causal density' that quantifies the strength of the weakest bidirectional causal link between any two halves of the system. Forthcoming work will investigate further the links between ΦAR and causal density.
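The regression-based definition of Granger causality can be illustrated with a minimal bivariate sketch. It computes the log-ratio of the residual variance of a restricted model (the target's own past) to that of a full model (the target's own past plus the source's past). The conditioning on further variables and the significance testing needed for causal density are deliberately left out, and the simulated system, in which $X_1$ drives $X_2$ but not the reverse, is a hypothetical example.

```python
import numpy as np

def residual_variance(target, predictors):
    """Variance of the residual of a least-squares regression with intercept."""
    design = np.hstack([np.ones((len(target), 1)), predictors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return np.var(target - design @ coef)

def granger_causality(source, target, order=1):
    """log(restricted residual variance / full residual variance)."""
    n = len(target)
    future = target[order:]
    own_past = np.column_stack([target[order - k:n - k] for k in range(1, order + 1)])
    src_past = np.column_stack([source[order - k:n - k] for k in range(1, order + 1)])
    restricted = residual_variance(future, own_past)
    full = residual_variance(future, np.hstack([own_past, src_past]))
    return np.log(restricted / full)

# Hypothetical system: x1 is autoregressive and drives x2 with a one-step delay
rng = np.random.default_rng(2)
n_samples = 5000
x1 = np.zeros(n_samples)
x2 = np.zeros(n_samples)
for t in range(1, n_samples):
    x1[t] = 0.5 * x1[t - 1] + rng.standard_normal()
    x2[t] = 0.5 * x1[t - 1] + rng.standard_normal()

gc_1_to_2 = granger_causality(x1, x2)   # expected to be clearly positive
gc_2_to_1 = granger_causality(x2, x1)   # expected to be close to zero
```

Aggregating such pairwise (or conditional) values over all ordered pairs of elements, after significance testing, would give a causal-density-style summary.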
Neural complexity is calculated as the sum of the average mutual information across all bipartitions of a system [42]. Unlike Φ and causal density, it does not reflect causal interactions within a system; however, like causal density, it is a measure of process rather than capacity. Neural complexity is maximal in a system that is globally integrated at the level of large subsystems, while exhibiting a high degree of segregation between smaller subsystems. (Note: The original papers describing neural complexity contained an error in calculating the covariance matrix from a generative model, which has been subsequently corrected in [44]. However, it appears that this error may still affect extant calculations of ΦC.) A recent result [17] showing an equivalence between Granger causality and 'transfer entropy' (a time-directed version of mutual information) allows causal density to be related directly to neural complexity. Specifically, one can define a 'bipartition causal density' as a weighted average Granger causality (transfer entropy) across all bipartitions of a system (this definition also requires extension of Granger causality to multivariate variables) [18]. This measure furnishes a 'time-directed' version of neural complexity based on transfer entropy rather than mutual information.
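As a rough illustration of the quantity described above, the sketch below computes a neural-complexity-like sum for a Gaussian system from its covariance matrix. Several details are assumptions rather than statements from this paper: the variables are taken to be jointly Gaussian, the mutual information is averaged over all subsets of each size k and then summed over k = 1, ..., n/2, and the function names `gaussian_mi` and `neural_complexity` are hypothetical.

```python
import itertools
import numpy as np

def gaussian_mi(cov, part):
    """Mutual information (nats) between a subset and its complement,
    for a jointly Gaussian system with covariance matrix cov."""
    part = list(part)
    rest = [i for i in range(cov.shape[0]) if i not in part]
    _, logdet_a = np.linalg.slogdet(cov[np.ix_(part, part)])
    _, logdet_b = np.linalg.slogdet(cov[np.ix_(rest, rest)])
    _, logdet_ab = np.linalg.slogdet(cov)
    return 0.5 * (logdet_a + logdet_b - logdet_ab)

def neural_complexity(cov):
    """Sum over subset sizes k of the average bipartition mutual information."""
    n = cov.shape[0]
    total = 0.0
    for k in range(1, n // 2 + 1):
        values = [gaussian_mi(cov, subset)
                  for subset in itertools.combinations(range(n), k)]
        total += np.mean(values)
    return total
```

For a system of independent elements (diagonal covariance) every mutual information term vanishes, so the sum is zero; any correlation between elements makes it positive.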
These relations together suggest common foundations for measures of coexisting integration and differentiation. However, further work is needed to fully establish their theoretical interdependencies and their empirical convergences and divergences.

Comparison with other measures
Characterizing complexity is a diverse field, and there are other measures that capture complex properties other than conjoined differentiation and integration. For example, 'thermodynamic depth' [45] can be interpreted as a measure of how hard it is to put a system together, and is based on the joint entropy of all past states, given the current state. Φ, by contrast, considers only one past state. An interesting further modification to Φ could involve information between the present and the whole past trajectory of the system. Another measure of statistical interdependence, 'informational coherence', considers the optimal predictive state for each time-series, and then measures mutual information between these [46]. In related work by Ay et al., the whole system is compared to the sum of individual elements [22,47,48], and the analysis goes beyond examination of conditional entropies to a more thorough mathematical treatment in terms of information geometry. While it is beyond the present scope to examine the formal correspondences among these measures, other related measures, and the measures described above, the growing interest in quantitative measures of complexity further emphasizes the need to formulate theoretically principled measures that are also simple to apply in practice.

Limitations and extensions
Although our measures represent substantial improvements in practical applicability of measures of integrated information, several limitations remain. Most prominently, the normalization procedure leads to instabilities in the measurement process and undercuts ascription of physical meaning to Φ. Addressing this problem stands as a key theoretical challenge. We have only considered application of our measures to stationary dynamics. Future work may extend consideration to non-stationary (but still continuous and non-Markovian) processes, potentially capturing important non-stationary aspects of neural dynamics. In addition, our measures are applicable only to stochastic systems. While extension to closed deterministic systems may be of some value, most complex biological systems have stochastic components, especially when considered in interaction with a (stochastic) environment [49,50]. Finally, our measures share with previous measures the computational challenge posed by the combinatorial explosion in partitions of a system as the number of elements increases. Possibly, imposing priors on the search for the minimum information partition may mitigate this challenge.
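To make the combinatorial explosion concrete, the short Python sketch below enumerates every unordered bipartition of a system of n elements; there are 2^(n-1) - 1 of them. The generator name `bipartitions` is hypothetical and the code is only an illustration of exhaustive search, not a proposal for how the search should be constrained.

```python
from itertools import combinations

def bipartitions(n):
    """Yield each unordered bipartition (M, N) of the elements 0..n-1 once."""
    elements = range(n)
    for k in range(1, n // 2 + 1):
        for M in combinations(elements, k):
            # For equal-sized halves, keep only the half containing element 0
            # so that {M, N} and {N, M} are not both produced.
            if 2 * k == n and 0 not in M:
                continue
            N = tuple(i for i in elements if i not in M)
            yield M, N

if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, sum(1 for _ in bipartitions(n)))
```

Because the count roughly doubles with every added element, exhaustive evaluation over all bipartitions quickly becomes impractical as the system grows.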
We have only considered a first-order, linear approximation for computing entropies/information from data. While this is useful for drawing comparison with Granger causality and causal density, there now exist more advanced approximation techniques that could be used in future work, for example additive regression [51] or kernel regression [52]. Regarding estimation of entropy and mutual information without employing a regression model, we have only considered this via the intermediate step of density estimation. Again, future work could investigate the applicability of more advanced techniques [53,54] that avoid this step.
As well as addressing the above challenges, future work will (i) empirically examine integrated information for time-series data acquired from neuroimaging and other biological datasets, in order to test intuitions regarding consciousness and other neurocognitive processes; (ii) investigate in models how integrated information is modulated by input and output relations of a system embedded in, and interacting with, a surrounding environment; and (iii) determine theoretically the relations between integrated information and alternative measures of dynamical complexity and metastability.

Methods
Text S1 in 'Supporting Information' contains software enabling calculation of ΦE and ΦAR, as well as functions which allow regeneration of some of the simulations we describe.

Extension and computation of ΦDM for an MVAR(1) process
To extend ΦDM to stationary continuous Markovian systems, we have to address the problem that there is no well-defined maximum entropy distribution for such systems. We do this by replacing the 'maximum entropy distribution' with the distribution for which the state of each element is independent of the states of all other elements, is Gaussian distributed, and has mean and variance equal to those of its corresponding stationary distribution. Thus, we take

$$X_0 \sim N\big(\bar{x}, \Sigma_D(X)\big),$$

where $\bar{x}$ is the stationary mean and $\Sigma_D(X)$ is the diagonal matrix whose diagonal entries are those of the stationary covariance matrix $\Sigma(X)$ (Eq. (0.57)).

Having defined a distribution for the initial state $X_0$, we explain how to compute the expected integrated information, $\bar{\Phi}_{DM}$, for MVAR(1) processes (0.36). The computation proceeds analytically, given the generative model, which is specified by the connectivity matrix $A$ and the covariance matrix of the noise, $\Omega := \Sigma(E)$. Alternatively, an estimate of $\bar{\Phi}_{DM}$ from time-series data can be obtained by using estimates of $A$ and $\Omega$. The linear-regression formulae (0.46) and (0.48) yield the estimates $\hat{A}$ and

$$\hat{\Omega} = \hat{\Sigma}(X_t \mid X_{t-1}), \qquad (0.59)$$

where the hat symbol denotes empirical quantities. Given $A$ and $\Omega$ (or their estimates $\hat{A}$ and $\hat{\Omega}$), the covariance matrix $\Sigma(X)$ can be obtained via the discrete-time Lyapunov equation (0.37), and $\Sigma_D$ from Eq. (0.57).
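The steps in this paragraph can be sketched in Python as follows. This is an illustration under stated assumptions, not the software of Text S1: the MVAR(1) process is assumed to have the form X_t = A X_{t-1} + E_t with noise covariance `Omega`, so that the stationary covariance satisfies Sigma = A Sigma A^T + Omega; the regression estimates are ordinary least squares; and all function names are hypothetical.

```python
import numpy as np

def simulate_mvar1(A, Omega, T, seed=0):
    """Simulate X_t = A X_{t-1} + E_t with Gaussian noise of covariance Omega."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    L = np.linalg.cholesky(Omega)
    X = np.zeros((T, n))
    for t in range(1, T):
        X[t] = A @ X[t - 1] + L @ rng.standard_normal(n)
    return X

def estimate_mvar1(X):
    """Least-squares estimates of A and of the residual covariance Omega."""
    X = X - X.mean(axis=0)
    past, future = X[:-1], X[1:]
    A_T, *_ = np.linalg.lstsq(past, future, rcond=None)  # future ~ past @ A.T
    resid = future - past @ A_T
    return A_T.T, resid.T @ resid / len(resid)

def stationary_cov(A, Omega):
    """Solve the discrete-time Lyapunov equation Sigma = A Sigma A^T + Omega
    by vectorisation: vec(Sigma) = (I - A kron A)^{-1} vec(Omega)."""
    n = A.shape[0]
    vec = np.linalg.solve(np.eye(n * n) - np.kron(A, A), Omega.reshape(-1))
    return vec.reshape(n, n)

def diagonal_cov(Sigma):
    """Covariance of the independent-element distribution used for X_0."""
    return np.diag(np.diag(Sigma))
```

The Kronecker-product solve is convenient for small systems; for larger ones a dedicated Lyapunov solver would be a more economical choice.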
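To compute the conditional probability $P_{X_0 \mid X_1 = x}$ we first use the MVAR(1) dynamics (0.36) to obtain the distribution of $X_1 \mid X_0 = x'$ as

$$X_1 \mid X_0 = x' \sim N(A x', \Omega), \qquad (0.61)$$

where $\Omega$ is the covariance matrix of the noise. Then we use Bayes' rule (0.15) to obtain

$$P(X_0 = x' \mid X_1 = x) \propto \exp\left[-\tfrac{1}{2}(x - A x')^T \Omega^{-1} (x - A x') - \tfrac{1}{2}(x' - \bar{x})^T \Sigma_D(X)^{-1} (x' - \bar{x})\right]. \qquad (0.63)$$

From the term quadratic in $x'$ we can obtain the inverse of the covariance matrix of (the Gaussian distributed) conditional variable $X_0 \mid X_1 = x$ as

$$\Sigma(X_0 \mid X_1 = x)^{-1} = \Sigma_D(X)^{-1} + A^T \Omega^{-1} A,$$

and hence express the conditional entropy $H(X_0 \mid X_1 = x)$ in terms of the connectivity and stationary covariance matrices:

$$H(X_0 \mid X_1 = x) = \tfrac{1}{2} n \log(2\pi e) - \tfrac{1}{2} \log\det\left[\Sigma_D(X)^{-1} + A^T \Omega^{-1} A\right], \qquad (0.65)$$

where $n$ is the number of elements and, by the Lyapunov equation (0.37), $\Omega = \Sigma(X) - A\,\Sigma(X)\,A^T$.

For a given sub-system $M$, we have to consider the bipartition $X = \{M, N\}$, and the block decomposition of vectors and matrices according to $X_t = (M_t, N_t)^T$. The entropy formulae (0.65) and (0.70) furnish the sufficient quantities for computing $\bar{\Phi}_{DM}$ as described in the section 'The previous measure, ΦDM', using the expression (0.25) for the expected effective information. For present purposes, as with ΦE, we restrict attention to bipartitions only.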
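A minimal numerical sketch of the whole-system conditional entropy is given below. It assumes the linear-Gaussian setting described above: a Gaussian prior on X_0 with diagonal covariance `Sigma_D`, dynamics X_1 = A X_0 + E_1, and noise covariance `Omega`. Under these assumptions the posterior precision of X_0 given X_1 is the prior precision plus A^T Omega^{-1} A. The function name is hypothetical and the sub-system case is not covered.

```python
import numpy as np

def conditional_entropy_x0_given_x1(A, Omega, Sigma_D):
    """Entropy (nats) of X_0 given X_1 = x for a linear-Gaussian model.

    Assumes X_0 ~ N(mean, Sigma_D) and X_1 = A X_0 + E with E ~ N(0, Omega).
    For Gaussian variables the result does not depend on the value x.
    """
    n = A.shape[0]
    # Posterior precision: prior precision plus the likelihood contribution.
    precision = np.linalg.inv(Sigma_D) + A.T @ np.linalg.solve(Omega, A)
    _, logdet = np.linalg.slogdet(precision)
    return 0.5 * n * np.log(2 * np.pi * np.e) - 0.5 * logdet
```

With A set to zero, X_1 carries no information about X_0 and the function returns the entropy of the prior; any non-zero coupling lowers the value.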

Analytical computation of ΦE for a general Gaussian case
Here we show how to compute ΦE analytically, for a general stationary Gaussian system, for any timescale $\tau$. Importantly, the generative model for such a system $X$ is always equivalent to an MVAR($p$) process [18]:

$$X_t = A_1 \cdot X_{t-1} + A_2 \cdot X_{t-2} + \cdots + A_p \cdot X_{t-p} + E_t, \qquad (0.71)$$

where the $A_i$, $i = 1, \dots, p$, can be thought of as generalized connectivity matrices acting at different time-lags, and $E_t$ is a stationary multivariate Gaussian 'white noise' source with zero mean and vanishing auto-covariance function, $\Gamma_\tau(E) = 0$, $\tau \neq 0$. The process is stationary provided that the roots of the characteristic polynomial $\det(I - A_1 z - \cdots - A_p z^p)$ lie outside the unit circle [33]. The method outlined in 'Computing ΦE analytically for a Gaussian system' for computing ΦE with $\tau = 1$ for an MVAR(1) process is easy to extend to the more general MVAR($p$), any $\tau$, case given by equation (0.71). Suppose we wish to compute ΦE for any value of $\tau$ up to $\tau = q$, where $q > p$. We first use the fact [33] that the MVAR($p$) process is equivalent to the MVAR(1) process involving the block quantities $J_t := (X_t, X_{t-1}, \dots, X_{t-q})^T$ and the correspondingly enlarged connectivity and noise-covariance matrices. The stationary covariance matrix $\Sigma(J)$ for this process can be obtained from the Lyapunov equation, by analogy with $\Sigma(X)$ for