Information-theoretic analyses of neural data to minimize the effect of researchers’ assumptions in predictive coding studies

Studies investigating neural information processing often implicitly ask two questions: which of several alternative processing strategies is used, and how this strategy is implemented in neural dynamics. A prime example are studies on predictive coding. These often ask whether confirmed predictions about inputs or prediction errors between internal predictions and inputs are passed on in a hierarchical neural system, while at the same time looking for the neural correlates of coding for errors and predictions. If we do not know exactly what a neural system predicts at any given moment, this results in a circular analysis, as has rightly been criticized. To circumvent such circular analysis, we propose to express information processing strategies (such as predictive coding) by local information-theoretic quantities, such that they can be estimated directly from neural data. We demonstrate our approach by investigating two opposing accounts of predictive coding-like processing strategies, where we quantify the building blocks of predictive coding, namely predictability of inputs and transfer of information, by local active information storage and local transfer entropy. We define testable hypotheses on the relationship of both quantities, allowing us to identify which of the assumed strategies was used. We demonstrate our approach on spiking data collected from the retinogeniculate synapse of the cat (N = 16). Applying our local information dynamics framework, we are able to show that the synapse codes for predictable rather than surprising input. To support our findings, we estimate quantities applied in the partial information decomposition framework, which allow us to differentiate whether the transferred information is primarily bottom-up sensory input or information transferred conditionally on the current state of the synapse. Supporting our local information-theoretic results, we find that the synapse preferentially transfers bottom-up information.


Introduction
Predictive coding as a theory arguably dominates today's scientific discourse on how the cortex works [1][2][3]. Importantly, it is positioned as a theory of general cortical function; yet, empirical tests so far are limited to situations with an explicitly predictive experimental context, simply to allow for a meaningful analysis. In other words, to find and understand the neurophysiological correlates of predictions and errors, experiments posit a priori when and what is being predicted in which brain region. There are three problems with this approach: first, knowing what is being predicted when and where in the brain seems to require already a fair understanding of how the brain, or the cortex, works, which may not generally be available yet. Second, trying to acquire some of the necessary knowledge post hoc runs the real risk of involuntarily producing a circular analysis or argument, or a "just-so story" (as it is called, e.g., in section 4.1 of [4]). Third, restricting empirical tests of a general theory to experimental contexts that are explicitly designed with predictions in mind, in a strict sense, prohibits conclusions about the applicability of that theory in other contexts. One might provocatively frame this third problem as: "Is the cortex doing predictive coding when we don't test it?" Last but not least, the latter restriction to dedicated experimental designs excludes testing (and possibly refuting) the theory by drawing on the vast majority of empirical neurophysiological data, i.e., all data that were obtained with a focus on descriptions of cortical function(s) other than predictive coding, which seems like a waste of available empirical evidence.
In this paper, we introduce an information-theoretic framework for testing predictive coding theories by translating the concepts of predictability, predictions, and prediction errors (surprise) into information-theoretic quantities measurable from data. Based on these information-theoretic formulations, we describe how to test by purely information-theoretic means whether a neural processing element (a neuron, or a small circuit) takes part in a predictive coding-like computation. Our novel method is in principle applicable without knowledge of the intentions of the experimenter, given some weak constraints on the data themselves. Our method rests on the simple idea that a neural processing element that codes for prediction errors should exhibit high information transfer at moments when its input is surprising (i.e., fundamentally unpredictable), and vice versa [5]. On the other hand, a processing element coding for the predictable information in its inputs should exhibit high information transfer at times when this predictable information is high.
In the following we formalize this idea in the mathematical context of local information dynamics [6] and partial information decomposition [7].

Local information dynamics
Local information dynamics [6] is a relatively recent subfield of information theory, and the inspection of local information quantities is not yet widely applied; we therefore (re-)introduce these concepts in some detail. In this exposition we try to balance a concise and intuitive presentation of the material with mathematical rigor, necessarily sacrificing some of the latter. For a more detailed introduction see [5,6].

Local mutual information
For the purpose of this study, it is best to understand the mutual information, I(X : Y), between two random variables by stating that if one variable X has information about another variable, Y, then X and Y cannot be statistically independent. This point of view will help to understand why each individual term, \log \frac{p(x,y)}{p(x)p(y)}, that contributes to the summation in the mutual information,

  I(X : Y) = \sum_{x,y \in A_{X,Y}} p(x, y) \log \frac{p(x, y)}{p(x)p(y)},   (1)

can be soundly interpreted on its own. By x \in A_X and y \in A_Y we denote individual realizations of the random variables X and Y, and we write p(x) as a shorthand for the probability p(X = x). We start our explanation with the definition of statistical independence of X and Y:

  p(x, y) = p(x)p(y) \quad \forall (x, y) \in A_{X,Y},   (2)

meaning that this equation must hold for all pairs of realizations (x, y). If the above equation is violated for any pair (x, y), then this pair contributes to a deviation from independence. As per our initial statement on the relation of independence and information, this pair then also contributes to the information that X holds about Y, and vice versa.
To measure how much independence is violated locally by the pair (x, y), we can take the ratio of both sides of Eq 2. Independence, or the absence of mutual information, is then equivalent to:

  \frac{p(x, y)}{p(x)p(y)} = 1 \quad \forall (x, y) \in A_{X,Y}.   (3)

A deviation of this ratio from 1 for any pair (x, y) indicates a deviation from independence, i.e., the presence of information in the realization of one variable about the realization of the other. Obviously, one would like a measure of this information itself to be zero in the absence of information, i.e., at independence. This can be achieved by taking the logarithm of Eq 3:

  \log \frac{p(x, y)}{p(x)p(y)} = 0 \quad \forall (x, y) \in A_{X,Y}.   (4)

Inspecting Eq 4 and comparing it to the definition of the mutual information in Eq 1, we now see that the mutual information is nothing but the weighted average of the individual deviations from independence, measured on a logarithmic scale. More importantly, the above derivation of the mutual information demonstrates that each individual term has a well-defined and interpretable meaning. These individual terms define the local mutual information in a pair of realizations:

  i(x : y) = \log \frac{p(x, y)}{p(x)p(y)} = \log \frac{p(x|y)}{p(x)}.   (5)

Analogously, the local conditional mutual information between two variables in the context of a third is given by:

  i(x : y|z) = \log \frac{p(x, y|z)}{p(x|z)p(y|z)} = \log \frac{p(x|y, z)}{p(x|z)}.   (6)

We note that the local interpretation introduced here is closely related to the way Fano originally derived the mutual information [8]. In addition, we note that the local mutual information and the local conditional mutual information can be negative, in contrast to the (average) mutual information, which is always positive or zero. We will explain this fact in detail, and make use of it, further below.
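The local mutual information defined above can be computed directly from observed symbol pairs by plugging in relative frequencies. The following sketch (function and variable names are our own, and simple plug-in estimates are used for illustration only) shows that the local values average to the mutual information while individual local values may be negative:

```python
import numpy as np

def local_mi(x, y):
    """Local mutual information i(x:y) = log2 p(x,y)/(p(x)p(y)) for each
    observed sample pair, using plug-in (relative-frequency) estimates."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    # marginal and joint relative frequencies
    p_x = {v: np.mean(x == v) for v in np.unique(x)}
    p_y = {v: np.mean(y == v) for v in np.unique(y)}
    p_xy = {}
    for xv, yv in zip(x, y):
        p_xy[(xv, yv)] = p_xy.get((xv, yv), 0.0) + 1.0 / n
    return np.array([np.log2(p_xy[(xv, yv)] / (p_x[xv] * p_y[yv]))
                     for xv, yv in zip(x, y)])

# Two binary variables that agree on 6 of 8 samples: agreeing pairs yield
# positive local MI, the two disagreeing pairs yield negative local MI,
# and the average of the local values recovers the (positive) mutual information.
x = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])
lmi = local_mi(x, y)
```

Averaging `lmi` gives the plug-in estimate of the mutual information, while inspecting individual entries reveals which realizations locally violate independence.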

Local active information storage as locally predictable information
Using the local mutual information defined above, we can quantify how much of the information contained in a process (e.g., a neural signal) at the present moment t is predictable from its past. We assume that such a process denotes an ordered collection of random variables, X = {X_t}, with realizations x_t \in A_{X_t}. We then quantify how predictable a single realization, x_t, is from its past as:

  lAIS(x_t) = i(x_t : x^-) = \log \frac{p(x_t | x^-)}{p(x_t)},   (7)

where lAIS is shorthand for the local active information storage [9], and x^- is a realization of the (possibly infinite) past of the process up to time t (Fig 1, A). As already mentioned above, the local mutual information forming the local active information storage need not be positive, i.e.:

  lAIS(x_t) < 0 \iff p(x_t | x^-) < p(x_t).   (8)

This means that a negative lAIS indicates that the x_t that actually happened was less expected to happen, given the information in the past of the process, x^-, than it was expected to happen without this information about the past. Put differently, the past x^- mispredicted the actual value of x_t by allocating the probability mass originally contained in p(x_t) to other values in the conditional distribution. Since we assume that all probabilities, including the conditional probabilities above, are computed properly, this is a necessary misprediction, not one that could have been avoided. In other words, negative lAIS indicates unpredictable behavior of the process at time t. If we think of the overall process X as the input to a neuron, then negative lAIS at time t means that the neuron must mispredict at t, based on the past of its inputs.
As lAIS is a relatively novel concept in the analysis of neural information processing, a few words on its interpretation and validity in a biological context are necessary. In particular, it is important to consider potential sources of predictability and in what respect this origin of predictability matters. There are in general two possible sources of predictability in a neural spike train: first, predictability arising from temporal statistical dependencies in the input, and second, internally generated temporal dependencies. Does one of these matter more than the other? To answer this question, it makes sense to take the point of view of the neuron receiving the spike train. This neuron, throughout its existence, has received nothing but the incoming spikes, and has no access to any "ground truth" about the outside world and temporal regularities in this outside world. So from a neural information processing perspective, the above distinction must vanish. Thus, for the analysis framework presented here, it does in principle not matter whether the stimulus input to the retina was predictable or not; all that matters is whether the incoming spike train from the retina was predictable at the level of the LGN. So from a neuro-centric perspective, an analysis without the stimulus properties seems to us to reflect the circumstances a neuron finds itself in.
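The lAIS definition above translates almost literally into a plug-in computation. As a minimal sketch (our own function names; a finite past embedding of length k replaces the possibly infinite past, and no bias correction is applied), consider a perfectly periodic binary "spike train", for which every time step is fully predictable:

```python
import numpy as np
from collections import Counter

def local_ais(x, k=2):
    """Local active information storage lAIS(x_t) = log2 p(x_t|x_past)/p(x_t),
    with a length-k past embedding and plug-in probability estimates."""
    x = np.asarray(x)
    n = len(x)
    pasts = [tuple(x[t - k:t]) for t in range(k, n)]
    presents = list(x[k:])
    m = len(presents)
    c_x = Counter(presents)             # counts for p(x_t)
    c_p = Counter(pasts)                # counts for p(x_past)
    c_joint = Counter(zip(presents, pasts))  # counts for p(x_t, x_past)
    return np.array([np.log2((c_joint[(xt, xp)] / c_p[xp]) / (c_x[xt] / m))
                     for xt, xp in zip(presents, pasts)])

# A strictly alternating sequence is fully predictable from its past:
# every local value equals one bit of stored information.
x = [0, 1] * 20
lais = local_ais(x, k=2)
```

For irregular empirical spike trains, the same computation yields a mixture of positive values (predicted spikes or silences) and negative values (mispredictions), which is exactly the signal the LSTC analysis below exploits.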
As we will also see below, the predictability mainly resided in the inter-spike interval (ISI) distribution of the incoming RGC spikes. However, this is not problematic, as predictability has to reside somewhere in the physical and statistical properties of the spike train. This is because, in some sense, information theory is "only" a summary statistic of some probability distribution (here, the joint distribution of past and present spiking activity). Yet, it is a very special summary statistic in the sense that it truly quantifies the relevant amount of information that is stored in or transmitted by a signal, and the respective information-theoretic measures are the only consistent measures of this information. In this sense, information theory simply puts an "information processing price tag" on some biophysical process. In other words, these methods do not learn anything not already contained in the respective probability distributions, but using information theory the information processing hidden in these distributions becomes visible and quantifiable.

Fig 1. Overview of the analysis approach. A: Information-theoretic measures of predictability and information transfer: active information storage (AIS) quantifies the predictability of a process's current state x_t from its immediate past x^-; transfer entropy (TE) quantifies the information transfer from a source process X to a target process Y by quantifying the predictability of the target's current state, y_t, from the source's past, x^-, in the context of the target's immediate past, y^-. B: Local storage-transfer correlations (LSTC) relating local AIS (lAIS) as a measure of predictability and local TE (lTE) as a measure of information transfer: if a neuron codes for predictable input, a positive correlation is expected; if the neuron codes for unpredictable input, a negative correlation is expected (adapted from [5]). C: Realizations of predictive coding in the cortex (adapted from [10]): bottom-up sensory input (dotted arrows) is compared to predictions propagated in the top-down direction from a hierarchically higher cortical level (solid arrows) that represent the current prior about the input (white bars). See main text. D: Physiology of the retinogeniculate synapse and recording sites [11,12]: recordings were collected from in- and outputs to the synapse between retinal ganglion cells (RGC) and layer A principal cells (PC) in the lateral geniculate nucleus (LGN). We estimated local active information storage (lAIS, blue arrow) within the synapse input, and local transfer entropy (lTE, red arrow) between in- and output of the synapse. Schematic representations of known connections of PC and RGC are shown in grey (round markers indicate synapses): excitatory cells in layer 6 of primary visual cortex (V1) form feedback connections with LGN PC and also project to LGN inhibitory interneurons (int) and the perigeniculate nucleus (PGN). Interneurons provide inhibitory input to LGN PC: intrageniculate interneurons (int) mediate feed-forward inhibition from RGC cells, while PGN cells provide recurrent inhibition [11,13]; PGN interneurons further form reciprocal, inhibitory connections amongst each other (dashed line).
After defining predictability formally via lAIS and arguing for the applicability of this new concept, all that is left to measure is whether the neuron transfers mispredicted information onwards (as in coding for prediction errors) or whether it does not transfer information at those moments when the input is mispredicted (as in coding for the predictable input information only). A measurement of how much information is transferred by a neuron at each moment in time is given by the local transfer entropy [14].

Measuring transmitted information as local transfer entropy from inputs to output
The information flowing from an input process (e.g., the inputs to a neuron), X, to an output process, Y (e.g., the output of a neuron), at any moment in time, t, is given by the local transfer entropy [14] (Fig 1, A):

  lTE(x^- \to y_t) = i(x^- : y_t | y^-) = \log \frac{p(y_t | x^-, y^-)}{p(y_t | y^-)},   (9)

which quantifies the information transferred from the input's past state, x^-, about the present state in Y, y_t, in the context of Y's immediate past state, y^-. Again, the lTE can be negative; in this case the negativity indicates that there is information in the output that is unexpected, given the past input from X, i.e., negative lTE indicates that the y_t that actually happened was less expected to happen, given both the information in its own immediate past, y^-, and the past of the source process, x^-, than without the information in x^-. In other words, the past x^- mispredicted the actual value of y_t given the information obtained from y^-. Similar to lAIS, this misprediction quantifies behavior of the process Y at time t that is unpredictable from the source's past, in the context of the target's own past.

Using both measures, lAIS and lTE, we are now able to quantify locally how predictable the current state of a single process is from its own past, and how much information is transferred from one process to another. Again, we note that the information-theoretic quantities in question must necessarily be driven by stimulus properties and biophysics. Thus, the information-theoretic measures do not tell us something entirely removed from other descriptions of neural activity. Yet, only their use allows for a rigorously quantitative interpretation of the biophysical observations in terms of neural information processing. In other words, a rigorous quantitative analysis of neural information processing necessitates the use of proper information-theoretic measures.
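Like lAIS, the local TE has a direct plug-in form. The sketch below (our own function names; short fixed embeddings, no bias correction or statistical testing) uses a target that simply copies its source with a one-step delay, so that all information in the target's present comes from the source's past:

```python
import numpy as np
from collections import Counter

def local_te(x, y, k=1, l=1):
    """Local transfer entropy lTE(t) = log2 p(y_t|x_past, y_past)/p(y_t|y_past)
    with length-l source and length-k target embeddings (plug-in estimates)."""
    x, y = np.asarray(x), np.asarray(y)
    d = max(k, l)
    trip = [(y[t], tuple(x[t - l:t]), tuple(y[t - k:t]))
            for t in range(d, len(y))]
    c_yxy = Counter(trip)                          # (y_t, x_past, y_past)
    c_xy = Counter((xp, yp) for _, xp, yp in trip)  # (x_past, y_past)
    c_yy = Counter((yt, yp) for yt, _, yp in trip)  # (y_t, y_past)
    c_y = Counter(yp for _, _, yp in trip)          # (y_past)
    return np.array([np.log2((c_yxy[(yt, xp, yp)] / c_xy[(xp, yp)])
                             / (c_yy[(yt, yp)] / c_y[yp]))
                     for yt, xp, yp in trip])

# Target copies the source with a one-step delay (y_t = x_{t-1}); the average
# local TE should then approach one bit per time step for a fair binary source.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, 500)
y = np.roll(x, 1)
lte = local_te(x, y, k=1, l=1)
```

On real spike trains, negative entries of `lte` mark moments where the source's past mispredicted the target's present, which is the quantity related below to prediction-error coding.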
Next, we detail how relating both measures enables us to address the important issue of predictive coding.

Local storage-transfer correlations (LSTCs) as an indicator for predictive processing
Using both lAIS on the input to a neuron and the lTE between its input and its output, we can now assess whether a neuron performs a predictive coding-like computation [5,15] (Fig 1, B and C). More specifically:
• If a neuron codes for the predictable parts of its input, it should have a highly positive local information transfer lTE from input to output at moments t when the predictability of the input, as measured by lAIS, is high. That is, the correlation between lTE and lAIS should be positive.
• If, in contrast, a neuron codes for the unpredictable parts of its input, then local information transfer lTE should be highly positive at moments t when the predictability of the input, as measured by lAIS, is very low or even negative. That is, the correlation between lTE and lAIS should be negative.
The first variant of predictive coding theory has been proposed, for example, in adaptive resonance theory (ART) [16,17] or the biased competition model [18,19], which assume that the signaling of bottom-up sensory evidence is facilitated for sensory input that matches the current predictive model (Fig 1, C). The second variant has been proposed, for example, in [20][21][22], where it is suggested that bottom-up signals represent prediction errors, i.e., sensory input that is not predicted by the current internal model (Fig 1, C).
Both variants have been shown to be functionally equivalent, such that an implementation of predictive coding can be achieved by either [23] (see also [24][25][26], and the Discussion section). We want to highlight that the assessment of predictive coding using the local storage-transfer correlation (LSTC, Fig 1, B) described above requires only minimal knowledge about the experiment that provided the data: it is sufficient to know how to properly assess the probability distributions involved in the estimation of lAIS and lTE. The approach is thus applicable to data from a vast range of experiments, including those not specifically designed with predictive coding in mind. Most importantly, this approach does not require knowledge of what the brain or a neuron should predict.
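The LSTC itself reduces to the sign of a correlation between the two local time series. A toy sketch (synthetic local values and our own function name; plain Pearson correlation is used here for simplicity, without assuming this is the correlation measure applied to the empirical data):

```python
import numpy as np

def lstc(lais, lte):
    """Local storage-transfer correlation: the sign of the correlation between
    lAIS of the input and lTE from input to output distinguishes the two
    coding hypotheses (Pearson correlation as a simple stand-in)."""
    lais, lte = np.asarray(lais, float), np.asarray(lte, float)
    return np.corrcoef(lais, lte)[0, 1]

# Synthetic local values for illustration (not estimated from data):
lais = np.array([1.2, 0.8, -0.5, 1.0, -0.9, 0.6])
# Hypothesis 1: transfer is high when the input is predictable ...
lte_pred = np.array([0.9, 0.7, -0.2, 0.8, -0.4, 0.5])
# ... Hypothesis 2: transfer is high when the input is surprising.
lte_err = -lte_pred
```

Here `lstc(lais, lte_pred)` is positive (coding for predictable input) while `lstc(lais, lte_err)` is negative (coding for prediction errors), mirroring the two bullet points above.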

Partial information decomposition as a measure of state-dependent and -independent information transfer
Before we go on to describe how to estimate LSTC from data, we want to introduce the framework of partial information decomposition (PID) [7,[27][28][29][30] as a second tool to investigate predictive coding in neural processing. PID is a recent extension to classical information theory. Amongst other applications in neuroscience (e.g., [5,[31][32][33]), PID allows us to decompose information transfer measured by transfer entropy (TE) into contributions that are reflective of the calculation of a "generalized prediction error" versus contributions that indicate a relaying of predictable information.
PID describes how two or more source variables provide information about a target variable, where each source may provide unique information (information that is only available from this particular source), redundant information (information that is available redundantly from two or more sources), and synergistic information (information that is only available when considering two or more input variables together) [7]. Note that in this study, we apply a non-localized measure of PID [34] (discussed in detail below), and therefore refer to averaged quantities only (for first proposals of localized PID measures see [29,35]).
To illustrate how PID can be used to decompose TE [27], we first take a closer look at the calculation of the TE as the conditional mutual information I(Y_t : X^- | Y^-). Here, conditioning on the target's past state, Y^-, influences the information the input's past state, X^-, provides about Y_t in one of the following manners:
• in the context of Y^-, X^- may provide less information about Y_t, such that I(Y_t : X^- | Y^-) < I(Y_t : X^-);
• in the context of Y^-, X^- may provide more information about Y_t, such that I(Y_t : X^- | Y^-) > I(Y_t : X^-);
• there may be no change in the information provided by X^- about Y_t, such that I(Y_t : X^- | Y^-) = I(Y_t : X^-), i.e., Y's past is independent of X^- and Y_t, and knowing Y's past does not influence the information we obtain from X^- about Y_t.
These changes in information contribution may be decomposed and quantified using PID terms [27]. The first case may be interpreted as a scenario in which information about Y_t is redundantly present in both past states, X^- and Y^-, such that by conditioning on Y^- this redundant information is "removed" from the information X^- provides about Y_t. The second case describes scenarios in which both past states provide synergistic information about Y_t, which is "added" to the information X^- provides uniquely about Y_t. Note that both redundant and synergistic information contributions can be simultaneously present in the interaction of two variables with respect to a third. The third case can be loosely thought of as the information X^- entering Y_t being both independent of Y^- and encoded into Y_t independently of Y^-; thus, it reflects a unique information transfer from X^- to Y_t. In sum, when calculating TE, i) we remove redundant information in X^- and Y^- about Y_t, ii) we measure synergistic information jointly present in X^- and Y^- about Y_t [27], and iii) we measure the information provided uniquely by X^- about Y_t.

State-dependent transfer entropy as generalized prediction error
PID allows us to decompose information transfer from a cell's inputs, X^-, to its output, Y_t, conditional on the target's past, Y^-, into different contributions: we can quantify the information uniquely provided by X^- about Y_t, independently of Y^-, also termed state-independent TE [27], and we can quantify the information provided by X^- about Y_t synergistically with Y^-, i.e., dependent on the state of Y^-, also termed state-dependent TE [27]. One may think about the latter case as the target's past state "decoding" the information transferred from the source to the target.
The computation of state-dependent TE, i.e., the synergistic part of the TE, is of particular relevance here, as the synergy reflects the computation of a "generalized prediction error" from the past state of the target cell (the prediction) and the past state of the input (the sensory evidence), and the error's transfer by the target neuron. This can best be seen by considering that the computation of a binary error (e.g., in a spiking neuron) is analogous to the XOR operation and that this operation leads to synergistic information: here, knowing only one input is not sufficient to know what the output of the system should be; this is only possible if both inputs are considered at once.
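This XOR argument can be checked numerically. In the sketch below (plug-in MI function and variable names are our own), neither input alone carries any information about the binary error, while the two inputs jointly determine it completely, which is the signature of purely synergistic information:

```python
import numpy as np

def mi_discrete(a, b):
    """Plug-in mutual information I(A:B) in bits between two discrete arrays."""
    a, b = np.asarray(a), np.asarray(b)
    total = 0.0
    for av in np.unique(a):
        for bv in np.unique(b):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                total += p_ab * np.log2(
                    p_ab / (np.mean(a == av) * np.mean(b == bv)))
    return total

# All four equally likely input combinations of a binary error computation:
x1 = np.array([0, 0, 1, 1])   # e.g., the prediction
x2 = np.array([0, 1, 0, 1])   # e.g., the sensory evidence
y = x1 ^ x2                   # the "error": 1 iff prediction and evidence disagree
# Neither input alone is informative: mi_discrete(x1, y) == mi_discrete(x2, y) == 0.
# Jointly, they determine the error completely:
joint = x1 * 2 + x2           # encode the input pair as a single variable
# mi_discrete(joint, y) == 1 bit, all of it synergistic.
```

This is the sense in which a high synergistic (state-dependent) TE component indicates an error-like computation by the target cell.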

Partial information decomposition measures and estimation
The PID framework [7] extends classical information theory to allow for the decomposition described in the last section by proposing a set of new axioms. However, the exact choice of PID axioms, which allows the definition of appropriate measures, is still an active area of research at the time of writing [28-30, 34, 36, 37]. To estimate PID, we here follow a proposal by Bertschinger et al. [34] that is based on the estimation of unique information, which is grounded in the assumption that, in a suitably chosen decision problem, exploiting the unique information should yield a measurable advantage.
An implementation for estimating Bertschinger et al.'s measure [34] has been proposed in [38,39] and is available as part of the IDTxl toolbox [40].

Estimation of information-theoretic quantities from data
Typically, in experimental neuroscience, the probability distributions underlying observed data, which are necessary to calculate the quantities introduced above, are unknown and have to be estimated from data. In this section, we therefore introduce the estimation of local information-theoretic quantities and PID terms from data.
The most straightforward approach to estimating (conditional) mutual information from discrete data (Eq 7 and Eq 9) is to replace probability mass functions by the relative frequencies of symbols observed in the data [41]. These so-called "plug-in estimators" are well known to exhibit negative bias for finite data, for which analytic bias-correction procedures exist [42,43]. These bias-correction approaches, formulated for non-local variants of mutual information, may be adapted for use with localized measures to obtain locally bias-corrected estimators of lAIS and lTE (see supporting information S1). Furthermore, statistical testing against estimates from surrogate data may be applied to handle estimator bias [44,45], by treating the estimate as a test statistic that is compared against a null distribution generated from estimates on surrogate data.
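A minimal sketch of such a surrogate test follows (our own simplified version: the source series is fully permuted, which destroys its temporal relation to the target while preserving its marginal statistics; for strongly autocorrelated data, block-wise or cyclic-shift surrogates are usually preferable):

```python
import numpy as np

def surrogate_p_value(estimate_fn, x, y, n_perm=200, seed=0):
    """Permutation test: compare the observed estimate against a null
    distribution built from estimates on shuffled-source surrogate data."""
    rng = np.random.default_rng(seed)
    observed = estimate_fn(x, y)
    null = np.array([estimate_fn(rng.permutation(x), y)
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

def plug_in_mi(a, b):
    """Toy plug-in MI estimator for binary arrays (bits)."""
    total = 0.0
    for av in (0, 1):
        for bv in (0, 1):
            p_ab = np.mean((a == av) & (b == bv))
            if p_ab > 0:
                total += p_ab * np.log2(
                    p_ab / (np.mean(a == av) * np.mean(b == bv)))
    return total

# Strongly coupled toy pair: the observed MI far exceeds the surrogate null,
# so the p-value is small despite the positive bias of the plug-in estimates.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, 300)
y = x.copy()
p = surrogate_p_value(plug_in_mi, x, y)
```

The same scheme applies to AIS and TE estimates; only the estimator function and the surrogate generation change.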
Before applying estimators, past states of the time series involved have to be defined. In theory, both AIS and TE quantify the information contained in the semi-infinite past of a time series up to, but excluding, time point t. In practice, few observed systems actually retain information for an infinitely long time, such that most information is contained within the immediate past of the present system state, X_t [46,47]. Hence, we can define an "embedding" of the time series, X_S, i.e., a collection of past variables up to a maximum lag, selected such that the embedding is maximally informative about X_t. In mathematical terms, we define the embedding, X_S, such that the Markov property is fulfilled for all X_t; in other words, X_t becomes conditionally independent of all variables prior to X_S. Several approaches for defining such an embedding exist. We here propose the use of a non-uniform embedding [48,49], which selects variables from a set of past candidate variables, C, such that X_S becomes maximally informative about X_t. A suitable algorithm that handles the computational complexity of selecting this set of variables is a greedy forward-selection strategy that maximizes the information contained in the variable set with respect to X_t, using the CMI as selection criterion:

  C^* = \arg\max_{C \in C_i} I(X_t : C | X_S^i),

where X_S^i is the set of variables already selected in the i-th step of the algorithm and C are candidate variables from the set of candidates C_i. The candidate set is defined as a collection of past variables up to time point t, C = {X_{t-l}, X_{t-l-1}, . . ., X_{t-k}}, where l and k denote the minimum and maximum lag with respect to t, respectively.
Surrogate testing is used to evaluate whether the selected variable, C^*, provides additional information about X_t, by testing whether I(X_t : C^* | X_S^i) is statistically significant. If so, C^* is included in the embedding and removed from the set of candidates:

  X_S^{i+1} = X_S^i \cup \{C^*\}, \quad C_{i+1} = C_i \setminus \{C^*\}.   (14)

For a detailed account of the estimation procedure, including a hierarchical statistical testing scheme that handles the family-wise error rate of the repeated testing during the iterative candidate selection, see [50] and the implementation in [40].
The greedy strategy for constructing past states can be directly applied to find a non-uniform embedding of the past of a process X, such that we can estimate lAIS as

  lAIS(x_t) = \log \frac{p(x_t | x_S)}{p(x_t)}.

For the construction of past states for lTE estimation, we first optimize the target embedding, y_S (which amounts to quantifying the active information storage in the target) [51], before optimizing the source's embedding in the context of the target embedding,

  C^* = \arg\max_{C \in C_i} I(Y_t : C | Y_S, X_S^i),

where the sets C_i and X_S^i are updated according to Eq 14. We can then estimate lTE as

  lTE(x_S \to y_t) = \log \frac{p(y_t | x_S, y_S)}{p(y_t | y_S)}.

By first optimizing the target's past state, we make sure that we account for all information Y's past provides about the current state Y_t before quantifying additional or novel information X provides about Y. This means that only information actually transferred between X and Y is taken into consideration when estimating lTE.
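The greedy forward selection can be sketched as follows (a simplified illustration with our own function names; a fixed CMI-gain threshold stands in for the hierarchical surrogate testing of [50]):

```python
import numpy as np
from collections import Counter

def cmi(x, y, z):
    """Plug-in conditional mutual information I(X:Y|Z) in bits; z holds one
    (possibly empty) tuple of conditioning values per sample."""
    n = len(x)
    c_xyz = Counter(zip(x, y, z))
    c_xz = Counter(zip(x, z))
    c_yz = Counter(zip(y, z))
    c_z = Counter(z)
    return sum((c / n) * np.log2(c * c_z[zt] / (c_xz[(xt, zt)] * c_yz[(yt, zt)]))
               for (xt, yt, zt), c in c_xyz.items())

def greedy_embedding(x, max_lag=4, min_gain=0.02):
    """Non-uniform embedding sketch: iteratively add the past variable X_{t-l}
    that maximizes the CMI with X_t given the variables already selected;
    min_gain is a crude stand-in for the statistical stopping criterion."""
    x = np.asarray(x)
    present = x[max_lag:]
    cands = {l: x[max_lag - l:len(x) - l] for l in range(1, max_lag + 1)}
    selected = []
    z = [()] * len(present)      # per-sample conditioning tuple, grows per step
    while cands:
        gains = {l: cmi(present, cands[l], z) for l in cands}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        selected.append(best)
        z = [zt + (cv,) for zt, cv in zip(z, cands.pop(best))]
    return selected
```

For a period-2 sequence the algorithm selects a single lag and stops, because all remaining candidates become conditionally uninformative once one informative lag is in the embedding; for a period-3 sequence it selects lag 3, the only fully informative past variable.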
We used a software implementation of the proposed approach provided by the IDTxl Python toolbox [40], which internally makes use of plug-in estimators implemented as part of the JIDT toolbox [52]. For bias correction, we used a Bayesian counting procedure implemented in the pyEntropy toolbox [53]. For estimation of PID measures, we used the measure by Bertschinger et al. [34] and an estimator by Makkeh et al. [38,39], which is also part of the IDTxl toolbox. Analysis code is available from [54].

Empirical data set
We demonstrate the application of the proposed local information dynamics framework on spike train recordings from the retinogeniculate synapse of the cat. Spike trains were recorded from 17 retinal ganglion cells (RGCs) and monosynaptically coupled principal cells in the lateral geniculate nucleus (LGN) [55]. We estimated lAIS in the input to the synapse, i.e., the RGC spike train, and lTE between the input and the output of the synapse, i.e., from the RGC to the LGN spike train (Fig 1, D). We calculated LSTC to test whether information was preferentially transferred whenever the input signal was predictable or whenever it was non-predictable, and we calculated the PID of the transferred information to validate our findings.
A detailed description of surgical procedures, task, and data recordings can be found in [55]. All surgical and experimental procedures were performed with the approval of the Animal Care and Use Committee at the University of California, Davis.

Surgery
For electrode placement at the RGC and LGN (Fig 1, D), adult cats of both sexes were initially anesthetized with ketamine (10 mg/kg, i.m.). For electrophysiological recordings, animals were placed in a stereotaxic apparatus and mechanically ventilated. Electrocardiogram (ECG), electroencephalogram (EEG), and expired CO2 were continuously monitored, while anesthesia was maintained with thiopental sodium (2 mg·kg−1·h−1, i.v.). Thiopental administration was increased if physiological monitoring indicated a decrease in the level of anesthesia.
Once electrodes were positioned and minimum eye movement was ensured, the animal was paralyzed using vecuronium bromide (0.2 mg ). The pupils were dilated with 1 % atropine sulfate and the nictitating membranes were retracted with 10 % phenylephrine. Flurbiprofen sodium (1.5 mg/h) was administered to ensure pupillary dilation. The eyes were fitted with contact lenses and focused on a monitor located 1 m in front of the animal.

Visual task
Visual stimuli were created with a VSG 2/5 visual stimulus generator (Cambridge Research Systems) and presented on a gamma-calibrated Sony monitor with a mean luminance of 38 cd/m². Receptive fields were mapped using a binary white-noise stimulus that consisted of a 16 × 16 grid of squares [56]. Each square flickered independently between black and white according to an m-sequence [56,57]. The monitor ran at a frame rate of 140 Hz. Approximately 4 to 16 squares of the stimulus overlapped the receptive field center of each neuron.

Electrophysiological recordings
Single units were recorded simultaneously from the RGC and from cells in layer A of the contralateral LGN. To maximize the chance that both cells were monosynaptically connected, a seven-channel multielectrode array (Thomas Recording) was placed in the LGN; through stimulation with a spot of light, the retinal area with the highest evoked response was identified. Cell responses were monitored using an audio monitor.
Neural responses were amplified, filtered, and recorded with a Power 1401 data acquisition interface and the Spike 2 software package (Cambridge Electronic Design). The spikes of individual neurons were isolated using template matching, parametric clustering, and the presence of a refractory period in the auto-correlogram.
Recordings from 17 cell pairs entered further analysis. Recordings had an average length of 788.4 s (±441.6 s SD; see supporting information S1).
To assess connectivity between recorded cells, the cross-correlogram between both recordings was visually inspected for abrupt, short-latency peaks using a bin size of 0.1 ms (see [55], Fig 1). The occurrence of such a peak was taken as evidence for a monosynaptic connection between the RGC and the LGN cell [58–60]. For each peak, a baseline mean was calculated from the bins 30 ms to 50 ms on either side of the peak bin. The peak bin and all neighboring bins with counts >3 SD above baseline were considered to contain retinal spikes triggering an LGN spike. The percentage of these spikes was termed the efficacy of the RGC [59–61]. Furthermore, an RGC's contribution was defined as the percentage of LGN spikes that were triggered by a spike in the corresponding RGC. Contribution may be interpreted as the "strength of connection" between the two cells in a pair [55]. We called an RGC spike relayed if it was followed by an LGN spike after the reconstructed information transfer delay, u (see next section).
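As an illustration (not the authors' original code), the efficacy and contribution measures can be sketched for binary spike trains on a common time grid; the function name and toy data are ours:

```python
import numpy as np

def efficacy_and_contribution(rgc, lgn, delay_bins):
    """Efficacy: fraction of RGC spikes followed by an LGN spike after
    `delay_bins`. Contribution: fraction of LGN spikes preceded by such
    an RGC spike. `rgc` and `lgn` are binary arrays on the same grid."""
    rgc = np.asarray(rgc)
    lgn = np.asarray(lgn)
    shifted = np.roll(lgn, -delay_bins)  # align LGN bin t+delay with RGC bin t
    if delay_bins:
        shifted[-delay_bins:] = 0        # discard wrap-around at the end
    relayed = rgc & shifted              # RGC spikes with a matching LGN spike
    efficacy = relayed.sum() / max(rgc.sum(), 1)
    contribution = relayed.sum() / max(lgn.sum(), 1)
    return efficacy, contribution
```

For example, with three RGC spikes of which two are followed by an LGN spike two bins later, efficacy is 2/3 and, if the LGN fired only those two spikes, contribution is 1.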
For all further analyses, recorded spike trains were binned into 1 ms segments.
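The binning step can be sketched as follows (a minimal illustration; the function name is ours, and the original analysis pipeline may differ):

```python
import numpy as np

def bin_spike_train(spike_times_s, duration_s, bin_ms=1.0):
    """Convert spike times (in seconds) into a binary array of bin_ms bins."""
    n_bins = int(np.ceil(duration_s * 1000.0 / bin_ms))
    binned = np.zeros(n_bins, dtype=np.uint8)
    idx = np.floor(np.asarray(spike_times_s) * 1000.0 / bin_ms).astype(int)
    binned[idx[idx < n_bins]] = 1  # drop spikes falling past the last bin
    return binned
```

With 1 ms bins, a spike at 3.2 ms lands in bin 3; multiple spikes in one bin collapse to a single 1.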

Estimation of lAIS and lTE from empirical data
We optimized nonuniform past-state embeddings for each cell pair recording using the greedy algorithm implemented in [40]. For lAIS estimation, we set the maximum lag, j, defining candidate variables for the embedding, to 30 ms; for lTE estimation, we set the maximum lag in the source, k, to 40 ms and the maximum lag in the target, l, to 30 ms. These lags assume that only spikes with an inter-spike interval (ISI) of 30 ms or less are relevant for triggering an LGN spike [58,59,62–65], with ISIs <10 ms being especially effective in driving LGN responses.
When optimizing the lTE target past, we additionally accounted for an information transfer delay, u, between RGC and LGN of up to 10 ms, which is in line with the cross-correlation observed between spiking in RGC and LGN [55]. We reconstructed u from the optimized embedding by identifying the lag of the past source variable with the highest information contribution to the target's current state, quantified by the conditional mutual information between that source variable and the target's current state, given the target's past state.

LSTCs calculation for empirical data
To investigate whether the retinogeniculate synapse preferentially transferred predictable or unpredictable information, we correlated sample-wise estimates of lAIS and lTE by calculating the Pearson correlation coefficient between both measures. Note that one could also calculate measures that capture higher-order relationships, e.g., the mutual information. However, since our goal was to infer whether the sign of the correlation was positive or negative, we calculated the linear correlation. Tests for statistical significance were performed using a permutation test with 1000 permutations.
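The LSTC computation described above (Pearson correlation of sample-wise lAIS and lTE plus a permutation test) can be sketched as follows; the function name and defaults are ours, not the authors':

```python
import numpy as np

def lstc(lais, lte, n_perm=1000, seed=0):
    """Pearson correlation between sample-wise lAIS and lTE with a
    two-sided permutation test for significance."""
    rng = np.random.default_rng(seed)
    lais = np.asarray(lais, float)
    lte = np.asarray(lte, float)
    r = np.corrcoef(lais, lte)[0, 1]
    null = np.empty(n_perm)
    for i in range(n_perm):
        # shuffle one series to destroy the sample-wise pairing
        null[i] = np.corrcoef(lais, rng.permutation(lte))[0, 1]
    # add-one correction keeps the p-value strictly positive
    p = (np.sum(np.abs(null) >= abs(r)) + 1) / (n_perm + 1)
    return r, p
```

Permuting one series destroys the temporal pairing between storage and transfer while preserving both marginal distributions, which is exactly the null hypothesis of no sample-wise relationship.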

Optimization of estimation parameters
For the estimation of local information-theoretic measures, we first optimized past states for lAIS and lTE estimation individually for each cell pair. Over all cell pairs, the mean lag of variables identified for the lAIS embedding was 7.63 ms (SD: 1.82 ms); for the lTE embedding, it was 2.75 ms (SD: 1.24 ms) for the source and 6.06 ms (SD: 1.53 ms) for the target. The reconstructed delay, u, between the RGC and the LGN cell was on average 2.81 ms (SD: 1.05 ms), and individual delays matched maxima in the cross-correlogram between RGC and LGN recordings. (See supporting information S1 for descriptive statistics of the data entering the analysis, and supporting information S2 for a list of all reconstructed parameters.)

LSTC
Based on the optimized past states, we estimated lAIS and lTE for all cell pairs and found significant storage and transfer in all pairs except pair 5, which was excluded from all further analyses. For the remaining cell pairs, we calculated the LSTC and found a significant, positive correlation coefficient for 14 of the remaining 16 pairs, indicating that local information transfer was higher at samples with higher local information storage (coefficients ranged from 0.0056 to 0.2675; see also supporting information S3). With respect to predictive coding strategies, the positive LSTC indicates a higher transfer of information whenever an input sample was more predictable from its past, and less transfer when it was unpredictable. We further found that correlations were stronger in cell pairs with a high RGC contribution (Fig 3, c(LSTC, contribution) = 0.6879, p = 0.0030**). In a cell pair, the RGC's contribution is defined as the percentage of spikes in the LGN cell that were triggered by a previous spike in the RGC and may be interpreted as the pair's "strength of connection" [55]. Hence, the effect of predictable information being relayed across the retinogeniculate synapse was more pronounced in synapses that were more strongly connected. Estimates of the contribution of each cell pair were taken from [55].

Information dynamics of relayed and non-relayed RGC spikes
For all observed pairs, typically only a fraction of RGC spikes was relayed to the LGN, i.e., followed by an LGN spike at the reconstructed information transfer delay, u. The remaining RGC spikes were considered non-relayed. We investigated whether relayed RGC spikes differed in their local information dynamics from non-relayed spikes.
On average, relayed RGC spikes were accompanied by higher lAIS and lTE compared to non-relayed spikes (Fig 2 and Fig 4). These results indicate, first, that relayed spikes were in general more predictable from the RGC's immediate past spiking behavior. Second, relayed spikes were accompanied by higher local information transfer, while non-relayed spikes were accompanied by negative local information transfer. Negative lTE here means that, for some cell pairs, in the absence of an LGN spike, the RGC's state (spike) was misinformative about the next state of the LGN (no spike). In other words, observing a prior RGC spike lowered the probability of observing no spike in the LGN.
Relayed spikes were thus characterized by both higher lAIS and higher lTE values. We were able to classify above chance whether a spike was relayed from its lAIS value, using a k-nearest-neighbor classifier with k = 5 (classification accuracy was also higher than that of the baseline model; see supporting information S2). However, note that lAIS may be seen as a different representation of the spiking statistics of the RGC, i.e., the number of spikes and the ISIs in a given time window (as can be seen, for example, when considering spike-triggered averages of lAIS and RGC spike counts in Fig 4). Accordingly, whether a spike was relayed could be predicted equally well from the count of all spikes up to 30 ms prior to an RGC spike, or from the RGC spike's preceding ISI (see supporting information S2). We therefore want to highlight that the estimation of lAIS provides no additional, mechanistic explanation of when a spike is relayed at the retinogeniculate synapse; rather, it provides a computational interpretation of mechanisms already known (see Discussion).
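The classification check can be illustrated with a minimal majority-vote k-nearest-neighbor classifier on one-dimensional lAIS values (a toy sketch with hypothetical data; the original analysis may have used a library implementation):

```python
import numpy as np

def knn_predict(train_x, train_y, test_x, k=5):
    """1-D k-nearest-neighbor classifier with binary labels (0/1)."""
    train_x = np.asarray(train_x, float)
    train_y = np.asarray(train_y)
    preds = []
    for x in np.asarray(test_x, float):
        nn = np.argsort(np.abs(train_x - x))[:k]  # indices of k nearest points
        votes = train_y[nn]
        preds.append(int(votes.sum() * 2 > k))    # majority vote
    return np.asarray(preds)
```

With well-separated lAIS values for relayed and non-relayed training spikes, a test value near either cluster is assigned to that cluster's label.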

Information dynamics of inter-spike intervals
Next, we investigated the local information dynamics of RGC and LGN spikes as a function of the preceding ISI, as ISIs have been reported to affect whether an RGC spike drives a response in the corresponding LGN cell [58,59,62–65]. We calculated ISIs as the differences between the spiking times of consecutive spikes in the RGC spike train (Fig 5A).
Average lAIS was positive for RGC spikes with a preceding ISI of 2 ms to 7 ms, with a maximum at 3 ms; lAIS was negative for all other investigated ISIs (Fig 5B). Hence, the most frequent ISIs led to higher predictability of the spike. When differentiating between relayed and non-relayed RGC spikes, lAIS was positive for ISIs of 1 ms to 6 ms for relayed spikes, while the range of positive lAIS values for non-relayed spikes was 2 ms to 7 ms. Overall, relayed and non-relayed spikes did not differ in lAIS as a function of ISI (Fig 5D). lTE was positive over the whole range of investigated ISIs (Fig 5C). However, when differentiating between relayed and non-relayed spikes, lTE was negative on average for all ISIs for non-relayed spikes, with a minimum at 2 ms (Fig 5E).
We further investigated RGC spike tuples, because whether an RGC spike is relayed is mostly influenced by the most recent preceding spike, while events further in the past have only minor influence [59]. Tuples are defined as two spikes with an ISI below a given threshold and a "silence time" preceding the first spike to ensure a comparable level of prior activity [59]. We here used a silence time and maximum ISI of 20 ms, which covers the maximum history length used in the estimation of lAIS and lTE (supporting information S1), such that spikes with a prior ISI >20 ms did not influence lAIS and lTE estimates.
We computed spike-triggered averages (STA) of lAIS and lTE values for spike tuples (two consecutive spikes with an ISI <20 ms; Fig 6). On average, the first spike in a tuple was associated with negative lAIS, indicating that the spike's immediate past, i.e., the silence time, was misinformative about the spike. For an ISI of 3 ms to 8 ms, the second spike was associated with increased lAIS relative to the average lAIS during the silence time, indicating high predictability from the immediate past. On average, lTE values were slightly increased for the first and second spike in a tuple, with higher values for the second spike. In sum, the predictability of an RGC spike strongly depended on prior spiking activity, with higher predictability if the ISI was between 3 ms and 8 ms.
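Tuple detection as defined above (maximum ISI and preceding silence time, both 20 ms here) can be sketched as follows; the function name is ours, and edge handling for the very first and last spikes is simplified:

```python
import numpy as np

def find_tuples(spike_times_ms, max_isi=20.0, silence=20.0):
    """Return index pairs (i, i+1) of spikes forming a tuple: ISI below
    `max_isi` and at least `silence` ms of quiescence before spike i."""
    t = np.asarray(spike_times_ms, float)
    tuples = []
    # skip the first and last spike: they lack a preceding gap / a partner
    for i in range(1, len(t) - 1):
        pre_gap = t[i] - t[i - 1]   # silence before the first tuple spike
        isi = t[i + 1] - t[i]       # ISI within the candidate tuple
        if pre_gap >= silence and isi < max_isi:
            tuples.append((i, i + 1))
    return tuples
```

For instance, in the spike train 5, 50, 55, 120, 150 ms, only the pair at 50/55 ms qualifies: it has a 45 ms silence before it and a 5 ms ISI, whereas the 120/150 ms pair exceeds the 20 ms ISI threshold.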

State-dependent and state-independent information transfer
Last, we estimated the unique information, I_unq(Y_t; X_S), and the synergistic information, I_syn(Y_t; X_S, Y_S), that the RGC's past provided about the spiking behavior of the LGN principal cell, using the measure by Bertschinger et al. [34,38] (Fig 7). In 11 out of 16 cell pairs with significant information transfer, the unique information provided by the RGC's past state, X_S, dominated the information transfer from RGC to LGN; hence, more than half of the transferred information was independent of the LGN's past state. Again, this supports the notion of information being transferred mainly in a bottom-up fashion, i.e., independently of the target cell's state.

Discussion
We introduced an information-theoretic framework for testing predictive coding strategies in neural data. The framework expresses predictive coding concepts, namely predictability, predictions, and prediction errors, in terms of information-theoretic quantities that can be estimated directly from data. Hence, the framework does not rely on markers of predictive coding that have to be defined a priori, but on properties of the data itself. As a result, the framework can be used to investigate neural processing strategies in any data set, independently of the experimental task under which the data were collected.
We applied the framework to spike recordings from the retinogeniculate synapse of the cat and identified the preferred coding strategy of the synapse, namely the transfer of predictable over unpredictable input. In particular, we showed that the RGC-LGN synapse preferentially transferred information whenever the input from retinal cells to LGN cells was highly predictable from its immediate past. This conclusion is supported by the finding that the transferred information was predominantly bottom-up input to the synapse and was independent of the state of the LGN principal cell.

Avoiding circular arguments in the investigation of predictive coding theory
The aim of this study was to develop an information-theoretic framework that allows predictive coding theories to be tested in arbitrary recordings of neural data, that is, recordings performed in setups not specifically tailored to investigating predictive coding. One key motivation for such a framework was to avoid circular arguments, where the researcher's assumptions about which inputs the brain should predict are used to design stimuli and paradigms, which are then used in neurophysiological experiments to test whether and how the brain predicts these inputs. As a result, the experimenter's interpretation of neural activity becomes dependent on the a priori defined theory motivating the experimental setup and on the expected manifestation of a prediction error in the data (see [4,66–68] for other descriptions of this problem). This motivation stems not least from the difficulties we experienced ourselves when designing and performing predictive coding experiments and interpreting the data (e.g., in [69]). However, we do not want to suggest that all experiments necessarily suffer from these difficulties.
Rather, we fully acknowledge that in some cases the knowledge necessary to carefully design non-circular predictive coding experiments will be available. However, we are concerned that this is not the case in general.
The problem of "interpreting" neural activity in terms of predictions or prediction errors becomes even more severe if arbitrary processing elements in the cortex are investigated in isolation, e.g., single cells, whose computations, as well as input and output are far removed from any human-understandable function (i.e.we have to treat it as "intrinsic computation" in the sense of [70]).Here, an approach is required that analyzes signals not from an "experimenter-as-receiver", but a "cortex-as-receiver" or "neuron-as-receiver" point of view [71].The former view assumes that neural signals at arbitrary processing stages carry human-understandable information-which may express a misleading view on information processing in the brain in general-, while the latter view considers the question of how other processing units in the cortex view available information.
Here, local information dynamics provides a tool that allows for the direct investigation of computations performed by arbitrary processing units. By expressing the building blocks of predictive coding theories, namely prediction, information transfer, and prediction errors, in terms of information-theoretic quantities, these concepts become measurable properties of the data, such that we were able to formulate testable hypotheses on how these quantities should relate under two opposing predictive coding strategies. Our approach thereby did not rely on an interpretation of the recorded data in terms of the experimental task performed (compare, for example, previous studies investigating the facilitation of predicted input over the propagation of prediction errors [18,72–74]). Rather, using information-theoretic measures allowed us to "abstract away" from the experimental setup, such that predictive coding could be investigated independently of the task. The presented approach not only allows for straightforward testing of competing information-processing strategies in arbitrary neural systems, but also opens up the possibility of testing predictive coding theories on data from the myriad of neurophysiological experiments not initially designed with predictive coding in mind.
We acknowledge that the proof-of-principle analysis presented here is an extreme case of application: the random stimuli lack all predictability, except for the short lifetime of a video frame. Thus, any predictability in the spike time series is internally generated in the retina. As a consequence, one may ask whether our analysis, although technically sound, is truly relevant, and whether the example analyzed here supports a more general applicability. We think this is indeed the case, for two reasons: first, lAIS must certainly rise further in all processing stages close to a predictable stimulus, compared to the unpredictable one used here, thus increasing the signal-to-noise ratio relevant for its estimation; second, a neuron receiving a spike train will find it essentially impossible to distinguish internal from external sources of predictability in that spike train, at least without resorting to information from other (neural) channels. This problem of exploiting information from additional channels is indeed important in predictive coding, and how such side-channel information can be integrated into our analysis framework is discussed further below.

Quantifying the predictability of neural signals as a proxy for predictions
When investigating predictive coding strategies at the retinogeniculate synapse, we quantified the self-predictability of the input signal to the synapse, in order to determine what portion of the signal was predictable from its own immediate past. We used this predictability as a proxy for an actual prediction of the synapse's input from all available inputs (e.g., through feedback connections from the cortex to the LGN principal cell). Using self-predictability as a proxy for quantifying predictions introduces three approximations:

The first approximation is that a mutual information is used in place of an actual predictive model for the RGC inputs embodied in the organism. This mutual information is an upper bound on the mutual information between an actual model prediction based on the RGC inputs and the realized future samples of the time series. The second approximation relates to the practical estimation of local active information storage from spike trains. With limited data, this can lead to biased results, either through overestimation in the case of severely limited data availability, or through underestimation if the analyzed history length is artificially shortened to curb 'curse of dimensionality' problems in the estimation (see [75] for a discussion of these issues). Given the large amount of data available here, we do not consider this limitation to apply. The third approximation is introduced by looking only at the information available for prediction in the input spike train of the RGC cells. It is conceivable that neurons from other brain regions with access to information arising from multiple retinal inputs across a wider range of visual field locations could improve a prediction beyond what is predictable from the RGC input itself. This is indeed a valid concern, and below we present a way of incorporating such information into the estimation of predictable information. In the present experiment, such additional influences, e.g., from cortical feedback, are likely negligible due to the anesthetized state of the animal and the random nature of the stimulus. In other scenarios, information from cortical or other side channels may have to be incorporated using Eq 19 below.
The first approximation holds under an additional assumption, namely, that we choose the past state of the input used in the estimation of AIS such that we indeed capture all relevant statistical regularities. In theory, the AIS is defined as the mutual information between a process's present state and its semi-infinite past [9]. Hence, in practice one has to find a suitable, finite embedding that covers only the relevant past [9,47,48]. Such an embedding is found through the approach used here, in which we optimized a non-uniform embedding covering a time horizon of 30 ms, identified in previous work as the horizon over which spikes affect future spiking behavior of the RGC [58,59,62–65].

Quantifying prediction errors in neural signals
We can use predictability not only as a proxy for the prediction of an input signal, but also as a proxy for inevitable prediction errors: if the predictability of the input signal, and thereby its lAIS, is low, then, according to our first approximation above, any reasonable model predicting the synapse's next state from this input must generate a prediction error, simply by virtue of reflecting the underlying probabilities.
As a second approach to quantifying prediction errors, we proposed to calculate the synergistic portion of the information transfer between the synapse's in- and output, using the recently proposed PID framework. In particular, we propose that high synergistic information between the input cell's past state and the target cell's past state about the target cell's next state reflects the transmission of a prediction error. This is because both the past state of the input cell, providing the sensory evidence, and the past state of the target cell, providing the prediction, have to contribute to the computation of the target cell's next state, if this state reflects a prediction error. In contrast, transfer of primarily bottom-up, sensory evidence is indicated by high unique information in the input cell's past state about the target cell's next state, while the synergy between both past states is low (this is discussed in more detail in section Predictive coding at the retinogeniculate synapse below).

Quantifying information transfer between neural signals
Next to predictability and prediction errors, we estimated information transfer across the synapse using TE. TE quantifies how much information is transferred from an input to an output variable and thus serves as a natural measure of the information transfer serving predictive coding, i.e., the transfer of novel information from the input to the output of the receiving cell. As discussed in the previous section, the information transfer across the synapse, quantified by TE, comprises both information uniquely provided by the synapse's input and information transferred due to synergistic effects between the past state of the input and the past state of the target cell.
Next, we will review how our information-theoretic results relate to possible biophysical implementations of the computations performed.

Linking information-theoretic results to biophysiology
We found that information transfer was highest for highly predictable RGC spikes and that these input spikes were typically preceded by another input spike at a short interval. Our findings are in line with previous studies showing that RGC spikes with a preceding RGC spike are more effective in driving an LGN response than single spikes [55,59,63,65,76], and that this efficacy is even higher for ISIs <10 ms [59,63–65,76–78].
It has been hypothesized that double spikes are important for enabling temporal summation at the post-synaptic membrane: Carandini and colleagues [79] presented a model of the retinogeniculate synapse in which information transfer was governed by the temporal summation of excitatory postsynaptic potentials (EPSPs) evoked by retinal spikes. Here, EPSPs remained approximately constant or even increased for smaller ISIs. Hence, the dominant biophysical mechanism enabling information transfer at the synapse seemed to be post-synaptic summation rather than a change in pre-synaptic conditions due to enhanced spike rates. This fits with the LGN's limited ability to integrate spikes over large time windows, where the typical time constant for X- and Y-cells in the LGN has been measured to be 15 ms to 22 ms [80]. The cells' true ability to perform temporal summation may be even lower, because the time constant may not be a suitable measure of the ability for temporal integration under real-world conditions [81]. Temporal summation as a mechanism is further supported by the fact that LGN principal cells receive input from just a small number of RGCs, of which one is typically the main driver [64,82]; furthermore, single RGCs alone are able to drive the target LGN principal cell [59,62,63,65,77], such that population coding is an unlikely mechanism for information transfer at the retinogeniculate synapse. Also, as noted by Rowe and colleagues, contribution rises strongly under "structured" visual stimulation [63].
In sum, temporal summation over incoming spiking activity on short time scales is a likely mechanism for driving information transfer from RGCs to LGN principal cells. Our findings are compatible with this mechanism, as information transfer was highest for the second RGC spike in tuples with short ISI (see Fig 6). This spike also had high predictability, explaining the observed correlations based on the above biophysical mechanism. Furthermore, when applying partial information decomposition, we found predominantly unique information transfer from RGC to LGN, which is in line with the fact that almost all LGN spikes are triggered by an RGC spike [55,77]. In sum, our novel framework yields results that have a mechanistic explanation in the observed properties of biological neurons (and it would be a reason for concern if this were not the case), but it adds another explanatory layer to the mere biophysiological description: it measures exactly what the information-processing consequences of the established biophysical principles are. This computational description in information-theoretic terms allows us to bridge the explanatory gap between biophysics and predictive coding theories. Also, our information-theoretic analysis shows that basic predictive coding for reliable information (i.e., for predictable inputs) may in some cases be realizable by cellular biophysical principles alone, if the input spiking statistics allow for exploiting these principles. It remains an open question, however, whether cellular biophysics alone could also be exploited for coding for prediction errors.

Predictive coding at the retinogeniculate synapse
We applied the proposed local information dynamics framework to investigate which of two alternative predictive coding strategies was used at the retinogeniculate synapse. In particular, we tested whether the synapse coded for unpredictable or surprising input versus predictable input. Both coding strategies have been formulated as part of wider theories of predictive coding. The first strategy is proposed by, for example, [20] or [21,22], and states that bottom-up signals in the cortical hierarchy generally reflect the propagation of prediction errors. This family of predictive coding theories also proposes that top-down signals represent predictions made at a higher cortical area about the next lower area, in order to "explain away" sensory input at the lower area [20–22]. Here, the bottom-up error signal represents the part of the input not explained away by the top-down prediction and thus signals the mismatch between prediction and input [83]. The second coding strategy is proposed as part of a further family of predictive coding-like theories, which oppose the propagation of prediction errors and instead state that the bottom-up signal in the cortical hierarchy represents predictable input. Examples of this family are, amongst others, ART [16,17] and the biased competition model [18,19]. Both theories assume that sensory input matching top-down information is amplified and propagated up the cortical hierarchy.
It is an ongoing debate which of the two proposed strategies neural systems use. Both strategies have been shown to be equivalent on a functional level (both use the principle of predictive coding to realize perception and action in the cortex), while they differ on an algorithmic level [84] and, as a consequence, in their implementation. Spratling and colleagues showed that both strategies are equivalent in their ability to realize predictive-coding-like information processing in artificial neural networks [23]. This was supported by Kveraga et al. [24], who suggested that different realizations of PCT, such as the biased competition model, ART, and error-coding theories, could easily be accommodated by the computational model of top-down and bottom-up information processing presented in [25] (see also [26] for a further comparison of theories on top-down activity). Here, our framework provides an alternative approach to testing neural coding strategies against each other.

Evidence for coding for predictable input found at the retinogeniculate synapse
Applying the proposed framework, we found that the retinogeniculate synapse preferentially coded for predictable input. In 15 of 17 investigated cells, local predictability of the input correlated positively with local information transfer between input and output, indicating the preferential transfer of predictable input. Also, we found that RGC spikes were more efficient in driving an LGN response if they were highly predictable from their immediate past. Lastly, we used PID [7,34] to decompose the information transfer from the RGC to the LGN principal cell into information uniquely provided by the RGC about the next state of the LGN and synergistic information provided jointly by the RGC's and the LGN's pasts. We found that unique information in the RGC was transferred in the majority of investigated cell pairs, as opposed to synergistic information, which indicates that primarily sensory evidence was transferred across the synapse.
This last finding, that information transfer was dominated by unique information relayed from the RGC, supports a coding strategy for predictable information as follows: if the information transfer across the synapse served the propagation of prediction errors, we would expect high synergistic information transfer across the synapse. This is because, to compute the occurrence of a prediction error, both the past state of the target cell (the prediction) and the past state of the input cell (the sensory evidence) have to be known. The error would then be computed by comparing these two inputs, leading to a response if there was a mismatch between the two states. Technically, for single spiking events, perfectly determining the occurrence of a prediction error is equivalent to a binary XOR operation, which leads to purely synergistic information between the two inputs and the output, because knowing only one input does not provide any information about what the output should be (see also the next section). While it is well known that single neurons can only approximate a binary XOR, such an approximation would still lead to considerable synergistic information. Conversely, if the information transfer across the synapse served the propagation of predictable information, we would expect low synergistic information and some unique information in the input about the output (in a process similar to a binary AND operation). The latter scenario is in line with our empirical findings, indicating that the input to the synapse provided unique information about the next state of the target cell (observing a spike in one source increases the probability of observing a spike in the output). Hence, the PID results support coding for predictable input rather than coding for prediction errors.
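The XOR-versus-AND intuition can be made concrete with a small calculation, assuming uniform binary inputs X, Y and a deterministic output Z = f(X, Y). This computes only the classical mutual-information terms, not a full PID, but it already shows the signature: for XOR, each single input carries zero information about the output while both together carry one full bit (pure synergy); for AND, each single input already carries unique-plus-shared information on its own:

```python
import numpy as np
from itertools import product

def mi(joint):
    """Mutual information (bits) from a 2-D joint probability table."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0  # skip zero-probability cells
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz])))

def input_output_mi(f):
    """I(X;Z), I(Y;Z), and I(X,Y;Z) for uniform binary X, Y and Z = f(X, Y)."""
    pxz = np.zeros((2, 2))
    pyz = np.zeros((2, 2))
    pxyz = np.zeros((4, 2))  # joint input (x, y) encoded as one variable
    for x, y in product((0, 1), repeat=2):
        z = f(x, y)
        pxz[x, z] += 0.25
        pyz[y, z] += 0.25
        pxyz[2 * x + y, z] += 0.25
    return mi(pxz), mi(pyz), mi(pxyz)
```

For XOR, `input_output_mi(lambda x, y: x ^ y)` yields (0, 0, 1) bits; for AND, each single input provides about 0.31 bits and both together 0.81 bits, so most of the transfer is available from a single source.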
Lastly, we want to emphasize that the framework presented here does not provide us with new information about the processing at the synapse, i.e., information that is not already contained in the spiking statistics of the cell. Rather, the framework is an approach that, while being task-agnostic, is able to cast the computations performed by the biophysical dynamics into a quantitative and human-interpretable form. Indeed, our finding of a preferential transfer of predictable input sheds an interesting light on the findings in [55]. Predictable input to the LGN cells (spike tuples) is produced when an RGC is stimulated by its preferred input. Thus, the signals relayed by LGN cells are strongly representational in nature, rather than differential. In sum, the biophysical mechanism and its function in terms of enhancing representations were known, but our analysis adds a quantitative computational interpretation within a framework that can be applied to radically different circuits for comparison.

Potential influences of anesthesia on cortical feedback and predictive coding
In our analysis, we used recordings from animals under anesthesia, which may affect our results due to the well-known change in information transfer under anesthesia, predominantly in the top-down direction [85-90]. Hence, under anesthesia, top-down information transfer from V1 to the LGN is very likely reduced, and it is conceivable that LGN function, and thus the algorithm embodied by the synapse, differs between the anesthetized and awake state. V1 activity affects LGN function via direct and indirect connections (e.g., [11,13,91]), whose functional role may vary between facilitation and suppression of LGN spiking [91,92]. As a result, the algorithm embodied by the retinogeniculate synapse may change if the cortex is active during recordings. However, conducting recordings in the awake state may be technically challenging, and the analysis of data from complex models of the retinogeniculate circuit, as for example recently presented by Rogala and colleagues [12], may pose a viable alternative.
However, investigating information processing at the retinogeniculate synapse while V1 is active would allow us to integrate recordings from V1 as a second input to the LGN into our analysis. This approach is an alternative to quantifying prediction errors by measuring information storage in the RGC input alone, thus circumventing the limitations of this approach that were discussed above. We want to highlight that if input from V1 or other sources became available in data sets, the presented framework is easily extendable to include such sources. In this case, one may define the predictive information storage as the information provided by both the past of V1 and the RGC input about the RGC's future state. Similar to the analysis of local storage-transfer correlations above, the PID-based analysis of coding for predictable versus coding for unpredictable information can be adapted to an additional input: if the synapse coded for prediction errors, we would expect information transfer only in case of a mismatch between top-down input from V1 and bottom-up input from the RGC. In other words, the LGN should spike whenever it received a spike exclusively in the top-down signal or in the bottom-up signal. This is measured by the synergistic information, I_syn(LGN⁺ : RGC⁻, V1⁻). If the synapse coded for predictable input, we would expect information transfer in case of matching inputs, i.e., the LGN should spike whenever it received a spike in both input signals. This is measured by the shared information, I_shd(LGN⁺ : RGC⁻, V1⁻).
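This three-variable logic can be sketched numerically. The illustration below uses the Williams-Beer I_min redundancy measure, which is only one of several PID definitions and not necessarily the measure used in our main analysis; the gates mapping hypothetical V1 and RGC spikes to an LGN output are likewise illustrative assumptions:

```python
import math
from collections import Counter

def pid_imin(samples):
    """Williams-Beer PID (I_min) for two sources and one target.
    samples: list of (s1, s2, z) triples, treated as equiprobable observations.
    Returns shared, unique1, unique2, and synergy in bits."""
    n = len(samples)
    pz = Counter(z for _, _, z in samples)

    def mi(proj):
        # plug-in mutual information I(proj(sample); Z)
        pairs = [(proj(s), s[2]) for s in samples]
        pa, pab = Counter(a for a, _ in pairs), Counter(pairs)
        return sum((c / n) * math.log2((c / n) / ((pa[a] / n) * (pz[b] / n)))
                   for (a, b), c in pab.items())

    def specific_info(z, idx):
        # I(Z = z; S_idx): information a single source provides about outcome z
        rows = [s for s in samples if s[2] == z]
        val = 0.0
        for s_val, c in Counter(s[idx] for s in rows).items():
            n_s = sum(1 for s in samples if s[idx] == s_val)
            val += (c / len(rows)) * math.log2((c / n_s) / (pz[z] / n))
        return val

    shared = sum((pz[z] / n) * min(specific_info(z, 0), specific_info(z, 1)) for z in pz)
    i1, i2 = mi(lambda s: s[0]), mi(lambda s: s[1])
    i_joint = mi(lambda s: (s[0], s[1]))
    u1, u2 = i1 - shared, i2 - shared
    return {"shared": shared, "unique1": u1, "unique2": u2,
            "synergy": i_joint - shared - u1 - u2}

# Hypothetical coding schemes for the LGN output given V1 and RGC input spikes:
xor_gate = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]  # spike on mismatch (error coding)
and_gate = [(a, b, a & b) for a in (0, 1) for b in (0, 1)]  # spike on match (predictable coding)

pid_xor = pid_imin(xor_gate)
pid_and = pid_imin(and_gate)
print(pid_xor, pid_and)
```

As expected under this measure, the mismatch (XOR-like) scheme is purely synergistic, whereas the match (AND-like) scheme yields nonzero shared information (about 0.31 bit).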

Relation of results from the retinogeniculate synapse to theories of predictive coding in the neocortex
We remark that the results presented here should not be seen as a refutation of predictive coding theories that propose the bottom-up propagation of prediction errors as a general information processing principle in the neocortex [20-22], simply because we did not investigate information processing in the cortex. Rather, different algorithms may be used at other levels of the processing hierarchy, especially within the cortex [10,76,93], even though our study provides evidence for an enhancement of predictable information in the subcortical visual system.

Analysing cortico-cortical predictive coding using storage-transfer correlations
The analysis presented here relied on the fact that we were able to analyse the transmission of information from the inputs to the output of a single neuron, because both the relevant inputs and the corresponding outputs were directly recorded. When transferring our analysis framework to cortico-cortical predictive coding, we face two obvious difficulties: first, it will become next to impossible to cover all relevant inputs to a neuron with sufficient spatio-temporal resolution (although some in-vivo single-cell optical techniques hold promise here); second, the estimation of the resulting high-dimensional probability distributions will pose an extreme challenge. At present, the best way forward seems to be to rely on summary signals such as local field potentials (LFPs) or optical techniques with coarser resolution. This, however, means that we are no longer in a position to analyse the information transferred through a single neuron. Thus, in this case we will have to resort to analysing a triplet of cortical patches, such that one hierarchically lower patch provides the inputs on which to quantify the lAIS, a second patch at an intermediate stage in the hierarchy serves as a receiver, and the information transfer from this second patch to a third one even higher up in the processing hierarchy provides a measure of the information transferred in the outputs of the intermediate, second patch. This idea is presented in more detail in [5]. Despite these difficulties, the analysis of cortico-cortical predictive coding using the proposed information-theoretic framework seems highly promising, as very explicit predictions have been made on the type of predictive coding, the location of error-computing units in upper cortical layers, and the corresponding LFP-frequency signatures of error signals [94,95]. Thus, we expect these hypotheses to be directly testable using frequency-resolved measures of information transfer [96].
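For spike trains represented as binary time series, the core quantity of such an analysis, transfer entropy, can be estimated with a simple plug-in estimator. A minimal sketch; the fixed history length and the synthetic relay scenario are illustrative assumptions, in contrast to the optimized non-uniform embeddings used in the main analysis:

```python
import math
import random
from collections import Counter

def transfer_entropy(src, tgt, k=1):
    """Plug-in transfer entropy TE(src -> tgt) in bits with history length k:
    TE = sum over realizations of p(y_t, x_past, y_past)
         * log2[ p(y_t | x_past, y_past) / p(y_t | y_past) ]."""
    triples = [(tgt[t], tuple(src[t - k:t]), tuple(tgt[t - k:t]))
               for t in range(k, len(tgt))]
    n = len(triples)
    c_full = Counter(triples)
    c_xy = Counter((xp, yp) for _, xp, yp in triples)  # counts of (x_past, y_past)
    c_ty = Counter((yt, yp) for yt, _, yp in triples)  # counts of (y_t, y_past)
    c_y = Counter(yp for _, _, yp in triples)          # counts of y_past
    te = 0.0
    for (yt, xp, yp), c in c_full.items():
        num = c / c_xy[(xp, yp)]        # p(y_t | x_past, y_past)
        den = c_ty[(yt, yp)] / c_y[yp]  # p(y_t | y_past)
        te += (c / n) * math.log2(num / den)
    return te

random.seed(0)
x = [random.randint(0, 1) for _ in range(20000)]
y_relay = [0] + x[:-1]                                  # perfect relay with delay 1
y_indep = [random.randint(0, 1) for _ in range(20000)]  # unrelated spike train

te_relay = transfer_entropy(x, y_relay)  # close to 1 bit
te_indep = transfer_entropy(x, y_indep)  # close to 0 bits
print(te_relay, te_indep)
```

The perfect-relay target recovers close to one bit per time step, while the unrelated target yields transfer entropy near zero (up to the small positive bias of the plug-in estimator).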

Conclusion
Tests of predictive coding theories are at risk of being influenced by the researchers' implicit assumptions about what a brain should predict. To circumvent this, careful design of such experiments is necessary, but this may not always be possible due to a lack of the required prior knowledge of brain function. Moreover, without such a design, data from experiments not conceived with predictive coding in mind cannot be used to refute or confirm predictive coding theories, although such tests need to be performed for a theory that claims broad applicability to brain function. For these reasons, an analysis framework that is independent of experimental design and of the experimenter's assumptions would be highly beneficial. Here we present such a framework, based on the correlation between the information-theoretic equivalents of predictability and prediction errors. In a proof-of-principle analysis of the re-encoding of retinal ganglion cell inputs in lateral geniculate nucleus principal cells, we demonstrate coding for predictable information in an anesthetized animal.

S1 Appendix. Localized Panzeri-Treves bias correction for plug-in estimators.
The bias of plug-in entropy estimators for finite sample sizes can be analytically approximated in the asymptotic sampling regime, i.e., N ≥ |A_X|, as shown by Panzeri and Treves [42,43]:

B_X = −[m − 1],   B_{X|Y} = Σ_{y∈A_Y} −[m_y − 1],   (21)

where the bias terms enter the corrected estimators with a factor 1/(2N ln 2), m is the alphabet size, m = |A_X|, and m_y is the alphabet size given that Y = y has occurred, m_y = |A_{X|Y=y}|.
Note that the true alphabet sizes, m and m_y, are typically not known for experimental data, and the number of actually observed responses can only be considered a lower bound on the true values. An estimate of m and m_y can be obtained from experimental data via a Bayesian counting procedure [?, 43], for example as implemented in the pyEntropy toolbox [53].
To obtain a local bias correction for lAIS and lTE using the Panzeri-Treves correction, we first derive a bias correction for the local MI. We start by applying the correction to a plug-in estimator of the non-local MI (Eq 1, main text), yielding Eq 22. We may then rewrite the corrected MI as an expected value over all observations, analogous to [6]; the constant term B_X can be brought inside the average because of the linearity of the expected value.

Fig 1 .
Fig 1. Overview of analysis approach. A Information-theoretic measures of predictability and information transfer: active information storage (AIS) quantifies the predictability of a process's current state x_t from its immediate past x⁻; transfer entropy (TE) quantifies the information transfer from a source process X to a target process Y by quantifying the predictability of the target's current state y_t from the source's past x⁻, in the context of the target's immediate past y⁻. B Local storage-transfer correlations (LSTC) relating local AIS (lAIS) as a measure of predictability and local TE (lTE) as a measure of information transfer: if a neuron codes for predictable input, a positive correlation is expected; if the neuron codes for unpredictable input, a negative correlation is expected (adapted from [5]). C Realizations of predictive coding in the cortex (adapted from [10]): bottom-up sensory input (dotted arrows) is compared to predictions propagated in top-down direction from a hierarchically higher cortical level (solid arrows) that represent the current prior about the input (white bars). See main text. D Physiology of the retinogeniculate synapse and recording sites [11,12]: recordings were collected from in- and outputs of the synapse between retinal ganglion cells (RGC) and layer A principal cells (PC) in the lateral geniculate nucleus (LGN). We estimated local active information storage (lAIS, blue arrow) within the synapse input, and local transfer entropy (lTE, red arrow) between in- and output of the synapse. Schematic representations of known connections of PC and RGC are shown in grey (round markers indicate synapses): excitatory cells in layer 6 of primary visual cortex (V1) form feedback connections with LGN PC and also project to LGN inhibitory interneurons (int) and the perigeniculate nucleus (PGN). Interneurons provide inhibitory input to LGN PC: intrageniculate interneurons (int) mediate feed-forward inhibition from RGC cells, while PGN cells provide recurrent inhibition [11,13]; PGN interneurons further form reciprocal, inhibitory connections amongst each other (dashed line).

Fig 2 .
Fig 2. Local storage-transfer correlations (LSTC) for exemplary cell pairs. Histograms of LSTC for representative cell pairs with the highest (pairs 10 and 11) and lowest (pairs 12 and 15) contribution, respectively. The first column shows histograms for all spikes, the second column for relayed spikes, and the third column for non-relayed spikes. Relayed spikes showed positive lTE and generally positive lAIS, while non-relayed spikes led to zero or negative lTE and lower lAIS.

Fig 3 .
Fig 3. Correlation between contribution and local storage-transfer correlations (LSTC) for all spike pairs.

Fig 4 .
Fig 4. Information dynamics of relayed versus non-relayed RGC spikes. An RGC spike was considered relayed to the LGN if it was followed by an LGN spike with the delay reconstructed as part of lTE estimation. Spike-triggered average (STA) of lAIS values for (A) all, (B) relayed, and (C) non-relayed RGC spikes; (D) lAIS values for each cell pair at relayed (dark blue) and non-relayed (light blue) RGC spikes (p < 0.001***, permutation test with 1000 permutations). STA of lTE values for (E) all, (F) relayed, and (G) non-relayed RGC spikes; (H) lTE values for each cell pair at relayed (dark red) and non-relayed (light red) RGC spikes (p < 0.001***, permutation test with 1000 permutations). STA of the LGN spike train for (I) all, (J) relayed, and (K) non-relayed RGC spikes; STA of the RGC spike train for (L) all, (M) relayed, and (N) non-relayed RGC spikes.

Fig 5 .
Fig 5. Information dynamics of inter-spike intervals (ISI). A Distribution of ISI pooled over all cell pairs (maximum at 3 ms, median of 26.65 ms, and standard deviation of 56.80 ms). B lAIS at the RGC spike as a function of the preceding ISI (maximum at ISI = 3 ms, dashed vertical line); C lTE at the RGC spike by preceding ISI (maximum at ISI = 2 ms, grey vertical line); D lAIS at relayed (dotted line) and non-relayed (dashed line) RGC spikes as functions of the preceding ISI (maxima at ISI = 2 ms for relayed spikes, dotted vertical line, and at ISI = 3 ms for non-relayed spikes, dashed vertical line); E lTE at relayed (dotted line) and non-relayed (dashed line) RGC spikes as functions of the preceding ISI (maxima at ISI = 2 ms for relayed spikes, dotted vertical line, and at ISI = 28 ms for non-relayed spikes, dashed vertical line).

Fig 6 .
Fig 6. Spike-triggered averages (STAs) for spike tuples. STAs for spike tuples with a silence time of 20 ms and inter-spike intervals (ISI) of up to 20 ms (aligned on the first spike in a tuple). The left column shows lAIS values averaged over cell pairs for ISI of 1 ms to 10 ms, the right column shows averaged lTE values (shaded areas indicate ±1 SD). Note that lTE values are shifted by the individual delay between the RGC and LGN cell of each pair. Hence, a spike at index t = 0 indicates a transferred spike with a delay corresponding to the reconstructed information-transfer delay.

Fig 7 .
Fig 7. State-dependent and -independent information transfer from RGC to LGN cell. State-dependent and -independent information transfer from RGC to LGN cell, measured by the synergistic information, I_syn(Y_t; X_S, Y_S) (dark gray), and the unique information, I_unq(Y_t; X_S) (light gray). In 11 out of 16 pairs, more than half of the transferred information was state-independent.

Î(X;Y) ≈ Σ_{y∈A_Y} Σ_{x∈A_X} p(x,y) log₂ [p(x,y)/(p(x)p(y))] + [B_X − B_{X|Y}]/(2N ln 2).   (22)

To obtain a localized version of this estimator, note that the correction term B_X is constant over x ∈ A_X and y ∈ A_Y, but B_{X|Y} is not. The latter can be rewritten as a sum over y ∈ A_Y, B_{X|Y} = Σ_{y∈A_Y} b_{X|y} = Σ_{y∈A_Y} −[m_y − 1] (Eq 21), such that we can write

Î(X;Y) ≈ Σ_{y∈A_Y} [Σ_{x∈A_X} p(x,y) log₂ (p(x,y)/(p(x)p(y))) − b_{X|y}/(2N ln 2)] + B_X/(2N ln 2),   (23)

where b_{X|y} is the individual contribution of realization y ∈ A_Y to the average correction term B_{X|Y}. By dividing b_{X|y} by the alphabet size m, this correction can be distributed over all x ∈ A_X and moved inside the double sum:

Î(X;Y) ≈ Σ_{y∈A_Y} Σ_{x∈A_X} [p(x,y) log₂ (p(x,y)/(p(x)p(y))) − b_{X|y}/(m · 2N ln 2)] + B_X/(2N ln 2).   (24)
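As a numerical sanity check, the plug-in MI and the Panzeri-Treves correction terms can be computed directly from counts. A minimal sketch, estimating m and m_y naively as the number of observed symbols (a lower bound on the true alphabet sizes, cf. the Bayesian counting procedure above) and applying the correction in the standard direction, i.e., reducing the upward bias of the plug-in MI:

```python
import math
import random
from collections import Counter

def mi_plugin_pt(pairs):
    """Plug-in MI with a Panzeri-Treves-style bias correction (sketch).
    Returns (raw plug-in MI, bias-corrected MI) in bits."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    mi = sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
             for (x, y), c in pxy.items())
    m = len(px)  # observed |A_X|, stand-in for the true alphabet size
    m_y = {y: len({x for (x, yy) in pxy if yy == y}) for y in py}
    # bias of the plug-in MI: [sum_y (m_y - 1) - (m - 1)] / (2N ln 2)
    bias = (sum(my - 1 for my in m_y.values()) - (m - 1)) / (2 * n * math.log(2))
    return mi, mi - bias

random.seed(1)
# two independent binary variables: the true MI is zero
pairs = [(random.randint(0, 1), random.randint(0, 1)) for _ in range(200)]
mi_raw, mi_corr = mi_plugin_pt(pairs)
print(mi_raw, mi_corr)  # the corrected value is shifted toward zero
```

For independent inputs, the raw plug-in estimate is positively biased at small N; the correction shrinks it toward the true value of zero.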
c(LSTC, contribution) = 0.6879, p = 0.0030.

S1 Fig. Distribution of optimized past variable lags. Distribution of lags of past variables identified through optimization of the non-uniform embedding for (A) lAIS, (B) lTE (source), and (C) lTE (target).
Classification of relayed versus non-relayed RGC spikes. Classification accuracy for relayed versus non-relayed spikes based on lAIS values, inter-spike intervals (ISI), and spike counts within a time window of 30 ms prior to an RGC spike (k-nearest-neighbor classifier with k = 5; values averaged over ten repetitions of five-fold cross-validation).

S1 Table. Spike train statistics, optimized embedding lengths, and information-transfer delays. Contribution and efficacy values are taken from [55]. Optimized non-uniform embedding lengths for lAIS and lTE estimation and the reconstructed information-transfer delay u for lTE estimation. a Percentage of RGC spikes preceding an LGN spike. b Percentage of LGN spikes preceded by RGC spikes.

S2 Table. Local storage-transfer correlation coefficients. Local storage-transfer correlation (LSTC) coefficients for all cell pairs with significant lAIS and lTE.