## Abstract

Information theory allows us to investigate information processing in neural systems in terms of information transfer, storage and modification. In particular, the measure of information transfer, transfer entropy, has seen a dramatic surge of interest in neuroscience. Estimating transfer entropy from two processes requires the observation of multiple realizations of these processes to estimate associated probability density functions. To obtain these necessary observations, available estimators typically assume stationarity of processes to allow pooling of observations over time. This assumption, however, is a major obstacle to the application of these estimators in neuroscience, as observed processes are often non-stationary. As a solution, Gomez-Herrero and colleagues theoretically showed that the stationarity assumption may be avoided by estimating transfer entropy from an ensemble of realizations. Such an ensemble of realizations is often readily available in neuroscience experiments in the form of experimental trials. Thus, in this work we combine the ensemble method with a recently proposed transfer entropy estimator to make transfer entropy estimation applicable to non-stationary time series. We present an efficient implementation of the approach that is suitable for the increased computational demand of the ensemble method's practical application. In particular, we use a massively parallel implementation for a graphics processing unit to handle the computationally heaviest aspects of the ensemble method for transfer entropy estimation. We test the performance and robustness of our implementation on data from numerical simulations of stochastic processes. We also demonstrate the applicability of the ensemble method to magnetoencephalographic data.
While we mainly evaluate the proposed method for neuroscience data, we expect it to be applicable in a variety of fields that are concerned with the analysis of information transfer in complex biological, social, and artificial systems.

**Citation:** Wollstadt P, Martínez-Zarzuela M, Vicente R, Díaz-Pernas FJ, Wibral M (2014) Efficient Transfer Entropy Analysis of Non-Stationary Neural Time Series. PLoS ONE 9(7): e102833. https://doi.org/10.1371/journal.pone.0102833

**Editor:** Daniele Marinazzo, Universiteit Gent, Belgium

**Received:** December 20, 2013; **Accepted:** June 24, 2014; **Published:** July 28, 2014

**Copyright:** © 2014 Wollstadt et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** MW and RV received financial support from LOEWE Grant “Neuronale Koordination Forschungsschwerpunkt Frankfurt (NeFF)”. MMZ received financial support from the University of Valladolid. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

We typically think of the brain as some kind of information processing system, albeit mostly without having a strict definition of information processing in mind. However, more formal accounts of information processing exist, and may be applied to brain research. In efforts dating back to Alan Turing [1] it was shown that any act of information processing can be broken down into the three components of information storage, information transfer, and information modification [1]–[4]. These components can be easily identified in theoretical or technical information processing systems, such as ordinary computers, based on the specialized machinery for and the spatial separation of these component functions. In these examples, a separation of the components of information processing via a specialized mathematical formalism seems almost superfluous. However, in biological systems in general, and in the brain in particular, we deal with a form of distributed information processing based on a large number of interacting agents (neurons), and each agent at each moment in time subserves any of the three component functions to a varying degree (see [5] for an example of time-varying storage). In neural systems it is indeed crucial to understand where and when information storage, transfer and modification take place, to constrain possible algorithms run by the system. While there is still a struggle to properly define information modification [6], [7] and its proper measure [8]–[12], well established measures for (local active) information storage [13], information transfer [14], and its localization in time and space [15], [16] exist, and are applied in neuroscience (for information storage see [5], [17], [18], for information transfer see below).

Especially the measure for information transfer, transfer entropy (TE), has seen a dramatic surge of interest in neuroscience [19]–[41], physiology [42]–[44], and other fields [6], [15], [31], [45], [46]. Nevertheless, conceptual and practical problems still exist. On the conceptual side, information transfer has been for a while confused with causal interactions, and only some recent studies [47]–[49] made clear that there can be no one-to-one mapping between causal interactions and information transfer, because causal interactions will subserve all *three* components of information processing (transfer, storage, modification). However, it is information transfer, rather than causal interactions, we might be interested in when trying to understand a computational process in the brain [48].

On the practical side, efforts to apply measures of information transfer in neuroscience have been hampered by two obstacles: (1) the need to analyze the information processing in a multivariate manner, to arrive at unambiguous conclusions that are not clouded by spurious traces of information transfer, e.g. due to effects of cascades and common drivers; (2) the fact that available estimators of information transfer typically require the processes under investigation to be stationary.

The first obstacle can in principle be overcome by conditioning TE on all other processes in a system, using a fully multivariate approach that had already been formulated by Schreiber [14]. However, the naive application of this approach normally fails because the samples available for estimation are typically too few. Therefore, recently four approaches to build an approximate representation of the information transfer network have been suggested: Lizier and Rubinov [50], Faes and colleagues [44], and Stramaglia and colleagues [51] presented algorithms for iterative inclusion of processes into an approximate multivariate description. In the approach suggested by Stramaglia and colleagues, conditional mutual information terms are additionally computed at each level as a self-truncating series expansion, following a suggestion by Bettencourt and colleagues [52]. In contrast to these approaches that explicitly compute conditional TE terms, we recently suggested an approximation based on a reconstruction of information transfer delays [53] and a graphical pruning algorithm [54]. While the first three approaches will eventually be closer to the ground truth, the graphical method may be better applicable to very limited amounts of data. In sum, the first problem of multivariate analysis can be considered solved for practical purposes, given enough data are available.

The second obstacle of dealing with non-stationary processes is also not a fundamental one, as the definition of TE relies on the availability of multiple realizations of (two or more) random processes, which can be obtained by running an ensemble of many identical copies of the processes in question, or by running one process multiple times. Only when obtaining data from such copies or repetitions is impossible do we have to turn to a stationarity assumption in order to evaluate the necessary probability density functions (PDF) based on a single realization.

Fortunately, in neuroscience we can often obtain many realizations of the processes in question by repeating an experiment. In fact, this is the typical procedure in neuroscience: we repeat trials under conditions that are kept as constant as possible (i.e. we create a cyclostationary process). The possibility to use such an *ensemble* of data to estimate the time resolved TE has already been demonstrated theoretically by Gomez-Herrero and colleagues [55]. Practically, however, the statistical testing necessary for this ensemble-based method leads to an increase in computational cost by several orders of magnitude, as some shortcuts in statistical validation that can be taken for stationary data cannot be used for the ensemble approach (see [56]). For stationary data, TE is calculated per trial and *one* set of trial-based surrogate data may be used for statistical testing. The ensemble method does not allow for trial-based TE estimation, as TE is estimated across trials. Instead, the ensemble method requires the generation of a sufficiently large number of surrogate data sets, for *all* of which TE has to be estimated, thus multiplying the computational demand by the number of surrogate data sets. Therefore, the use of the ensemble method has remained a theoretical possibility so far, especially in combination with the nearest neighbor-based estimation techniques by Kraskov and colleagues [57] that provide the most precise, yet computationally heaviest TE estimates. For example, the analysis of magnetoencephalographic data presented here would require a runtime of 8200 h for 15 subjects and a single experimental condition. It is easy to see that any practical application of the methods hinges on a substantial speed-up of the computation.

Fortunately, the algorithms involved in ensemble-based TE estimation lend themselves easily to data-parallel processing, since most of the algorithm's fundamental parts can be computed simultaneously. Thus, our problem matches the massively parallel architecture of Graphics Processing Unit (GPU) devices well. GPUs were originally devised only for computer graphics, but are routinely used to speed up computations in many areas today [58], [59]. Also in neuroscience, where applied algorithms continue to grow faster in complexity than the CPU performance, the use of GPUs with data-parallel methods is becoming increasingly important [60] and GPUs have successfully been used to speed up time series analysis in neuroscientific experiments [61]–[66].

Thus, in order to overcome the limitations set by the computational demands of TE analysis from an ensemble of data, we developed a GPU implementation of the algorithm, where the neighbor searches underlying the binless TE estimation [57] are executed in parallel on the GPU. After parallelizing this computationally most heavy aspect of TE estimation we were able to use the ensemble method for TE estimation proposed by [55], to estimate time-resolved TE from non-stationary neural time-series in acceptable time. Using the new GPU-based TE estimation tool on a high-end consumer graphics card reduced computation time by a factor of 50 compared to the CPU optimized TE search used previously [67]. In practical terms, this speedup shortens the duration of an ensemble-based analysis for typical neural data sets enough to make the application of the ensemble method feasible for the first time.

## Background

Our study focuses on making the application of ensemble-based estimation of TE from non-stationary data practical using a GPU-based algorithm. For the convenience of the reader, this section presents the necessary background on stationarity, TE estimation using the Kraskov-Stögbauer-Grassberger (KSG) estimator [19], and the ensemble method of Gomez-Herrero et al. [55] in condensed form. Readers well familiar with these topics can safely skip ahead to the *Implementation* section.

### Notation

To describe practical TE estimation from time series recorded in a system of interest (e.g. a brain area), we first have to formalize these recordings mathematically: We define an observed time series $x_t$ as a realization of a random process $X$. A random process here is simply a collection of individual random variables $X_t$ sorted by an integer index $t \in \{1, \ldots, N\}$, representing time. TE or other information theoretic functionals are then calculated from the random variables' joint PDFs $p(X_s = a_j, Y_t = b_k)$ and conditional PDFs $p(X_s = a_j \mid Y_t = b_k)$ (with $s, t \in \{1, \ldots, N\}$), where $a_j$ and $b_k$ range over all possible outcomes of the random variables $X_s$ and $Y_t$.

We call information theoretic quantities functionals as they are defined as functions that map from the space of PDFs to the real numbers. If we have to estimate the underlying probabilities from experimental data first, the mapping from the data to the information theoretic quantity (a real number) is called an estimator.

### Stationarity and non-stationarity in experimental time series

PDFs in neuroscience are typically not known *a priori*, so in order to estimate information theoretic functionals, these PDFs have to be reconstructed from a sufficient amount of observed realizations of the process. How these realizations are obtained from data depends on whether the process in question is stationary or non-stationary. Stationarity of a process $X$ means that PDFs of the random variables that form the random process do not change over time, such that $p(X_t = a) = p(X_{t+\tau} = a)$ for all $t$ and $\tau$. Any PDF may then be estimated from one observation of process $X$ by means of collecting realizations $x_t$ *over time* $t$.

For processes that do not fulfill the stationarity assumption, temporal pooling is not applicable, as PDFs vary over time and some random variables $X_{t_i}$, $X_{t_j}$ (at least two) are associated with different PDFs $p(X_{t_i})$, $p(X_{t_j})$ (Figure 1). To still gain the necessary multiple observations of a random variable $X_t$, we may either run multiple physical copies of the process or, in cases where physical copies are unavailable, repeat the process in time. If we choose the number of repetitions large enough, i.e. there is a sufficiently large set $\mathcal{R}$ of time points $T$ at which the process is repeated, we can assume that

$$p(X_{T+t} = a) = p(X_{T'+t} = a) \quad \forall\, T, T' \in \mathcal{R}, \tag{1}$$

i.e. PDFs at time point $t$ relative to the onset of the repetition at $T$ are equal over all repetitions. We call the repeated observations of a process an *ensemble* of time series. We may obtain a reliable estimation of $p(X_{T+t})$ from this ensemble by evaluating it over all observations $x_{T+t}$, $T \in \mathcal{R}$. For the sake of readability, we will refer to these observations from the ensemble as $x_t(r)$, where $t$ refers to a time point $T + t$, relative to the beginning of the process at time $T$, and $r$ refers to the index of the repetition. If a process is repeated periodically, i.e. the repetitions are spaced by a fixed interval $T_p$, we call such a process cyclostationary [68]:

$$p(X_t = a) = p(X_{t + n T_p} = a) \quad \forall\, n \in \mathbb{N}. \tag{2}$$

**Figure 1.** (A) Schematic account of TE. Two scalar time series $x_t(r)$ and $y_t(r)$ recorded from the $r$-th repetition of processes $X$ and $Y$, coupled with a delay $\delta$ (indicated by green arrow). Colored boxes indicate delay embedded states $\mathbf{x}^{d_x}_{t-u}(r)$, $\mathbf{y}^{d_y}_{t-1}(r)$ for both time series, with the embedding dimension given in samples (colored dots). The star on the $Y$ time series indicates the scalar observation $y_t(r)$ that is obtained at the target time of information transfer $t$. The red arrow indicates self-information-transfer from the past of the target process to the random variable $Y_t$ at the target time. The assumed delay $u$ is chosen such that $u = \delta$ and influences of the state $\mathbf{x}^{d_x}_{t-u}(r)$ arrive exactly at the information target variable $Y_t$. Information in the past state of $X$ is useful to predict the future value of $Y$ and we obtain nonzero TE. (B) To estimate probability density functions for $\mathbf{x}^{d_x}_{t-u}(r)$, $\mathbf{y}^{d_y}_{t-1}(r)$ and $y_t(r)$ at a certain point in time $t$, we collect their realizations from the observed repetitions $r$. (C) Realizations for a single repetition are concatenated into one embedding vector and (D) combined into one ensemble state space. Note that data are pooled over the ensemble of data instead of time. Nearest neighbor counts within the ensemble state space can then be used to derive TE using the Kraskov estimator proposed in [57].

In neuroscience, ensemble evaluation for the estimation of information theoretic functionals becomes relevant as physical copies of a process are typically not available and stationarity of a process can not necessarily be assumed. Gomez-Herrero and colleagues recently showed how ensemble averaging may be used to nevertheless estimate information theoretic functionals from cyclostationary processes [55]. In neuroscience for example, a cyclostationary process, and thus an ensemble of data, is obtained by repeating an experimental manipulation, e.g. the presentation of a stimulus; these repetitions are often called experimental *trials*. In the remainder of this article, we will use the term repetition, and interpret trials from a neuroscience experiment as a special case of repetitions of a random process. Building on such repetitions, we next demonstrate a computationally efficient approach to the estimation of TE using the ensemble method proposed in [55].
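The difference between pooling over time and pooling over repetitions can be illustrated with a small simulation. The following is a minimal sketch in Python, not part of TRENTOOL; the function names and the toy process (a mean that drifts linearly within each repetition) are assumptions made for illustration, and the ensemble mean serves only as a simple stand-in for a full PDF estimate.

```python
import random


def simulate_repetition(n_samples, rng):
    """One repetition of a toy non-stationary process (hypothetical example).

    The mean drifts linearly from 0 to 1 within the repetition, so the
    PDF of X_t depends on t.
    """
    return [t / (n_samples - 1) + rng.gauss(0.0, 0.1) for t in range(n_samples)]


def ensemble_mean(repetitions, t):
    """Pool over repetitions at a fixed time point t (ensemble evaluation)."""
    return sum(rep[t] for rep in repetitions) / len(repetitions)


def temporal_mean(repetition):
    """Pool over time within a single repetition (valid only if stationary)."""
    return sum(repetition) / len(repetition)


rng = random.Random(0)
reps = [simulate_repetition(50, rng) for _ in range(200)]

print("ensemble mean at t=0: ", round(ensemble_mean(reps, 0), 3))
print("ensemble mean at t=49:", round(ensemble_mean(reps, 49), 3))
print("temporal mean, rep 0: ", round(temporal_mean(reps[0]), 3))
```

The ensemble estimates follow the drifting mean at each time point, whereas the temporal estimate collapses the drift into a single value that describes none of the individual random variables.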

### Transfer entropy estimation from an ensemble of time series

#### Ensemble-based TE functional.

When independent repetitions of an experimental condition are available, it is possible to use ensemble evaluation to estimate various PDFs from an ensemble of repetitions of the time series [55]. By eliminating the need for pooling data over time, and instead pooling over repetitions, ensemble methods can be used to estimate information theoretic functionals for non-stationary time series. Here, we follow the approach of [55] and present an ensemble TE functional that extends the TE functional presented in [19], [20], [53] and also takes into account an extension of the original formulation of TE, presented in [53], guaranteeing self prediction optimality (indicated by the subscript $SPO$). In the next subsection, we will then present a practical and data-efficient estimator of this functional. The functional reads

$$TE_{SPO}(X \to Y, t, u) = I\left(Y_t ; \mathbf{X}^{d_x}_{t-u} \mid \mathbf{Y}^{d_y}_{t-1}\right), \tag{3}$$

where $I(\cdot\,;\cdot \mid \cdot)$ is the conditional mutual information, and $Y_t$, $\mathbf{Y}^{d_y}_{t-1}$, and $\mathbf{X}^{d_x}_{t-u}$ are the current value and the $d_y$-dimensional past state variable of the target process $Y$, and the $d_x$-dimensional past state variable at time $t-u$ of the source process $X$, respectively (see next paragraph for an explanation of states).

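Rewriting this, taking into account repetitions $r$ of the random processes explicitly, we obtain:

$$TE_{SPO}(X \to Y, t, u) = \sum_{y_t(r),\, \mathbf{y}^{d_y}_{t-1}(r),\, \mathbf{x}^{d_x}_{t-u}(r)} p\left(y_t(r), \mathbf{y}^{d_y}_{t-1}(r), \mathbf{x}^{d_x}_{t-u}(r)\right) \log \frac{p\left(y_t(r) \mid \mathbf{y}^{d_y}_{t-1}(r), \mathbf{x}^{d_x}_{t-u}(r)\right)}{p\left(y_t(r) \mid \mathbf{y}^{d_y}_{t-1}(r)\right)}. \tag{4}$$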

Here, $u$ is the assumed delay of the information transfer between processes $X$ and $Y$ [53]; $y_t(r)$ denotes the future observation of $Y$ in repetition $r$; $\mathbf{y}^{d_y}_{t-1}(r)$ denotes the past state of $Y$ in repetition $r$ and $\mathbf{x}^{d_x}_{t-u}(r)$ denotes the past state of $X$ in repetition $r$. Note that the functional used here is a modified form of the original TE formulation introduced by Schreiber [14]. Schreiber defined TE as a conditional mutual information $I(Y_t ; \mathbf{X}^{d_x}_{t-1} \mid \mathbf{Y}^{d_y}_{t-1})$, whereas the functional in eq. 3 implements the conditional mutual information $I(Y_t ; \mathbf{X}^{d_x}_{t-u} \mid \mathbf{Y}^{d_y}_{t-1})$ [53]. The latter functional, $TE_{SPO}$, contains the definition of Schreiber as a special case for $u = 1$. Note that the two functionals are identical if $TE_{SPO}$ is used with the physically correct delay (i.e. $u = \delta$) and a proper embedding for the source, and the Schreiber measure is used with an over-embedding such that the source state at $t - \delta$ is still fully covered by the source embedding.

In addition to the original formulation of $TE_{SPO}$ in [53], here we explicitly state that the necessary realizations of the random variables in question are obtained through *ensemble evaluation* over repetitions, assuming the underlying processes to be repeatable or cyclostationary. Furthermore, we note explicitly that this ensemble-based functional introduces the possibility of time resolved TE estimates.

We recently showed that the estimator presented in [53] can also be used to recover an unknown information transfer delay $\delta$ between two processes $X$ and $Y$, as $TE_{SPO}(X \to Y, t, u)$ is maximal when the assumed delay $u$ is equal to the true information transfer delay $\delta$ [53]. This holds for the extended estimator presented here, thus

$$\delta = \arg\max_{u}\, TE_{SPO}(X \to Y, t, u). \tag{5}$$

#### State space reconstruction and practical estimator.

Transfer entropy differs from the lagged mutual information by the additional conditioning on the past of the target time series, $\mathbf{y}^{d_y}_{t-1}$. This additional conditioning serves two important functions. First, as mentioned already by Schreiber in the original paper [14], and later detailed by Lizier [4] and Wibral and colleagues [39], [53], it removes the information about the future of the target time series that is already contained in its own past, $\mathbf{y}^{d_y}_{t-1}$. Second, this additional conditioning allows for a discovery of information transfer from the source to the target that can only be seen when taking into account information from the past of the target [69]. In the second case, the past information from the target serves to ‘decode’ this information transfer, and acts like a key in cryptography. Because the past of the target process is this important, all the necessary information in this past must be taken into account when evaluating the TE as in equation 4.

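To this end we need to form a collection of past random variables

$$\mathbf{Y}^{d_y}_{t-1} = \left\{Y_{t-1}, Y_{t-1-t_1}, \ldots, Y_{t-1-t_{d_y-1}}\right\}, \tag{6}$$

with lags $0 < t_1 < \ldots < t_{d_y-1}$, such that their realizations,

$$\mathbf{y}^{d_y}_{t-1} = \left\{y_{t-1}, y_{t-1-t_1}, \ldots, y_{t-1-t_{d_y-1}}\right\}, \tag{7}$$

are maximally informative about the future of the target process, $Y_t$.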

This task is complicated by the fact that we often deal with multidimensional systems, of which we only observe a scalar variable (here modeled as our random processes $X$, $Y$). To see this, think for example of a pendulum (which is a two dimensional system) of which we record only the current position. If the pendulum is at its lowest point, it could be standing still, going left, or going right. To properly describe which state the pendulum is in, we need to know at least the realization of one more random variable back in time. Collections of such past random variables whose realizations uniquely describe the state of a process are called *state variables*.

Such a sufficient collection of past variables, called a delay embedding vector, can always be reconstructed from scalar observations for low dimensional deterministic systems, such as the above pendulum, as shown by Takens [70]. Unfortunately, most real world systems are high-dimensional stochastic dynamic systems (best described by non-linear Langevin equations) rather than low-dimensional deterministic ones. For these systems it is not obvious that a delay embedding similar to Takens' approach would yield the desired results. In fact, many systems can be shown to require an infinite number of past random variables when only a scalar observable of the high-dimensional stochastic process is accessible. Nevertheless, as shown by Ragwitz and Kantz [71], the behavior of scalar observables of most of these systems can be approximated very well by a finite collection of such past variables for all practical purposes; in other words, these systems can be approximated well by a finite order, one-dimensional Markov-process.

For practical TE estimation using equation 4, we therefore proceed by first reconstructing the state variables of such approximated Markov processes for the two systems $X$, $Y$ from their scalar time series. Then, we use the statistics of nearest ensemble neighbors with a modified KSG estimator for TE evaluation [57].

Thus, we select a delay embedding vector of the form $\mathbf{Y}^{d_y}_{t-1} = (Y_{t-1}, Y_{t-1-\tau}, \ldots, Y_{t-1-(d_y-1)\tau})$ from equation 6 as our collection of past random variables, with realizations in repetition $r$ given by $\mathbf{y}^{d_y}_{t-1}(r) = (y_{t-1}(r), y_{t-1-\tau}(r), \ldots, y_{t-1-(d_y-1)\tau}(r))$. Here, $d_y$ is called the embedding dimension and $\tau$ the embedding delay. These embedding parameters $d_y$ and $\tau$ are chosen such that they optimize a local predictor [71], as this avoids an overestimation of TE [53]; other approaches related to minimizing non-linear prediction errors are also possible [44]. In particular, $\mathbf{Y}^{d_y}_{t-1}$ is chosen such that $Y_t$ is conditionally independent of any $Y_s$ with $s < t-1-(d_y-1)\tau$ given $\mathbf{Y}^{d_y}_{t-1}$. The same is done for the process $X$ at time $t-u$.
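As an illustration of the delay embedding described above, the following minimal Python sketch builds one embedding vector per repetition at a fixed time point. The function names `delay_embed` and `ensemble_states` are hypothetical and do not correspond to TRENTOOL functions; the embedding parameters are simply passed in rather than optimized.

```python
def delay_embed(series, t, dim, tau):
    """Return the delay vector (x[t], x[t - tau], ..., x[t - (dim - 1) * tau])."""
    first = t - (dim - 1) * tau
    if first < 0 or t >= len(series):
        raise IndexError("embedding window lies outside the time series")
    return [series[t - i * tau] for i in range(dim)]


def ensemble_states(repetitions, t, dim, tau):
    """Embed every repetition at the same time point t.

    The result holds one state vector per repetition, i.e. the points that
    are later pooled into the ensemble search space.
    """
    return [delay_embed(rep, t, dim, tau) for rep in repetitions]


# Past state of a target at t - 1 for two toy repetitions (t = 3).
target_reps = [[0, 1, 2, 3], [10, 11, 12, 13]]
print(ensemble_states(target_reps, t=2, dim=2, tau=1))
```

For the target process, the state is taken at $t-1$; for the source process, the same function would be called with $t-u$ as the time point.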

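Next, we decompose $TE_{SPO}$ into a sum of four individual Shannon entropies:

$$TE_{SPO}(X \to Y, t, u) = H\left(\mathbf{Y}^{d_y}_{t-1}, \mathbf{X}^{d_x}_{t-u}\right) - H\left(Y_t, \mathbf{Y}^{d_y}_{t-1}, \mathbf{X}^{d_x}_{t-u}\right) + H\left(Y_t, \mathbf{Y}^{d_y}_{t-1}\right) - H\left(\mathbf{Y}^{d_y}_{t-1}\right). \tag{8}$$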

The Shannon differential entropies in equation 8 can be estimated in a data efficient way using nearest neighbor techniques [72], [73]. Nearest neighbor estimators yield a non-parametric estimate of entropies, assuming only a smoothness of the underlying PDF. It is however problematic to simply apply a nearest neighbor estimator (for example the Kozachenko-Leonenko estimator [72]) to each term appearing in eq. 8. This is because the dimensionality of the spaces associated with the terms differs greatly across terms. Thus, a fixed number of neighbors for the search would lead to very different spatial scales (range of distances) for each term. Since the bias of each term depends on these scales, the errors would not cancel each other but accumulate. We therefore use a modified KSG estimator which handles this problem by only fixing the number of neighbors $k$ in the highest dimensional space (k-nearest neighbor search, kNNS) and by projecting the resulting distances to the lower dimensional spaces as the range to look for and count neighbors there (range search, RS) (see [57], type 1 estimator, and [56], [74]). In the ensemble variant of TE estimation we proceed by searching for nearest neighbors across points from all repetitions instead of searching the same repetition as the point of reference of the search; thus we form an *ensemble search space* by combining points over repetitions. Finally, the ensemble estimator of TE reads

$$TE_{SPO}(X \to Y, t, u) = \psi(k) + \left\langle \psi\left(n_{\mathbf{y}^{d_y}_{t-1}(r)} + 1\right) - \psi\left(n_{y_t(r)\,\mathbf{y}^{d_y}_{t-1}(r)} + 1\right) - \psi\left(n_{\mathbf{y}^{d_y}_{t-1}(r)\,\mathbf{x}^{d_x}_{t-u}(r)} + 1\right) \right\rangle_r, \tag{9}$$

where $\psi$ denotes the digamma function and the angle brackets ($\langle \cdot \rangle_r$) indicate an averaging over points in different repetitions at time instant $t$. The distances to the $k$-th nearest neighbor in the highest dimensional space (spanned by $Y_t$, $\mathbf{Y}^{d_y}_{t-1}$, $\mathbf{X}^{d_x}_{t-u}$) define the radius of the spheres for the counting of the number of points ($n_{\cdot}$) in these spheres around each state vector involved.
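To make the structure of the ensemble estimator concrete, the following is a minimal pure-Python sketch of a KSG-type (type 1) estimator for the conditional mutual information, with neighbors searched across repetitions. It is an illustration under stated assumptions, not the TRENTOOL or GPU implementation: the function names are hypothetical, the searches are brute force, the maximum norm is assumed as distance measure, and the toy data use one-dimensional states.

```python
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma function psi(n) for a positive integer n."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))


def max_norm(a, b):
    """Maximum-norm distance between two equally long vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def ensemble_te(y_now, y_past, x_past, k=4):
    """KSG-type estimate of I(y_now ; x_past | y_past) in nats.

    Each argument holds one vector per repetition, all taken at the same
    time point, so neighbors are searched across repetitions.
    """
    n_rep = len(y_now)
    joint = [y_now[r] + y_past[r] + x_past[r] for r in range(n_rep)]
    now_past = [y_now[r] + y_past[r] for r in range(n_rep)]
    past_src = [y_past[r] + x_past[r] for r in range(n_rep)]

    def count_within(space, r, radius):
        # Range search: neighbors strictly closer than radius, excluding r.
        return sum(1 for j in range(n_rep)
                   if j != r and max_norm(space[r], space[j]) < radius)

    total = 0.0
    for r in range(n_rep):
        # kNN search in the joint (highest dimensional) space.
        dists = sorted(max_norm(joint[r], joint[j])
                       for j in range(n_rep) if j != r)
        eps = dists[k - 1]
        total += (digamma_int(count_within(y_past, r, eps) + 1)
                  - digamma_int(count_within(now_past, r, eps) + 1)
                  - digamma_int(count_within(past_src, r, eps) + 1))
    return digamma_int(k) + total / n_rep


# Toy ensemble: X drives Y, but Y does not drive X.
rng = random.Random(1)
n_rep = 300
x_past = [[rng.gauss(0, 1)] for _ in range(n_rep)]
x_now = [[rng.gauss(0, 1)] for _ in range(n_rep)]
y_past = [[rng.gauss(0, 1)] for _ in range(n_rep)]
y_now = [[0.5 * y_past[r][0] + x_past[r][0] + 0.3 * rng.gauss(0, 1)]
         for r in range(n_rep)]

te_xy = ensemble_te(y_now, y_past, x_past)
te_yx = ensemble_te(x_now, x_past, y_past)
print("TE X->Y:", round(te_xy, 3), " TE Y->X:", round(te_yx, 3))
```

In this toy system the estimate for the coupled direction should be clearly positive, while the estimate for the reverse direction should scatter around zero. The brute-force searches scale quadratically with the number of points, which is the cost that motivates the parallel implementation.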

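In cases where the number of repetitions is not sufficient to provide the necessary amount of data to reliably estimate Shannon entropies through an ensemble average, one may combine ensemble evaluation with collecting realizations over time. In these cases, we count neighbors in a time window $t' \in [t^-, t^+]$ with $t^- \leq t \leq t^+$, where the window length $t^+ - t^-$ controls the temporal resolution of the TE estimation:

$$TE_{SPO}(X \to Y, t, u) = \psi(k) + \left\langle \psi\left(n_{\mathbf{y}^{d_y}_{t'-1}(r)} + 1\right) - \psi\left(n_{y_{t'}(r)\,\mathbf{y}^{d_y}_{t'-1}(r)} + 1\right) - \psi\left(n_{\mathbf{y}^{d_y}_{t'-1}(r)\,\mathbf{x}^{d_x}_{t'-u}(r)} + 1\right) \right\rangle_{r,\,t'}. \tag{10}$$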

## Implementation

The estimation of TE from finite time series consists of the estimation of joint and marginal entropies as shown in equations 9 and 10, calculated from nearest neighbor statistics, i.e. distances and the count of neighbors within these distances. In practice we obtain these neighbor counts by applying kNNS and RS to reconstructed state spaces. In particular, we use a kNNS in the highest dimensional space to determine the k-th nearest neighbor of a data point and the associated distance. This distance is then used as the range for the RS in the marginal spaces, which returns the point counts *n*. Both searches have a high computational cost. This cost increases even further in a practical setting, where we need to calculate TE for a sufficient number of surrogate data sets for statistical testing (see [19] and below for details). To enable TE estimation and statistical testing despite its computational cost, we implemented ad-hoc kNNS and RS algorithms in NVIDIA® CUDA™ C/C++ code [75]. This allows thousands of searches to run in parallel on a modern GPU.

To allow for a better understanding of the parallelization used, we will now briefly describe the main work flow of TE analysis in the open source MathWorks® MATLAB® toolbox TRENTOOL [56], which implements the approach to TE estimation described in the *Background* section. The work flow includes the steps of data preprocessing prior to the use of the GPU algorithm for neighbor searches as well as the statistical testing of resulting TE values. In a subsequent section we will describe the core implementation of the algorithm in more detail and present its integration into TRENTOOL.

### Main analysis work flow in TRENTOOL

#### Practical TE estimation in TRENTOOL.

The practical GPU-based TE estimation in TRENTOOL 3.0 is divided into the two steps of data preparation and TE estimation (see Figure 2 and the TRENTOOL 3.0 manual: http://www.trentool.de). As a first step, data is prepared by optimizing embedding parameters for state space reconstruction (Figure 2, panel **A**). As a second step, TE is estimated by following the approach for ensemble-based TE estimation outlined in the preceding section (Figure 2, panel **B**). TRENTOOL estimates $TE_{SPO}(X \to Y, t, u)$ (eq. 4) for a given pair of processes $X$ and $Y$ and given values for $u$ and $t$. For each pair, we call $X$ the source and $Y$ the target process.

**Figure 2.** (A) Data preparation and optimization of embedding parameters in function TEprepare.m; (B) transfer entropy (TE) estimation from prepared data in TEsurrogatestats_ensemble.m (yellow boxes indicate variables being passed between sub-functions). TE is estimated via iterating over all channel combinations provided in the data. For each channel combination: (1) Data is embedded individually per repetition and combined over repetitions into one ensemble state space (chunk), (2) $S$ surrogate data sets are created by shuffling the repetitions of the target time series, (3) each surrogate data set is embedded per repetition and combined into one chunk (forming $S + 1$ chunks in total), (4) chunks of original and surrogate data are passed to the GPU where nearest neighbor searches are conducted in parallel, (5) calculation of TE values from returned neighbor counts for original data and surrogate data sets using the KSG estimator [57], (6) statistical testing of original TE value against distribution of surrogate TE values; (C) output of TEsurrogatestats_ensemble.m, an array with dimension [no. channels × 5], where rows hold results for all channel combinations: (1) p-value of TE for this channel combination, (2) significance at the designated alpha level (1 - significant, 0 - not significant), (3) significance after correction for multiple comparisons, (4) absolute difference between the TE value for original data and the median of surrogate TE values, (5) presence of volume conduction (this is always set to 0 when using the ensemble method as instantaneous mixing is by default controlled for by conditioning on the current state of the source time series [119]).

After data preparation, $TE_{SPO}$ (eq. 9 and 10) is estimated in six steps: (1) using optimized embedding parameters, original data is embedded per repetition and repetitions are concatenated, forming the ensemble search space of the original data, (2) $S$ sets of surrogate data are created from the original data by shuffling the repetitions of the target process $Y$, (3) each surrogate data set is embedded per repetition and concatenated, forming $S$ additional ensemble search spaces for surrogate data, (4) all search spaces of embedded original and surrogate data are passed to a wrapper function that calls the GPU functions to perform individual neighbor searches for each search space in parallel (in the following, we will refer to each of the ensembles as one data *chunk*), (5) TE values are calculated for original and surrogate data chunks from the neighbor counts using the KSG estimator [57], (6) TE values for original data are tested statistically against the distribution of surrogate TE values.

The proposed GPU algorithm is accessed in step (4). As we will further explain below (see paragraph on *Input data*), the GPU implementation uses the fact that all of the necessary computations on surrogate data sets and the original data are independent and can thus be performed in parallel.

#### TE calculation and statistical testing against surrogate data.

Estimated TE values need to be tested for their statistical significance [56] (step (6) of the main TRENTOOL work flow). For this statistical test under a null hypothesis of *no* information transfer between a source $X$ and target time series $Y$, we estimate $TE_{SPO}(X \to Y, t, u)$ and compare it to a distribution of TE values calculated from surrogate data sets. Surrogate data sets are formed by shuffling repetitions in $Y$ to obtain $Y'$, such that $\mathbf{y}^{d_y}_{t-1}(r) \to \mathbf{y}^{d_y}_{t-1}(\pi(r))$ and $y_t(r) \to y_t(\pi(r))$, where $\pi$ denotes a random permutation of the repetitions (Figure 3). From this surrogate data set, we calculate surrogate TE values $TE_{SPO}(X \to Y', t, u)$. By repeating this process a sufficient number of times $S$, we obtain a distribution of values $TE_{SPO}(X \to Y', t, u)$. To assess the statistical significance of $TE_{SPO}(X \to Y, t, u)$, we calculate a p-value as the proportion of surrogate TE values equal to or larger than $TE_{SPO}(X \to Y, t, u)$. This p-value is then compared to a critical alpha level (see for example [56], [76]).
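The surrogate procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the test statistic used in the demonstration (an absolute covariance across repetitions) is a deliberately simple stand-in for the TE estimator, chosen so that the example stays short.

```python
import random


def shuffle_repetitions(target_reps, rng):
    """Return the target repetitions in a randomly permuted order."""
    perm = list(range(len(target_reps)))
    rng.shuffle(perm)
    return [target_reps[p] for p in perm]


def permutation_p_value(te_original, te_surrogates):
    """Proportion of surrogate values equal to or larger than the original."""
    n_ge = sum(1 for s in te_surrogates if s >= te_original)
    return n_ge / len(te_surrogates)


def surrogate_test(source_reps, target_reps, statistic, n_surrogates, alpha, rng):
    """Test a statistic against surrogates with permuted target repetitions."""
    te_orig = statistic(source_reps, target_reps)
    te_surr = [statistic(source_reps, shuffle_repetitions(target_reps, rng))
               for _ in range(n_surrogates)]
    p = permutation_p_value(te_orig, te_surr)
    return te_orig, p, p < alpha


def abs_covariance(a, b):
    """Stand-in statistic (NOT transfer entropy): |cov| across repetitions."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return abs(sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / n)


rng = random.Random(3)
source = [rng.gauss(0, 1) for _ in range(100)]         # one value per repetition
target = [x + 0.3 * rng.gauss(0, 1) for x in source]   # coupled within repetition
value, p_value, significant = surrogate_test(
    source, target, abs_covariance, n_surrogates=200, alpha=0.05, rng=rng)
print("p-value:", p_value, " significant:", significant)
```

Shuffling repetitions destroys the pairing between source and target within a repetition while leaving the distribution of the target data at each time point unchanged, which is what the null hypothesis requires.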

(A) Original time series with information transfer (solid arrow) from a source state to a corresponding target time point, given the time point's history. Solid arrows indicate the direction of transfer entropy (TE) analysis, while information transfer is present. (B) Shuffled target time series: repetitions are permuted according to a random permutation. Dashed arrows indicate the direction of TE analysis, while information transfer is no longer present.

#### Reconstruction of information transfer delays.

The TE estimator may be used to reconstruct the information transfer delay between source and target (eq. 5, [53]). The delay may be reconstructed by *scanning* a set of assumed delays: TE is estimated for every assumed delay, and the value that maximizes TE is kept as the reconstructed information transfer delay. We used the reconstruction of information transfer delays as an additional parameter when testing the proposed implementation for correctness and robustness.
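The scanning approach amounts to taking the argmax of TE over the assumed delays. The sketch below assumes that TE has already been estimated for every assumed delay; the function name and the TE values are invented for illustration.

```python
def reconstruct_delay(te_by_delay):
    """Return the assumed delay (in ms) with the largest TE value."""
    return max(te_by_delay, key=te_by_delay.get)


# Invented TE values for four assumed delays (keys in ms).
te_scan = {8: 0.01, 9: 0.03, 10: 0.05, 11: 0.02}
best_delay = reconstruct_delay(te_scan)
```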

### Implementation of the GPU algorithm

#### Parallelized nearest neighbor searches.

The KSG estimator used for estimating TE in eq. 9 and 10 uses neighbor (distance-)statistics obtained from kNNS and RS algorithms to estimate Shannon differential entropies. Thus, the choice of computationally efficient kNNS and RS algorithms is crucial to any practical implementation of the estimator. kNNS algorithms typically return a list of the k nearest neighbors for each reference point, while RS algorithms typically return a list of all neighbors within a given range for each reference point. kNNS and RS algorithms have been studied extensively because of their broad potential for application in nearest neighbor searches and related problems. Several approaches have been proposed to reduce their high computational cost: partitioning of input data into k-d trees, quadtrees or equivalent data structures [77], or approximation algorithms (ANN: Approximate Nearest Neighbors) [78], [79]. Furthermore, some authors have explored how to parallelize the kNNS algorithm on a GPU using different implementations: exhaustive brute force searches [80], [81], tree-based searches [82], [83] and ANN searches [83], [84].

Although the performance of existing implementations of kNNS for GPU was promising, they were not applicable to TE estimation. The most critical reason was that existing implementations did not allow for the concurrent treatment of several problem instances by the GPU, and maximum performance was only achieved for very large kNNS problem instances. Unfortunately, the problem instances typically expected in our application are numerous (i.e. one problem instance for the original data plus one per surrogate data set, for each pair of time series), but rather small compared to the main memory on a typical GPU device in use today. Thus, an implementation that handled only one instance at a time would not have made optimal use of the underlying hardware. Therefore, we designed an implementation that is able to handle several problem instances at once to perform neighbor searches for chunks of embedded original and surrogate data in parallel. Moreover, we aimed at a flexible GPU implementation of kNNS and RS that maximized the use of the GPU's hardware resources for variable configurations of data, thus making the implementation independent of the design of the neuroscientific experiment.

Our implementation is written in CUDA (Compute Unified Device Architecture) [75] (a port to OpenCL™ [85] is work in progress). CUDA is a parallel computing framework created by NVIDIA that includes extensions to high level languages such as C/C++, giving access to the native instruction set and memory of the parallel computational elements in CUDA enabled GPUs. Accelerating an algorithm using CUDA includes translating it into data-parallel sequences of operations and then carefully mapping these operations to the underlying resources to get maximum performance [58], [59]. To understand the implementation suggested here, we will give a brief explanation of these resources, i.e. the GPU's hardware architecture, before explaining the implementation in more detail (additionally, see [58], [59], [75]).

#### GPU resources.

GPU resources comprise massively parallel processors with up to thousands of cores (processing units). These cores are divided among Streaming Multiprocessors (SMs) in order to guarantee automatic scalability of the algorithms to different versions of the hardware. Each SM contains 32 to 192 cores that execute operations described in the CUDA kernel code. The sequence of operations executed by one core is called a CUDA thread. Threads are grouped in blocks, which are in turn organized in a grid. The grid is the entry point to the GPU resources. It handles one kernel call at a time and executes it on multiple data in parallel. Within the grid, each block of threads is executed by one SM. The SM executes the threads of a block by issuing them in groups of 32 threads, called warps. Threads within one warp are executed concurrently, while as many warps as possible are scheduled per SM to be resident at a time, such that the utilization of all the cores is maximized.

#### Input data.

As input, the proposed RS and kNNS algorithms expect a set of data points representing the search space and a second set of data points that serve as reference points in the searches. One such problem instance is considered one data chunk. Our implementation is able to handle several data chunks simultaneously to make maximum use of the GPU resources. Thus, several chunks may be combined, using an additional index vector to encode the sizes of individual chunks. These chunks are then passed at once to the GPU algorithm to be searched in parallel.
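A minimal sketch of how several chunks might be combined into one input with an index vector of chunk sizes. The function names and the flat list layout are assumptions made for illustration; they do not describe the actual data structures passed to the GPU functions.

```python
def pack_chunks(chunks):
    """Concatenate chunks into one point list plus a vector of chunk sizes."""
    points, chunk_sizes = [], []
    for chunk in chunks:
        points.extend(chunk)
        chunk_sizes.append(len(chunk))
    return points, chunk_sizes


def chunk_offsets(chunk_sizes):
    """Start index of every chunk within the packed point list."""
    offsets, start = [], 0
    for n in chunk_sizes:
        offsets.append(start)
        start += n
    return offsets


original = [(0.0, 1.0), (0.5, 0.2), (0.9, 0.4)]
surrogate = [(0.1, 0.3), (0.7, 0.8)]
points, sizes = pack_chunks([original, surrogate])
offsets = chunk_offsets(sizes)
```

The size vector is enough to recover every chunk from the packed list, so each search can be restricted to the points of its own chunk.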

In the estimation of TE, according to the work flow described in paragraph *Practical TE estimation in TRENTOOL*, we used the proposed implementation to parallelize neighbor searches over surrogate data sets for a given pair of time series and a given parameter setting. Thus, in one call to the GPU algorithms several data chunks were passed as input, where one chunk represented the search space for the original pair of time series and the remaining chunks represented the search spaces for the corresponding surrogate data sets. Points within the search spaces may have been collected through temporal pooling or ensemble pooling of embedded data points, or a combination of both (eq. 9 or 10).

#### Core algorithm.

In the core GPU-based search algorithm, the kNNS implementation is mapped to CUDA threads as depicted in Figure 4 (the RS implementation behaves similarly). Each chunk consists of a set of data points that represents the search space and is at the same time used as the set of reference points for individual searches. Each individual search is handled by one CUDA thread. Parallelization of these searches on the GPU happens in two ways: (1) the GPU algorithm is able to handle several chunks, (2) each chunk can be searched in parallel, such that individual searches within one chunk are handled simultaneously. An individual search is conducted by a CUDA thread by brute-force measuring the infinity norm distance of the given reference point to every other point within the same chunk. Simultaneously, other threads measure these distances for other points in the same chunk or handle a different chunk altogether. Searching several chunks in parallel is an essential feature of the proposed solution, which maximizes the utilization of GPU resources. From the GPU execution point of view, simultaneous searches are realized by handling a variable number of kNNS (or RS) problem instances through one grid launch. The number of searches that can be executed in parallel is thus only limited by the device's global memory that holds the input data and the number of threads that can be started simultaneously (both limitations are taken into account). Furthermore, the implementation follows the low-level optimization strategies described below to maximize performance.
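As a serial reference for what each CUDA thread computes, the sketch below runs a brute-force kNNS and RS under the infinity norm within a single chunk. It is not the CUDA kernel: the function names are hypothetical, the loops that the GPU runs in parallel are written sequentially, and counting neighbors with a strict inequality is an assumption about the range search.

```python
def max_norm(a, b):
    """Infinity norm (maximum norm) distance between two points."""
    return max(abs(x - y) for x, y in zip(a, b))


def knn_distances(chunk, k):
    """Distance to the k-th nearest neighbor of every point in the chunk.

    Each iteration of the outer loop corresponds to one individual search,
    i.e. the work of one CUDA thread in the parallel version.
    """
    result = []
    for i, ref in enumerate(chunk):
        dists = sorted(max_norm(ref, p) for j, p in enumerate(chunk) if j != i)
        result.append(dists[k - 1])
    return result


def range_counts(chunk, radii):
    """Number of neighbors strictly closer than the given radius, per point."""
    counts = []
    for i, (ref, radius) in enumerate(zip(chunk, radii)):
        counts.append(sum(
            1 for j, p in enumerate(chunk)
            if j != i and max_norm(ref, p) < radius
        ))
    return counts


# kNNS in the higher dimensional space, then RS in a lower dimensional space
# using the returned distances as radii.
high_dim = [(0, 0), (1, 5), (3, 1), (7, 2)]
low_dim = [(p[0],) for p in high_dim]
radii = knn_distances(high_dim, k=1)
counts = range_counts(low_dim, radii)
```

Searching several chunks then corresponds to repeating these two calls per chunk; on the GPU all of these searches are issued in one grid launch.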

Chunks of data are prepared on the CPU (embedding and concatenation) and passed to the GPU. Data points are managed in the global memory as Structures of Arrays (SoA). To make maximum use of the memory bandwidth, data is padded to ensure coalesced reading and writing from and to the streaming multiprocessor (SM) units. Each SM handles one chunk in one thread block (dashed box). One block conducts brute force neighbor searches for all data points in the chunk and collects results in its shared memory (red and blue arrows and shaded areas). Results are eventually returned to the CPU.

#### Low-level implementation details.

There are several strategies that are essential for optimal performance when implementing algorithms for GPU devices. Most important are the reduction of memory latencies and the optimal use of hardware resources by ensuring high occupancy (the ratio of the number of active warps per SM to the maximum number of possible active warps [58]). To maximize occupancy, we designed our algorithm's kernels such that always more than one block of threads (ideally many) is loaded per SM [58]. We can do this since many searches are executed concurrently in every kernel launch. By maximizing occupancy, we both ensure hardware utilization and improve performance by hiding the latency of data transfers from the GPU's global memory to the SMs' registers [75]. Moreover, in order to reduce memory latencies, we take care of input data memory alignment and guarantee that memory reads issued by the threads of a warp are coalesced into as few memory transfers as possible. Additionally, with the aim of minimizing sparse data accesses to memory, data points are organized as Structures of Arrays (SoA). Finally, we use the shared memory inside the SMs (a programmer-managed intermediate cache between global memory and SMs) to keep track of information associated with the nearest neighbors during searches. The amount of shared memory and registers is limited in an SM. The maximum possible occupancy depends on the number of registers and the amount of shared memory needed by a block, which in turn depends on the number of threads in the block. For our implementation, we used a block size of 512 threads.
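The Structures of Arrays layout with padding can be sketched on the host side as follows. The padding to a multiple of the warp size (32) and the use of infinity as a padding value are assumptions made for this illustration; the actual alignment rules of the implementation are not specified here.

```python
WARP_SIZE = 32  # threads per warp, as described in the section on GPU resources


def to_padded_soa(points, pad_value=float("inf")):
    """Convert a list of points into one padded array per dimension.

    Storing each coordinate contiguously lets neighboring threads read
    neighboring memory addresses; padding keeps every array length a
    multiple of the warp size (assumed alignment for this sketch).
    """
    n_dims = len(points[0])
    padded_len = -(-len(points) // WARP_SIZE) * WARP_SIZE  # ceiling to multiple
    soa = []
    for d in range(n_dims):
        column = [p[d] for p in points]
        column.extend([pad_value] * (padded_len - len(column)))
        soa.append(column)
    return soa


soa = to_padded_soa([(0.0, 1.0), (0.5, 0.2), (0.9, 0.4)])
```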

#### Implementation interface.

The GPU functionality is accessed through MATLAB scripts for kNNS (‘fnearneigh_gpu.mex’) and RS (‘range_search_all_gpu.mex’), which encapsulate all the associated complexity. Both scripts are called from TRENTOOL using a wrapper function. In its current implementation in TRENTOOL (see paragraph *Practical TE estimation in TRENTOOL*), the wrapper function takes all chunks as input and launches a kernel that searches all chunks in parallel through the mex-files for kNNS and RS. The wrapper makes sure that the input size does not exceed the GPU device's available global memory and the maximum number of threads that can be started simultaneously. If necessary, the wrapper function splits the input into several kernel calls; it also manages the output, i.e. the neighbor counts for each chunk, which are passed on for TE calculation.
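The splitting performed by the wrapper can be sketched as a greedy grouping of chunks under a size budget. The budget is expressed here as a hypothetical maximum number of points per kernel call; the real wrapper checks the device's global memory and thread limits, which this sketch does not model.

```python
def split_into_calls(chunk_sizes, max_points_per_call):
    """Group consecutive chunk indices so that each group fits the budget."""
    calls, current, used = [], [], 0
    for idx, n in enumerate(chunk_sizes):
        if n > max_points_per_call:
            raise ValueError("a single chunk exceeds the per-call budget")
        if current and used + n > max_points_per_call:
            calls.append(current)
            current, used = [], 0
        current.append(idx)
        used += n
    if current:
        calls.append(current)
    return calls


# Five chunks of four points each, with room for ten points per call.
calls = split_into_calls([4, 4, 4, 4, 4], max_points_per_call=10)
```

Chunks are never split across calls in this sketch, which keeps every individual search confined to one kernel launch.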

## Evaluation

To evaluate the proposed algorithm we investigated four properties: first, whether the speedup is sufficient to allow the application of the method to real-world neural datasets; second, the correctness of results on simulated data, where the ground truth is known; third, the robustness of the algorithm for limited sample sizes; fourth, whether plausible results are achieved on a neural example dataset.

### Ethics statement

The neural example dataset was taken from an experiment described in [86]. All subjects gave written informed consent before the experiment. The study was approved by the local ethics committee (Johann Wolfgang Goethe University, Frankfurt, Germany).

### Evaluation of computational speedup

To test for an increase in performance due to the parallelization of neighbor searches, we compared practical execution times of the proposed GPU implementation to execution times of the serial kNNS and RS algorithms implemented in the MATLAB toolbox TSTOOL (http://www.dpi.physik.uni-goettingen.de/tstool/). This toolbox wraps a FORTRAN implementation of kNNS and RS, and has proven the fastest CPU toolbox for our purpose. All testing was done in MATLAB 2008b (MATLAB 7.7, The MathWorks Inc., Natick, MA, 2008). As input, we used increasing numbers of chunks of simulated data from two coupled Lorenz systems, further described below. Repetitions of simulated time series were embedded and combined to form ensemble state spaces, i.e. chunks of data (cf. paragraph *Input data*). To obtain increasing input sizes, we duplicated these chunks the desired number of times. While the CPU implementation needed to iteratively perform searches on individual chunks, the GPU implementation searched chunks in parallel (note that chunks are treated independently here, so that there is no speedup because of the duplicated chunk data). Data handling prior to the nearest neighbor searches was identical for the CPU and the GPU implementation, as it was conducted using the same, highly optimized TRENTOOL functionalities. We were thus able to confine the testing of performance differences to the respective kNNS and RS algorithms.

Analogous to TE estimation implemented in TRENTOOL, we conducted one kNNS (with k = 4, TRENTOOL default, see also [87]) in the highest dimensional space and used the returned distances for an RS in one lower dimensional space. Both functions were called for increasing numbers of chunks to obtain the execution time as a function of input size. One chunk of data from the highest dimensional space had dimensions [30094 × 17] and size 1.952 MB (single precision); one chunk of data from the lower dimensional space had dimensions [30094 × 8] and size 0.918 MB (single precision). Performance testing of the serial implementation was carried out on an Intel Xeon CPU (E5540, clocked at 2.53 GHz), where we measured execution times of the TSTOOL kNNS (functions ‘nn_prepare.m’ and ‘nn_search.m’) and the TSTOOL RS (function ‘range_search.m’). Testing of the parallel implementation was carried out three times on GPU devices of varying processing power (NVIDIA Tesla C2075, GeForce GTX 580 and GeForce GTX Titan). On the GPUs, we measured execution times for the proposed kNNS (‘fnearneigh_gpu.mex’) and RS (‘range_search_all_gpu.mex’) implementation. When the GPU's global memory capacity was exceeded by higher input sizes, data was split and computed over several runs (i.e. calls to the GPU). All performance testing was done by measuring execution times using the MATLAB functions tic and toc.

To obtain reliable results for the serial implementation we ran both kNNS and RS 200 times on the data, receiving an average execution time of 1.26 s for kNNS and an average execution time of 24.1 s for RS. We extrapolated these execution times to higher numbers of chunks and compared them to measured execution times of the parallel searches on three NVIDIA GPU devices. On average, execution times on the GPU compared to the CPU were faster by a factor of 22 on the NVIDIA Tesla C2075, by a factor of 33 for the NVIDIA GTX 580 and by a factor of 50 for the NVIDIA GTX Titan (Figure 5).

Combined execution times in s for serial and parallel implementations of k-nearest neighbor and range search as a function of input size (number of data chunks). Execution times were measured for the serial implementation running on a CPU (black) and for our parallel implementation using one of three GPU devices (blue, red, green) of varying computing power. Computation using a GPU was considerably faster than using a CPU (by factors 22, 33 and 50 respectively).

To put these numbers into perspective, we note that in a neuroscience experiment the number of chunks to be processed is the product of (typical numbers): channel pairs for TE (100) × number of surrogate data sets (1000) × experimental conditions (4) × number of subjects (15). This results in a total computational load on the order of $6 \times 10^{6}$ chunks to be processed. Given an execution time of 24.1 s/50 on the NVIDIA GTX Titan for a typical test dataset, these computations will take approximately $2.9 \times 10^{6}$ s or 4.8 weeks on a single GPU, which is feasible compared to the initial duration of 240 weeks on a single CPU. Even when considering a trivial parallelization of the computations over multiple CPU cores and CPUs, the GPU-based solution is by far more cost and energy efficient than any possible CPU-based solution. If, in addition, a scanning of various possible information transfer delays is important, then parallelization over multiple GPUs seems to be the only viable option.
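The estimate above follows from a short calculation, reproduced here as a sketch. All inputs are the typical numbers quoted in the text; like the text, the calculation uses the range search time of 24.1 s per chunk as the serial reference.

```python
# Typical experiment size quoted in the text.
channel_pairs = 100
surrogate_sets = 1000
conditions = 4
subjects = 15
n_chunks = channel_pairs * surrogate_sets * conditions * subjects

cpu_seconds_per_chunk = 24.1   # serial execution time per chunk
gpu_speedup = 50               # factor measured on the NVIDIA GTX Titan
seconds_per_week = 7 * 24 * 3600

gpu_seconds = n_chunks * cpu_seconds_per_chunk / gpu_speedup
gpu_weeks = gpu_seconds / seconds_per_week
cpu_weeks = n_chunks * cpu_seconds_per_chunk / seconds_per_week

print(f"{n_chunks} chunks: {gpu_weeks:.1f} weeks on GPU, {cpu_weeks:.0f} weeks on CPU")
```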

### Evaluation on Lorenz systems

To test the ability of the presented implementation to successfully reconstruct information transfer between systems with a non-stationary coupling, we simulated various coupling scenarios between stochastic and deterministic systems. We introduced non-stationarity into the coupling of two processes by varying the coupling strength over the course of a repetition (all other parameters were held constant). Simulations for individual scenarios are described in detail below. For the estimation of TE we used MATLAB (The MathWorks Inc.) and the TRENTOOL toolbox extended by the implementation of the ensemble method proposed above (version 3.0, see also [56] and http://www.trentool.de). For a detailed testing of the used estimator (eq. 4) refer to [53].

#### Coupled Lorenz systems.

Simulated data was taken from two unidirectionally coupled Lorenz systems. The systems interacted in one direction only, according to eq. 11, whose parameters are the coupling delay, the coupling strength, and the *Prandtl number*, the *Rayleigh number*, and a geometrical scale of the Lorenz systems. Note that the test cases contained no self feedback and no coupling in the reverse direction. Numerical solutions to these differential equations were computed using the *dde23* solver in MATLAB and results were resampled such that the delays amounted to the values given below. We analyzed the V-coordinates of the systems.

We introduced non-stationarity in the coupling between both systems by varying the coupling strength over time. In particular, a non-zero coupling was set for a limited time interval only, whereas before and after this interval the coupling strength was set to zero. A constant information transfer delay was simulated for the whole coupling interval. We simulated 150 repetitions with 3000 data points each, with a coupling interval from approximately 1000 to 2000 data points (see Figure 6, panel **A**).

We used two dynamically coupled Lorenz systems (A) to simulate non-stationarity in data generating processes. A coupling was present during a time interval from 1000 to 2000 ms only (no coupling otherwise). The information transfer delay was held constant. Transfer entropy (TE) values were reconstructed using the ensemble method combined with the scanning approach proposed in [53] to reconstruct information transfer delays. Assumed delays were scanned from 35 to 55 ms (1 ms resolution). In (B) the maximum TE values for original data over this interval are shown in blue. Red bars indicate the corresponding mean over surrogate TE values (error bars indicate 1 SD). Significant TE was found for the second time window only; here, the delay was reconstructed as 49 ms.

For each scenario, 500 surrogate data sets were computed to allow for statistical testing of the reconstructed information transfer. Surrogate data were created by blockwise permutation of the target time series (Figure 3), where each block is one repetition that is left intact. The number of neighbors k for the nearest neighbor search was set to 4 for all analyses (TRENTOOL default, see also [87]).

#### Results.

We analyzed data from three time windows (200 to 450 ms, 1600 to 1850 ms, and 2750 to 3000 ms) using the estimator proposed in eq. 10, assuming local stationarity within each window (Figure 6, panel **A**). For each time window, we scanned assumed delays from 35 to 55 ms. Figure 6, panel **B**, shows the maximum TE value from original data (blue) over all assumed delays and the corresponding mean surrogate TE value (red). Significant differences between original TE and surrogate TE were found in the second time window only (indicated by an asterisk). No significant information transfer was found during the non-coupling intervals. The information transfer delay reconstructed for the second analysis window was 49 ms. Thus, the proposed implementation was able to detect the coupling between both systems and reconstructed the corresponding information transfer delay with an error of less than 10%.

### Evaluation on autoregressive processes

To assess the performance of the proposed implementation on non-abrupt changes in coupling, we simulated various coupling scenarios for two autoregressive processes of order 1 (AR(1)-processes) with variable couplings over time. In each scenario, couplings were modulated using hyperbolic tangent functions to realize a smooth transition between uncoupled and coupled regimes. The AR(1)-processes were simulated according to eqs. 12 and 13, whose parameters are the AR parameters, the coupling strengths, the coupling delays, and uncorrelated, unit-variance, zero-mean Gaussian white noise terms.

#### Simulated coupling scenarios.

We simulated three coupling scenarios, where the coupling varied in strength over the course of a repetition (duration 3000 ms): (1) unidirectional coupling with a coupling onset around 1000 ms; (2) unidirectional coupling with a two-step increase in coupling at around 1000 ms and around 2000 ms; (3) bidirectional coupling with onset around 1000 ms in one direction and around 2000 ms in the other direction. See Table 1 for specific parameter values used in each scenario.

We realized a varying coupling strength by modulating the coupling parameters with a hyperbolic tangent function. No coupling was realized by setting the corresponding coupling parameter to zero. For scenarios (1) and (3), the couplings were given by eqs. 14 and 15, where 0.05 was the slope and 2000 and 1000 were the inflection points of the hyperbolic tangent, respectively. Note that we additionally scaled the tanh function such that its values ranged from 0 to 1. For coupling scenario (2), the two-step increase in coupling was given by eq. 16.

We chose the arguments of the hyperbolic tangent function such that the function's slope led to a smooth increase in the coupling over an epoch of approximately 200 ms around the inflection points at 1 and 2 s respectively (Figure 7, panels **A**–**D**). For each scenario, we simulated 50 trials of length 3000 ms with a sampling rate of 1000 Hz. We then estimated time-resolved TE for analysis windows of length 300 ms. Again, we mixed temporal and ensemble pooling according to eq. 10. For the scenario with unidirectional coupling (1) we used four analysis windows to cover the change in coupling (from 0.2 to 0.5 s, 0.5 to 0.8 s, 0.8 to 1.1 s, and 1.1 to 1.4 s, see Figure 7, panel **E**); for the two-step increase (2) and bidirectional (3) scenarios, we used eight analysis windows each (from 0.2 to 0.5 s, 0.5 to 0.8 s, 0.8 to 1.1 s, 1.1 to 1.4 s, 1.4 to 1.7 s, 1.7 to 2.0 s, 2.0 to 2.3 s, and 2.3 to 2.6 s, see Figure 7, panels **F** and **G**). As for the Lorenz systems, 500 surrogate data sets were used for the statistical testing in each analysis. Surrogate data were created by blockwise (i.e. repetitionwise) permutation of data points in the target time series. The number of neighbors k for the nearest neighbor search was set to 4 for all analyses (TRENTOOL default, see also [87]).
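The smooth coupling onset can be illustrated with a scaled hyperbolic tangent. The sketch uses the slope of 0.05 and an inflection point at 1000 ms from the text; the exact form of the coupling equations is not reproduced here, so the function below, its name, and the `max_strength` parameter are assumptions for illustration.

```python
import math


def coupling_profile(t_ms, inflection_ms, slope=0.05, max_strength=1.0):
    """Scaled tanh that rises smoothly from 0 to max_strength around the inflection."""
    return max_strength * 0.5 * (math.tanh(slope * (t_ms - inflection_ms)) + 1.0)


# Coupling strength over one repetition of 3000 ms, sampled at 1000 Hz.
profile = [coupling_profile(t, inflection_ms=1000) for t in range(3000)]
```

With this slope the profile is practically 0 at 900 ms and practically 1 at 1100 ms, which matches the transition epoch of approximately 200 ms around the inflection point.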

We simulated two dynamically coupled autoregressive processes (A) with coupling delays of 10 ms and 20 ms, and three coupling scenarios: (B) unidirectional coupling (blue line) with onset around 1 s, coupling in the reverse direction set to 0 (red line); (C) unidirectional coupling (blue line) with onset around 1 s and an increase in coupling strength at around 2 s, coupling in the reverse direction set to 0 (red line); (D) bidirectional coupling, with onset around 1 s in the first direction (blue line) and onset around 2 s in the second direction (red line). (E-G) Time-resolved transfer entropy (TE) for both directions of interaction; blue and red lines indicate raw TE values for the first and second direction respectively. Dashed lines denote significance thresholds at 0.01% (corrected for multiple comparisons over signal combinations). Shaded areas (red and blue) indicate the maximum absolute TE values for significant information transfer (indicated by asterisks in red and blue). (E) TE values for unidirectional coupling; (F) unidirectional coupling with a two-step increase in coupling strength; (G) bidirectional coupling.

#### Results – Scenario (1), unidirectional coupling.

For scenario (1) of two unidirectionally coupled AR(1)-processes, we used a scanning approach [53] to reconstruct TE and the corresponding information transfer delay. We scanned a range of assumed delays and used four analysis windows of length 300 ms each, ranging from 0.2 to 1.4 s. For the first two analysis windows (0.2 to 0.5 s and 0.5 to 0.8 s), no significant information transfer was found. For the third and fourth analysis windows we detected significant TE, with a maximum significant TE value at 7 ms for the third analysis window (0.8 to 1.1 s) and a maximum at 9 ms for the fourth window (1.1 to 1.4 s). Thus, the proposed implementation was able to detect information transfer between both processes if present (later than 1.1 s). During the transition in coupling strength between 0.8 and 1.1 s, TE was detected, but the method showed a small error in the reconstructed information transfer delay. This may be due to too little data to detect the weaker coupling at this epoch of the simulated coupling (see below).

#### Results – Scenario (2), unidirectional coupling with two-step increase.

For scenario (2), we again used the scanning approach for TE reconstruction over an interval of assumed delays; the true delay was simulated at 10 ms. No TE was detected prior to the coupling onset around 1 s. TE was detected for analysis windows 4, 5, and 6 (1.1 to 1.4, 1.4 to 1.7, 1.7 to 2.0 s) with reconstructed information transfer delays of 10, 4, and 7 ms respectively. Further, significant TE was found for analysis windows 7 and 8 (after the second increase in coupling strength around 2 s). Here, the correct delay of 10 ms was reconstructed. One false positive result was obtained in window 6 (1.7 to 2.0 s), where significant TE was found in the reverse, uncoupled direction.

Note that the method's ability to recover information transfer from data depends on the strength of the coupling relative to the amount of data that is available for TE estimation. This is observable in the reconstructed TE in the third analysis window for scenarios (1) and (2): in scenario (2) no TE is detected, whereas in scenario (1) weak information transfer is already reconstructed for the third window. Note that in scenario (2) the simulated coupling between 1 and 2 s is much weaker than the coupling in the unidirectional scenario (1) (Figure 7, panels **C** and **B**). This resulted in smaller and non-significant absolute TE values and in reconstructed information transfer delays that were less precise.

#### Results – Scenario (3), bidirectional coupling.

For scenario (3), we used the scanning approach for TE reconstruction over an interval of assumed delays; the true delays were simulated at 10 ms for the first direction and 20 ms for the second direction. No TE in either direction was detected prior to the first coupling onset around 1 s. TE for the first direction was detected after coupling onset around 1 s for analysis windows 4, 5, 6, 7, and 8. Reconstructed information transfer delays were 8 and 2 ms for analysis windows 4 and 5. For each of the following analysis windows 6 to 8 the correct delay of 10 ms was reconstructed.

TE for the second direction was detected after coupling onset around 2 s for analysis windows 7 and 8, where the correct delay of 20 ms was also reconstructed. Thus, the proposed implementation was able to reconstruct information transfer in bidirectionally coupled systems.

### Evaluation of the robustness of ensemble-based TE-estimation

We tested the robustness of the ensemble method for cases where the amount of data available for TE estimation was severely limited. We created two coupled Lorenz systems, from which we sampled a maximum number of 300 repetitions of 300 ms each at 1000 Hz, using a fixed coupling delay (see equation 11). We embedded the resulting data with their optimal embedding parameters for different values of the assumed delay (30 to 60 ms, step size of 1 ms, also see equation 4). From the embedded data, we used subsets of data points with varying size (500, 2000, 5000, 10000, and 30000 data points) to estimate TE according to equation 10 (we always used the first consecutive data points for TE estimation). For each assumed delay and each number of data points, we created surrogate data to test the estimated TE value for statistical significance. Furthermore, we reconstructed the corresponding information transfer delay for each number of data points by finding the maximum TE value over all assumed delays. A reconstructed TE value was considered a robust estimation of the simulated coupling if the reconstructed delay recovered the simulated information transfer delay within a fixed error tolerance.

A sufficiently accurate reconstruction was reached for 10000 and 30000 data points (Figure 8). For 5000 data points, the estimation was off by approximately 7% (the reconstructed information transfer delay was 48 ms). Using less data for the estimation led to a further decline in the accuracy of the recovered information transfer delay (here, reconstructed delays were 50 ms and 54 ms for 2000 and 500 data points respectively).

Estimated transfer entropy (TE) values for estimations using varying numbers of data points (color coded) as a function of the assumed delay. Data was sampled from two coupled Lorenz systems. The simulated information transfer delay is indicated by a vertical dotted line. Sampled data was embedded and varying numbers of embedded data points (500, 2000, 5000, 10000, 30000) were used for TE estimation. For each estimation, the maximum TE value over all assumed delays is indicated by a solid dot. Dashed lines indicate significance thresholds.

### Evaluation on neural time series from magnetoencephalography

To demonstrate the proposed method's suitability for time-resolved reconstruction of information transfer and the corresponding delays from biological time series, we analyzed magnetoencephalographic (MEG) recordings from a perceptual closure experiment described in [86].

#### Subjects.

MEG data were obtained from 15 healthy subjects (11 females; mean ± SD age, 25.4 ± 5.6 years), recruited from the local community.

#### Task.

Subjects were presented with a randomized sequence of degraded black and white pictures of human faces [88] (Figure 9, panel **A**) and scrambled stimuli, where black and white patches were randomly rearranged to minimize the likelihood of detecting a face. Subjects had to indicate the detection of a face or no-face by a button press. Each stimulus was presented for 200 ms, with a random inter-repetition interval (IRI) of 3500 to 4500 ms (Figure 9, panel **E**). For further analysis we used repetitions with correctly identified face conditions only.

Time-resolved reconstruction of transfer entropy (TE) from magnetoencephalographic (MEG) source data, recorded during a face recognition task. (A) Face stimulus [88]. (B) Cortical sources after beamforming of MEG data (L, left; R, right: L orbitofrontal cortex (OFC); R middle frontal gyrus (MiFG); L inferior frontal gyrus (IFG left); R inferior frontal gyrus (IFG right); L anterior inferotemporal cortex (aTL left); L cingulate gyrus (cing); R premotor cortex (premotor); R superior temporal gyrus (STG); R anterior inferotemporal cortex (aTL right); L fusiform gyrus (FFA); L angular/supramarginal gyrus (SMG); R superior parietal lobule/precuneus (SPL); L caudal ITG/LOC (cITG); R primary visual cortex (V1)). (C) Reconstructed TE in three single subjects (red box) in three time windows (0−150 ms, 150−300 ms, 300−450 ms). Each link (red arrows) corresponds to significant TE on single subject level (corrected for multiple comparisons). (D) Thresholded TE links over 15 subjects (blue box) in three time windows (0−150 ms, 150−300 ms, 300−450 ms). Each link (black arrows) corresponds to significant TE in eight or more individual subjects (after correction for multiple comparisons). Blue arrows indicate differences between time windows, i.e. links that occur for the first time in the respective window. (E) Experimental design: the stimulus was presented for 200 ms (gray shading); during the inter stimulus interval (ISI, 1800 ms) a fixation cross was displayed.

#### MEG and MRI data acquisition.

MEG data were recorded using a 275-channel whole-head system (Omega 2005, VSM MedTech Ltd., BC, Canada) at a rate of 600 Hz in a synthetic third order axial gradiometer configuration. The data were filtered with 4th order Butterworth filters with 0.5 Hz high-pass and 150 Hz low-pass. Behavioral responses were recorded using a fiber optic response pad (Lumitouch, Photon Control Inc., Burnaby, BC, Canada).

Structural magnetic resonance images (MRI) were obtained with a 3 T Siemens Allegra, using a 3D magnetization-prepared rapid-acquisition gradient echo sequence. Anatomical images were used to create individual head models for MEG source reconstruction.

#### Data analysis.

MEG data were analyzed using the open source MATLAB toolboxes FieldTrip (version 2008-12-08; [89]), SPM2 (http://www.fil.ion.ucl.ac.uk/spm/), and TRENTOOL [56]. We briefly describe the applied analysis here; for a more in-depth treatment refer to [86].

For data preprocessing, data epochs (repetitions) were defined from the continuously recorded MEG signals from −1000 to 1000 ms with respect to the onset of the visual stimulus. Only data repetitions with correct responses were considered for analysis. Data epochs contaminated by eye blinks, muscle activity, or jump artifacts in the sensors were discarded. Data epochs were baseline corrected by subtracting the mean amplitude during an epoch ranging from −500 to −100 ms before stimulus onset.

To investigate differences in source activation in the face and non-face condition, we used a frequency domain beamformer [90] at frequencies of interest that had been identified at the sensor level (80 Hz with a spectral smoothing of 20 Hz). We computed the frequency domain beamformer filters for combined data epochs (“common filters”) consisting of activation (multiple windows, duration, 200 ms; onsets at every 50 ms from 0 to 450 ms) and baseline data (−350 to −150 ms) for each analysis interval. To compensate for the short duration of the data windows, we used a regularization of [91].

To find significant source activations in the face versus non-face condition, we first conducted a within-subject t-test for activation versus baseline effects. Next, the t-values of this test statistic were subjected to a second-level randomization test at the group level to obtain effects of differences between face and no-face conditions; a p-value below 0.01 was considered significant. We identified 14 sources with differential spectral power between both conditions in the frequency band of interest in occipital, parietal, temporal, and frontal cortices (see Figure 9, panel **B**, and [86] for exact anatomical locations). We then reconstructed source time courses for TE analysis, this time using a broadband beamformer with a bandwidth of 10 to 150 Hz.

We estimated TE between beamformer source time courses using our ensemble method with a mixed pooling of embedded time points over repetitions and time windows (eq. 10). We analyzed three non-overlapping time windows of 150 ms each (0–150 ms, 150–300 ms, 300–450 ms, Figure 9, panel **C**). We furthermore reconstructed information transfer delays for significant information transfer by scanning over a range of assumed delays from 5 to 17 ms (resolution 2 ms), following the approach in [53]. We corrected the resulting information transfer pattern for cascade effects as well as common drive effects using a graph-based post-hoc correction proposed in [54].

#### Results.

Time-resolved GPU-based TE analysis revealed significant information transfer at the group level (corrected for multiple comparisons; binomial test under the null hypothesis that the number of occurrences of a link is binomially distributed) that changed over time (Figure 9, panel **D**, and Table 2 for reconstructed information transfer delays). Our preliminary findings of information transfer are in line with hypotheses formulated in [92], [93] and [86], and the time-dependent changes show our method's sensitivity to the dynamics of information processing during experimental stimulation, in line with the simulation results above.
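The group-level binomial test can be illustrated with a short calculation. This is a minimal sketch, not the analysis code used in the study: the function name `binomial_tail` is hypothetical, and the per-subject false positive rate `p0 = 0.05` is an assumption chosen for illustration. Only the 15 subjects and the threshold of eight subjects are taken from the caption of Figure 9.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p0: float) -> float:
    """Return P(K >= k_min) for K ~ Binomial(n, p0)."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(k_min, n + 1)
    )


# 15 subjects and a threshold of 8 subjects come from the Figure 9 caption.
# p0 = 0.05 is an ASSUMED per-subject false positive rate for a single link.
p_group = binomial_tail(k_min=8, n=15, p0=0.05)
print(f"P(K >= 8 | n = 15, p0 = 0.05) = {p_group:.2e}")
```

Under these assumptions, a link appearing in eight or more of 15 subjects by chance alone is very unlikely, which is the rationale for thresholding links at the group level.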

## Discussion

### Efficient transfer entropy estimation from an ensemble of time series

We presented an efficient implementation of the ensemble method for TE estimation proposed by [55]. As laid out in the introduction, estimating TE from an ensemble of data makes it possible to analyze information transfer between time series that are non-stationary and enables the estimation of TE in a time-resolved fashion. This is especially relevant to neuroscientific experiments, where rapidly changing (and thus non-stationary) neural activity is believed to reflect neural information processing. However, up until now the ensemble method has remained out of reach for application in neuroscience because of its computational cost. Only with parallelization on a GPU, as presented here, does the ensemble method become a viable tool for the analysis of neural data. Thus, our approach makes it possible for the first time to efficiently analyze information transfer between neural time series on short time scales. This allows us to handle the non-stationarity of underlying processes and makes a time-resolved estimation of TE possible. To facilitate the use of the ensemble method, it has been implemented as part of the open source toolbox TRENTOOL (version 3.0).

Even though we will focus on neural data when discussing applications of the ensemble method for TE estimation below, this approach is well suited for applications in other fields. For example, TE as defined in [14] has been applied in physiology [42]–[44], climatology [94], [95], financial time series analysis [45], [96], and in the theory of cellular automata [48]. Large datasets from these and other fields may now be easily analyzed with the presented approach and its implementation in TRENTOOL.

### Notes on the practical application of the ensemble method for TE estimation

#### Applicability to simulated and real world experimental data.

To validate the proposed implementation of the ensemble method, we applied it to simulated data as well as MEG recordings. For simulated data, information transfer could reliably be reconstructed despite the non-stationarity in the underlying generating processes. For MEG data the obtained speed-up was large enough to analyze these data in practical time. Information transfer reconstructed in a time-resolved fashion from the MEG source data was in line with findings by [86], [92], [93], as discussed below.

Note that even though our proposed implementation of the ensemble method reduces analysis times by a significant amount, the estimation of TE from neural time series is still time consuming relative to other measures of connectivity. For the example MEG data set presented in this paper, TE estimation for one subject and one analysis window took 93 hours on average (when scanning over seven values for the assumed information transfer delay and reconstructing TE for all possible combinations of 14 sources). Thus, for 15 subjects with three analysis windows each, the whole analysis would take approximately six months when carried out in a serial fashion on one computer equipped with a modern GPU (e.g. NVIDIA GTX Titan). This time may, however, be reduced by parallelizing the analysis over subjects and analysis windows on multiple GPUs, as was done for this study.
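The runtime estimate follows from simple arithmetic on the numbers quoted in the text. The sketch below reproduces it; the variable and function names are hypothetical, the reading of "all possible combinations of 14 sources" as directed source-target pairs is an assumption, and the conversion to months uses an average month length of 30.44 days.

```python
import math

# Values quoted in the text.
delays_ms = list(range(5, 18, 2))   # assumed delays: 5 to 17 ms, resolution 2 ms
n_sources = 14
hours_per_run = 93                  # one subject, one analysis window
n_subjects = 15
n_windows = 3

# ASSUMPTION: "all possible combinations" read as directed source-target pairs.
n_directed_pairs = n_sources * (n_sources - 1)

n_runs = n_subjects * n_windows
total_hours = hours_per_run * n_runs
total_months = total_hours / 24 / 30.44


def wall_clock_hours(n_gpus: int) -> int:
    """Wall-clock time if whole runs are distributed evenly over n_gpus devices."""
    return math.ceil(n_runs / n_gpus) * hours_per_run


print(len(delays_ms), "delays,", n_directed_pairs, "directed pairs")
print(f"serial: {total_hours} h, about {total_months:.1f} months")
print(f"with 9 GPUs: {wall_clock_hours(9)} h")
```

The serial total of 45 runs at 93 hours each is consistent with the "approximately six months" stated above, and distributing whole runs over several GPUs reduces the wall-clock time roughly in proportion to the number of devices.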

#### Available data and choice of window size.

As available data is often limited in neuroscience and other real-world applications, the user has to make sure that enough data enters the analysis, such that a reliable estimation of TE is possible. In the proposed implementation of the ensemble method for TE estimation, the amount of data entering the estimation directly depends on the size of the chosen analysis window and the number of available repetitions of the process being analyzed. Furthermore, the choice of the embedding parameters leads to varying numbers of embedded data points that can be obtained from scalar time series. When estimating TE from neural data, we therefore recommend checking the amount of data in one analysis window that is available after embedding and designing experiments accordingly. For example, the presented MEG data set was sampled at 600 Hz, with 137 repetitions of the stimulus on average, which, after embedding, led to 8800 data points per analysis window of 150 ms. In comparison, for simulated data TE was reconstructed correctly for 10000 data points and more. Thus, in our example MEG data set, shorter analysis windows would not have been advisable because of an insufficient amount of data per analysis window for reliable TE estimation. If shorter analysis windows are necessary, they will have to be counterbalanced by a higher number of experimental repetitions.

Thus, the choice of an appropriate analysis window is crucial to guarantee reliable TE estimation, while still resolving the temporal dynamics under investigation. A further data limiting factor is the need for an appropriate embedding of the scalar time series. To embed the time series at a given point in time, enough history for this sample (embedding dimension times the embedding delay in sample points) has to be recorded. We call this epoch the *embedding window*. The need for an appropriate embedding thus constitutes another constraint on the data necessary for TE estimation. The choice of an optimal embedding dimension (e.g. through the use of Ragwitz' criterion [71]) is therefore crucial, as the use of larger than optimal embedding dimensions wastes available data and may lead to a weaker detection rate in noisy data [56].

Note that the embedding window should not be confused with the analysis window. The analysis window strictly describes the data points for which neighbor statistics enter TE estimation – where neighbor counts may be averaged over an epoch or may come from a single point in time only. The embedding window, however, describes the data points that enter the embedding of a single point in time. Thus, the temporal resolution of TE analysis may still be in single time steps (i.e. only one time point entering the analysis), even though the embedding window spans several points in time that contain the history for this single point.
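The data cost of the embedding window can be made concrete with a plain delay embedding. The following is a minimal sketch under a common convention, not TRENTOOL's implementation: the function `delay_embed` and its argument names are hypothetical. Under this convention a point needs `(dim - 1) * tau` samples of history, which is of the same order as the "dimension times delay" quoted above.

```python
def delay_embed(series, dim, tau):
    """Build delay vectors [x[t], x[t - tau], ..., x[t - (dim - 1) * tau]].

    Only points with a complete history can be embedded, so the first
    (dim - 1) * tau samples of the series yield no vector of their own.
    """
    span = (dim - 1) * tau  # samples of history needed before the first usable point
    return [
        [series[t - k * tau] for k in range(dim)]
        for t in range(span, len(series))
    ]


series = list(range(10))            # toy scalar time series
vectors = delay_embed(series, dim=3, tau=2)
print(len(series), "samples ->", len(vectors), "embedded points")
```

For the MEG example, a 150 ms window at 600 Hz holds 90 samples, so 137 repetitions provide at most 12330 raw samples per window; the 8800 points reported above are what remained after embedding.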

### Repeatability of neuronal processes

When applying the ensemble method to estimate TE from neural recordings, we treat experimental repetitions as multiple realizations of the neural processes under investigation. In doing so, we assume stationarity of these processes *over repetitions*. We claim that in most cases this assumption of stationarity is justified for processes concerned with the processing of experimental stimuli and that the assumption also holds for stimulus-independent processes that contribute to neural recordings. We will first present the different contributions to neural recordings and subsequently discuss their individual statistical properties, i.e. their stationarity over repetitions. Note that the term stationarity refers to the stability of the *probability distribution underlying* the observed realizations of contributions over repetitions and does not require individual realizations to be identical; i.e. stationarity does not preclude variability in observed realizations, but rather implies some variance in observed realizations that reflects the variance in the underlying probability distribution.

Contributions to neural recordings may either be stimulus-related (*event-related activity*) or stimulus-independent (*spontaneous ongoing activity*). Within the category of event-related activity, contributions can be further distinguished into phase-locked and non-phase-locked contributions (the latter is commonly called *induced activity*). Phase-locked activity has a fixed polarity and latency with respect to the stimulus and – on averaging over repetitions – contributes to an event-related potential or field (ERP/F). Phase-locked activity is further distinguished into two types of contributions that are discussed as mechanisms in ERP/F generation (e.g. [97]–[99]): (1) *additive evoked contributions*, i.e. neural activity that is in addition to ongoing activity and represents the stereotypical response of a neural population to the presented stimulus in each repetition [100]–[102]; (2) *phase-reset contributions*, i.e. the phase of ongoing activity is reset by the stimulus, such that phase-aligned signals no longer cancel each other out on averaging over repetitions [103]–[106]. In contrast to these two subtypes of phase-locked activity, induced activity is event-related activity that is not phase-locked to the stimulus, such that latency and polarity vary randomly over repetitions and induced activity averages out over repetitions.
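The different behavior of phase-locked and induced activity under averaging can be illustrated with a toy simulation. This is a sketch under strong simplifying assumptions: each repetition is a pure sinusoid of unit amplitude, and the frequency, sampling rate, number of repetitions and all names are hypothetical choices.

```python
import math
import random


def simulate(n_rep, phase_locked, rng, freq_hz=10.0, fs_hz=1000.0, n_samples=1000):
    """Return n_rep sinusoidal repetitions with a fixed or a random phase."""
    trials = []
    for _ in range(n_rep):
        phase = 0.0 if phase_locked else rng.uniform(0.0, 2.0 * math.pi)
        trials.append(
            [math.sin(2.0 * math.pi * freq_hz * i / fs_hz + phase)
             for i in range(n_samples)]
        )
    return trials


def average_over_repetitions(trials):
    """Sample-wise mean over repetitions (the ERP/F-style average)."""
    n_rep = len(trials)
    return [sum(column) / n_rep for column in zip(*trials)]


rng = random.Random(0)
locked_avg = average_over_repetitions(simulate(500, True, rng))
induced_avg = average_over_repetitions(simulate(500, False, rng))

locked_peak = max(abs(v) for v in locked_avg)
induced_peak = max(abs(v) for v in induced_avg)
print(f"peak of average, phase-locked: {locked_peak:.3f}")
print(f"peak of average, induced:      {induced_peak:.3f}")
```

The phase-locked average keeps the full amplitude, whereas the random-phase average shrinks towards zero as the number of repetitions grows, although every single repetition carries the same signal power.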

We therefore have to consider four types of contributions to neural recordings: (1) additive evoked contributions, (2) phase-reset contributions, (3) induced contributions and (4) spontaneous ongoing contributions, the last being stimulus-independent. Stationarity can be assumed for all these contributions if no learning effects occur during the experiment. Learning effects may lead to slow drifts, i.e. changing mean and variances, in the recorded signal. Such learning effects may easily be tested for by comparing the first and second half of recorded repetitions with respect to equal variances and means. If variances and means are equal, learning effects can most likely be excluded. Empirically, the stationarity assumption, specifically of phase-locked contributions, can also be verified using a modified independent component analysis recently proposed in [107].

To sum up the statistical properties of different contributions to neural data and their relevance for using an ensemble approach to TE estimation, we conclude that all contributions to neural recordings can be considered stationary over repetitions by default. Non-stationarity over repetitions will only be a problem in paradigms that introduce (slow) drifts or trends in the recorded signal, for example by facilitating learning during the experiment. Testing for drifts may be done by comparing mean and variance in a split-half analysis.
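The split-half check can be sketched as a permutation test. This is only one possible way to run such a comparison and is not taken from the study: the function `split_half_pvalue` is hypothetical, and it assumes that each repetition has been reduced to one summary value (for example its mean amplitude) and that repetitions are listed in recording order.

```python
import random
from statistics import fmean, pvariance


def split_half_pvalue(values, statistic, n_perm=2000, seed=0):
    """Permutation p-value for a difference in `statistic` between halves.

    `values` holds one summary value per repetition, in recording order.
    The observed difference between first and second half is compared with
    differences obtained after randomly reassigning repetitions to halves.
    """
    half = len(values) // 2
    observed = abs(statistic(values[:half]) - statistic(values[half:]))
    rng = random.Random(seed)
    shuffled = list(values)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        diff = abs(statistic(shuffled[:half]) - statistic(shuffled[half:]))
        if diff >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)


# Toy data: a recording without drift and one whose mean shifts halfway through.
stable = [0.0, 1.0] * 20
drifting = [0.0] * 20 + [10.0] * 20

print("mean, stable:      ", split_half_pvalue(stable, fmean))
print("mean, drifting:    ", split_half_pvalue(drifting, fmean))
print("variance, stable:  ", split_half_pvalue(stable, pvariance))
```

A small p-value for either the mean or the variance would point to a drift over repetitions, in which case pooling over all repetitions should be reconsidered.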

### Relation of the ensemble method to local information dynamics

We will now discuss the relation of the ensemble approach suggested here to the local transfer entropy (LTE) approach of Lizier [4], [15]. This may be useful as both approaches at first glance seem to have a similar goal, i.e. assessing information transfer more locally in time. As we will show, the approaches differ in what quantities they localize. From this difference it also follows that they can (and should be) combined when necessary.

In detail, the ensemble approach used here tries to exploit cyclostationarity or repeatability of random processes to obtain multiple PDFs from the different (quasi-) stationary parts of the repeated process cycle, or a PDF for each step in time from replications of a process, respectively. In contrast, local information dynamics localizes information transfer in time (and space) given the PDF of a *stationary* process.

The local information dynamics approach to information transfer computes information transfer for stationary random processes from their joint and marginal PDFs for each process step, thereby fully localizing information transfer in time. The quantity proposed for this purpose is the LTE [15]:(17)

LTE relates to TE in the same way Shannon information relates to Shannon entropy – by means of taking an expected value under the common PDF of the collection of random variables that form the two processes exchanging information. Stationarity here guarantees that all these random variables have a common PDF (as the PDF is not allowed to change over time):(18)
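As a point of reference, the two quantities can be written as follows in a commonly used notation. The symbols are assumptions made for this sketch and may differ in detail from those of equations 17 and 18: $y_t$ is the current value of the target, $\mathbf{y}_{t-1}^{d_y}$ and $\mathbf{x}_{t-u}^{d_x}$ are the embedded past states of target and source, and $u$ is the assumed information transfer delay.

```latex
\begin{align}
te(X \to Y, t, u) &=
  \log \frac{p\left(y_t \mid \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}\right)}
            {p\left(y_t \mid \mathbf{y}_{t-1}^{d_y}\right)} \\
TE(X \to Y, u) &=
  \sum_{y_t,\, \mathbf{y}_{t-1}^{d_y},\, \mathbf{x}_{t-u}^{d_x}}
  p\left(y_t, \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}\right)
  \log \frac{p\left(y_t \mid \mathbf{y}_{t-1}^{d_y}, \mathbf{x}_{t-u}^{d_x}\right)}
            {p\left(y_t \mid \mathbf{y}_{t-1}^{d_y}\right)}
\end{align}
```

The second line is the expectation of the first under the joint PDF, which is the sense in which LTE relates to TE as Shannon information relates to Shannon entropy.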

In contrast, the approach presented here does not assume that the random processes are stationary, but that either replications of the process can be obtained, or that the process is cyclostationary. Under these constraints a *local* PDF can be obtained. The events drawn from this PDF may then be analyzed in terms of their average information transfer, i.e. using TE as presented here, or by inspecting them individually, computing LTE for each event. In this sense, the approach suggested here is aimed at extracting the proper local PDFs, while local information dynamics comes into play once these proper PDFs have been obtained. We are certain that both approaches can be fruitfully combined in future studies.

### Relation of the ensemble method to other measures of connectivity for non-stationary data

Linear Granger causality (GC) is – as has been shown recently by [26] – equivalent to TE for variables with a *jointly* Gaussian distribution. Thus, for data that exhibit such a distribution, information transfer may be analyzed more easily within the GC framework. Similar to the ensemble method for TE estimation, extensions to GC estimation have been proposed that deal with non-stationary data by fitting time-variant parameters. For example, Möller and colleagues presented an approach that fitted multivariate autoregressive models (MVAR) with time-dependent parameters to an ensemble of EEG signals [108]. Similar measures, that fit time-dependent parameters in autoregressive models to data ensembles, were used by [109] and [110]. A different approach to dealing with non-stationarity was taken by Leistritz and colleagues [111]. These authors proposed to use self-exciting threshold autoregressive (SETAR) models to model neural time series within a GC framework. SETAR models extend traditional AR models by introducing state-dependent model parameters and allow for the modeling of transient components in the signal.

The presented methods for the estimation of time-variant linear GC may yield a computationally less expensive approach to the estimation of information transfer from an ensemble of data. However, linear GC is equivalent to TE regarding the full recovery of information transfer for data with a *jointly* Gaussian distribution only. For non-Gaussian data, linear GC may fail to capture higher order interactions. As neural data are most likely non-Gaussian, the application of TE may have an advantage for the analysis of information transfer in this type of data. The non-Gaussian nature of neural data can for example be seen, when comparing brain electrical source signals from physical inverse methods to time courses of corresponding ICA components [112]. Here, ICA components and extracted brain signals closely match. Given that ICA components are as non-Gaussian as possible (by definition of the ICA), we can infer that brain signals are very likely non-Gaussian.

We also note that a nonstationary measure of *coupling* between dynamic systems building on repetitions of time series and next-neighbor statistics was suggested by Andrzejak and colleagues [113]. The key difference of their approach to the ensemble method suggested here is that the previous states of the target time series are not taken into account explicitly in their method. Hence, their measure is not (and was not intended to be) a measure of information transfer (see [53] for details why a measure of information transfer needs to include the past state of the target time series, and [48] for the difference between measures of (causal) coupling and information transfer). In addition, their method explicitly tries to determine the *direction* of coupling between two systems. This implies that there should be a dominant direction of coupling in order to obtain meaningful results. Transfer entropy, in contrast, easily separates and quantifies both directions of information transfer related to bidirectional coupling, under some mild conditions related to entropy production in each of the two coupled systems [53].

### Relation of the ensemble method to the direct method for the calculation of mutual information of Strong and colleagues

The ensemble method proposed here shares the use of replications (or trials) with the so called ‘direct method’ of Strong and colleagues [114]. The authors introduced this method to calculate mutual information between a controlled stimulus set and neural responses. Similarities also exist in the sense that the surrogate data method for statistical evaluation used in our ensemble method builds on trial-to-trial variability, as does Strong's method (by looking at intrinsic variability versus variability driven by stimulus changes).

However, the two methods differ conceptually on two accounts: First, the quantity estimated is different – symmetric mutual information in Strong's method compared to inherently asymmetric conditional mutual information in the case of TE. Second, the method of Strong and colleagues requires a direct intervention in the source of information (i.e. the stimuli) to work, whereas TE in general is independent of such interventions. This has far-reaching consequences for the interpretation of the two measures: The intervention inherent in Strong's method places it somewhat closer to causal measures such as Ay and Polani's causal information flow [47], whereas intervention-free TE has a clear interpretation as the information transferred in relation to distributed computation [48]. As a consequence, TE may be easily applied to quantify neural information transfer from one neuron or brain area to another even under *constant* stimulus conditions. In contrast, using Strong's method inside a neural system in this way would require precisely setting the activity of the source neuron or brain area, something that may often be difficult to do.

### Application of the proposed implementation to other dependency measures

The use of ensemble pooling of observations for the estimation of time-resolved dependency measures has been proposed in a variety of frameworks. For example, Andrzejak and colleagues [113] use ensemble pooling of delay-embedded time series in combination with nearest neighbor statistics as a general approach to the estimation of arbitrary non-linear dependency measures. However, the practical application of ensemble pooling and nearest neighbor statistics, together with the necessary generation of a sufficient number of surrogate data sets (typically 1000 in neuroscience applications where correction for multiple comparisons is necessary), was always hindered by its high computational cost. Only with a GPU algorithm for nearest neighbor searches, as presented here, does the ensemble method become applicable in practice. Note that even though we use ensemble pooling and GPU search algorithms to specifically estimate TE, the presented implementation may easily be adapted to other dependency measures that are calculated from (conditional) mutual information estimated from nearest neighbor statistics.
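The way surrogate data multiply the computational load can be sketched with a trial-shuffling permutation test. This is a hedged illustration, not the TRENTOOL procedure: all function names are hypothetical, and the absolute Pearson correlation over pooled samples is a cheap stand-in for the nearest-neighbor-based estimator. Only 200 surrogates are used to keep the example fast, whereas the text above mentions typically 1000.

```python
import random


def pearson_r(x, y):
    """Plain Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pooled_dependency(source_trials, target_trials):
    """STAND-IN statistic: |r| over samples pooled across repetitions."""
    xs = [v for trial in source_trials for v in trial]
    ys = [v for trial in target_trials for v in trial]
    return abs(pearson_r(xs, ys))


def surrogate_pvalue(source_trials, target_trials, n_surrogates=200, seed=0):
    """Compare the observed statistic with values from trial-shuffled sources."""
    rng = random.Random(seed)
    observed = pooled_dependency(source_trials, target_trials)
    order = list(range(len(source_trials)))
    exceed = 0
    for _ in range(n_surrogates):
        rng.shuffle(order)  # break the pairing of source and target repetitions
        shuffled = [source_trials[i] for i in order]
        if pooled_dependency(shuffled, target_trials) >= observed:
            exceed += 1
    return (exceed + 1) / (n_surrogates + 1)


# Toy ensemble: 20 repetitions of 50 samples, target is a noisy copy of the source.
rng = random.Random(1)
source = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(20)]
target = [[s + 0.5 * rng.gauss(0.0, 1.0) for s in trial] for trial in source]

p_value = surrogate_pvalue(source, target)
print(f"surrogate p-value: {p_value:.4f}")
```

Every surrogate requires a full re-estimation of the statistic, so the cost of one estimate is multiplied by the number of surrogates; with a nearest-neighbor estimator in place of the correlation, this is where the search cost dominates.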

### Application to MEG source activity in a perceptual closure task

Application of the ensemble-based TE estimation to MEG source activities revealed a time-varying pattern of information transfer, as expected in the nonstationary setting of the visual task. While a full discussion of the revealed information transfer pattern is beyond the scope of this study, we point out individual connections transferring information that underline the validity of our results. Notable connections in the first time window transfer information from the early visual cortices (V1) to the orbitofrontal cortex (OFC) – in line with earlier findings by Bar and colleagues [92], which suggest a role of the OFC in early visual scene segmentation and gist perception. Another brain area receiving information from early visual cortex is the caudal inferior temporal gyrus (cITG) [115], an area responsible for the processing of shape-from-shading information, which is thought to be essential for perception of Mooney stimuli as they were used here. Both of these areas, OFC and cITG, at later stages of processing exchange information with the fusiform face area (FFA), which is essential for the processing of faces [116]–[118], and is thereby expected to receive information from other areas in this task. Indeed, FFA seems to be an essential hub in the task-related network investigated in this study and receives increasing amounts of incoming information transfer as the task progresses in time. This is in line with the fact that the most pronounced task-related differences in FFA activity were found at latencies 200 ms previously [86].

Our data also clearly show a great variability in information transfer patterns across subjects, which we relate to the limited amount of data per subject, rather than to true variation. Moreover, future investigations will have to show whether a more fine-grained temporal segmentation of the neural information processing in this task is possible and whether it will provide additional insights.

### Conclusion and further directions

We presented an implementation of the ensemble method for TE proposed in [55] that uses a GPU to handle the computationally most demanding aspects of the analysis. We chose an implementation that is flexible enough to scale well with different experimental designs as well as with future hardware developments. Our implementation was able to successfully reconstruct information transfer in simulated and neural data in a time-resolved fashion. Nearest neighbor searches using a GPU exhibited substantially reduced execution times. The implementation has been made available as part of the open source MATLAB toolbox TRENTOOL [56] for use with CUDA-enabled GPU devices.

We conclude that the ensemble method in its presented implementation is a suitable tool for the analysis of non-stationary neural time series, enabling this type of analysis for the first time. It may also be applicable in other fields that are concerned with the analysis of information transfer within complex dynamic systems.

## Author Contributions

Conceived and designed the experiments: MW RV MMZ PW FDP. Performed the experiments: PW MMZ MW. Analyzed the data: PW MMZ MW. Contributed reagents/materials/analysis tools: MMZ PW MW RV FDP. Wrote the paper: PW MMZ RV MW. Conceived and designed the parallel algorithm: MW MMZ. Implemented the algorithm: MMZ PW MW FDP. Designed the software used in analysis: MW RV MMZ PW FDP.

## References

- 1. Turing AM (1936) On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society 42: 230–265.
- 2. Langton CG (1990) Computation at the edge of chaos: Phase transitions and emergent computation. Physica D: Nonlinear Phenomena 42: 12–37.
- 3. Mitchell M (1998) Computation in cellular automata: A selected review. In: Gramß T, Bornholdt S, Groß M, Mitchell M, Pellizzari T, editors, Non-Standard Computation, Weinheim: Wiley-VCH Verlag GmbH & Co. KGaA. pp. 95–140.
- 4. Lizier JT (2013) The local information dynamics of distributed computation in complex systems. Springer Theses Series. Berlin/Heidelberg: Springer.
- 5. Wibral M, Lizier JT, Vögler S, Priesemann V, Galuske R (2014) Local active information storage as a tool to understand distributed neural information processing. Front Neuroinform 8: 1.
- 6. Lizier JT, Prokopenko M, Zomaya AY (2010) Information modification and particle collisions in distributed computation. Chaos 20: 037109–037109.
- 7. Lizier JT, Flecker B, Williams PL (2013) Towards a synergy-based approach to measuring information modification. arXiv preprint arXiv:1303.3440.
- 8. Williams PL, Beer RD (2010) Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515.
- 9. Bertschinger N, Rauh J, Olbrich E, Jost J (2012) Shared information – New insights and problems in decomposing information in complex systems. arXiv preprint arXiv:1210.5902.
- 10. Griffith V, Koch C (2012) Quantifying synergistic mutual information. arXiv preprint arXiv:1205.4265.
- 11. Harder M, Salge C, Polani D (2013) Bivariate measure of redundant information. Phys Rev E Stat Nonlin Soft Matter Phys 87: 012130.
- 12. Bertschinger N, Rauh J, Olbrich E, Jost J, Ay N (2014) Quantifying unique information. Entropy 16: 2161–2183.
- 13. Lizier JT, Prokopenko M, Zomaya AY (2012) Local measures of information storage in complex distributed computation. Inform Sciences 208: 39–54.
- 14. Schreiber T (2000) Measuring information transfer. Phys Rev Lett 85: 461–464.
- 15. Lizier JT, Prokopenko M, Zomaya AY (2008) Local information transfer as a spatiotemporal filter for complex systems. Phys Rev E Stat Nonlin Soft Matter Phys 77: 026110.
- 16. Lizier JT (2014) Measuring the dynamics of information processing on a local scale in time and space. In: Wibral M, Vicente R, Lizier JT, editors, Directed Information Measures in Neuroscience, Springer Berlin Heidelberg, Understanding Complex Systems. pp. 161–193.
- 17. Gómez C, Lizier JT, Schaum M, Wollstadt P, Grützner C, et al. (2014) Reduced predictable information in brain signals in autism spectrum disorder. Front Neuroinform 8: 9.
- 18. Dasgupta S, Wörgötter F, Manoonpong P (2013) Information dynamics based self-adaptive reservoir for delay temporal memory tasks. Evolving Systems: 1–15.
- 19. Vicente R, Wibral M, Lindner M, Pipa G (2011) Transfer entropy – a model-free measure of effective connectivity for the neurosciences. J Comput Neurosci 30: 45–67.
- 20. Wibral M, Rahm B, Rieder M, Lindner M, Vicente R, et al. (2011) Transfer entropy in magnetoencephalographic data: quantifying information flow in cortical and cerebellar networks. Prog Biophys Mol Biol 105: 80–97.
- 21. Paluš M (2001) Synchronization as adjustment of information rates: detection from bivariate time series. Phys Rev E Stat Nonlin Soft Matter Phys 63: 046211.
- 22. Vakorin VA, Kovacevic N, McIntosh AR (2010) Exploring transient transfer entropy based on a group-wise ICA decomposition of EEG data. Neuroimage 49: 1593–1600.
- 23. Vakorin VA, Krakovska OA, McIntosh AR (2009) Confounding effects of indirect connections on causality estimation. J Neurosci Methods 184: 152–160.
- 24. Chávez M, Martinerie J, Le Van Quyen M (2003) Statistical assessment of nonlinear causality: application to epileptic EEG signals. J Neurosci Methods 124: 113–28.
- 25. Amblard PO, Michel OJ (2011) On directed information theory and Granger causality graphs. J Comput Neurosci 30: 7–16.
- 26. Barnett L, Barrett AB, Seth AK (2009) Granger causality and transfer entropy are equivalent for Gaussian variables. Phys Rev Lett 103: 238701.
- 27. Besserve M, Scholkopf B, Logothetis NK, Panzeri S (2010) Causal relationships between frequency bands of extracellular signals in visual cortex revealed by an information theoretic analysis. J Comput Neurosci 29: 547–566.
- 28. Buehlmann A, Deco G (2010) Optimal information transfer in the cortex through synchronization. PLoS Comput Biol 6: e1000934.
- 29. Garofalo M, Nieus T, Massobrio P, Martinoia S (2009) Evaluation of the performance of information theory-based methods and cross-correlation to estimate the functional connectivity in cortical networks. PLoS One 4: e6482.
- 30. Gourevitch B, Eggermont JJ (2007) Evaluating information transfer between auditory cortical neurons. J Neurophysiol 97: 2533–2543.
- 31. Lizier JT, Heinzle J, Horstmann A, Haynes JD, Prokopenko M (2011) Multivariate information-theoretic measures reveal directed information structure and task relevant changes in fMRI connectivity. J Comput Neurosci 30: 85–107.
- 32. Lüdtke N, Logothetis NK, Panzeri S (2010) Testing methodologies for the nonlinear analysis of causal relationships in neurovascular coupling. Magn Reson Imaging 28: 1113–1119.
- 33. Neymotin SA, Jacobs KM, Fenton AA, Lytton WW (2011) Synaptic information transfer in computer models of neocortical columns. J Comput Neurosci 30: 69–84.
- 34. Sabesan S, Good LB, Tsakalis KS, Spanias A, Treiman DM, et al. (2009) Information flow and application to epileptogenic focus localization from intracranial EEG. IEEE Trans Neural Syst Rehabil Eng 17: 244–53.
- 35. Staniek M, Lehnertz K (2009) Symbolic transfer entropy: inferring directionality in biosignals. Biomed Tech (Berl) 54: 323–8.
- 36. Vakorin VA, Misic B, Krakovska O, McIntosh AR (2011) Empirical and theoretical aspects of generation and transfer of information in a neuromagnetic source network. Front Syst Neurosci 5.
- 37. Roux F, Wibral M, Singer W, Aru J, Uhlhaas PJ (2013) The phase of thalamic alpha activity modulates cortical gamma-band activity: evidence from resting-state MEG recordings. J Neurosci 33: 17827–17835.
- 38. Pampu NC, Vicente R, Muresan RC, Priesemann V, Siebenhuhner F, et al. (2013) Transfer entropy as a tool for reconstructing interaction delays in neural signals. In: Signals, Circuits and Systems (ISSCS), 2013 International Symposium on. IEEE, pp. 1–4.
- 39. Wibral M, Vicente R, Lindner M (2014) Transfer entropy in neuroscience. In: Wibral M, Vicente R, Lizier JT, editors, Directed Information Measures in Neuroscience, Springer Berlin Heidelberg, Understanding Complex Systems. pp. 3–36.
- 40. Marinazzo D, Wu G, Pellicoro M, Stramaglia S (2014) Information transfer in the brain: Insights from a unified approach. In: Wibral M, Vicente R, Lizier JT, editors, Directed Information Measures in Neuroscience, Springer Berlin Heidelberg, Understanding Complex Systems. pp. 87–110.
- 41. Faes L, Porta A (2014) Conditional entropy-based evaluation of information dynamics in physiological systems. In: Wibral M, Vicente R, Lizier JT, editors, Directed Information Measures in Neuroscience, Springer Berlin Heidelberg, Understanding Complex Systems. pp. 61–86.
- 42. Faes L, Nollo G (2006) Bivariate nonlinear prediction to quantify the strength of complex dynamical interactions in short-term cardiovascular variability. Med Biol Eng Comput 44: 383–392.
- 43. Faes L, Nollo G, Porta A (2011) Non-uniform multivariate embedding to assess the information transfer in cardiovascular and cardiorespiratory variability series. Comput Biol Med 42: 290–297.
- 44. Faes L, Nollo G, Porta A (2011) Information-based detection of nonlinear Granger causality in multivariate processes via a nonuniform embedding technique. Phys Rev E Stat Nonlin Soft Matter Phys 83: 051112.
- 45. Kwon O, Yang JS (2008) Information flow between stock indices. Europhys Lett 82: 68003.
- 46. Kim J, Kim G, An S, Kwon YK, Yoon S (2013) Entropy-based analysis and bioinformatics-inspired integration of global economic information transfer. PLoS ONE 8: e51986.
- 47. Ay N, Polani D (2008) Information flows in causal networks. Adv Complex Syst 11: 17.
- 48. Lizier JT, Prokopenko M (2010) Differentiating information transfer and causal effect. Eur Phys J B 73: 605–615.
- 49. Chicharro D, Ledberg A (2012) When two become one: the limits of causality analysis of brain dynamics. PLoS One 7: e32466.
- 50. Lizier JT, Rubinov M (2012) Multivariate construction of effective computational networks from observational data. Max Planck Institute for Mathematics in the Sciences Preprint 25/2012.
- 51. Stramaglia S, Wu GR, Pellicoro M, Marinazzo D (2012) Expanding the transfer entropy to identify information circuits in complex systems. Phys Rev E Stat Nonlin Soft Matter Phys 86: 066211.
- 52. Bettencourt LM, Stephens GJ, Ham MI, Gross GW (2007) Functional structure of cortical neuronal networks grown in vitro. Phys Rev E Stat Nonlin Soft Matter Phys 75: 021915.
- 53. Wibral M, Pampu N, Priesemann V, Siebenhühner F, Seiwert H, et al. (2013) Measuring information-transfer delays. PLoS One 8: e55809.
- 54. Wibral M, Wollstadt P, Meyer U, Pampu N, Priesemann V, et al. (2012) Revisiting Wiener's principle of causality – interaction-delay reconstruction using transfer entropy and multivariate analysis on delay-weighted graphs. Conf Proc IEEE Eng Med Biol Soc 2012: 3676–3679.
- 55. Gomez-Herrero G, Wu W, Rutanen K, Soriano M, Pipa G, et al. (2010) Assessing coupling dynamics from an ensemble of time series. arXiv preprint arXiv:1008.0539.
- 56. Lindner M, Vicente R, Priesemann V, Wibral M (2011) TRENTOOL: A MATLAB open source toolbox to analyse information flow in time series data with transfer entropy. BMC Neurosci 12: 119.
- 57. Kraskov A, Stoegbauer H, Grassberger P (2004) Estimating mutual information. Phys Rev E Stat Nonlin Soft Matter Phys 69: 066138.
- 58. Owens JD, Houston M, Luebke D, Green S, Stone JE, et al. (2008) GPU computing. Proc IEEE 96: 879–899.
- 59. Brodtkorb AR, Hagen TR, Sætra ML (2013) Graphics processing unit (GPU) programming strategies and trends in GPU computing. J Parallel Distr Com 73: 4–13.
- 60. Lee D, Dinov I, Dong B, Gutman B, Yanovsky I, et al. (2012) CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms. Comput Methods Programs Biomed 106: 175–187.
- 61. Martínez-Zarzuela M, Gómez C, Díaz-Pernas FJ, Fernández A, Hornero R (2013) Cross-approximate entropy parallel computation on GPUs for biomedical signal analysis. Application to MEG recordings. Comput Methods Programs Biomed 112: 189–199.
- 62. Konstantinidis EI, Frantzidis CA, Pappas C, Bamidis PD (2012) Real time emotion aware applications: A case study employing emotion evocative pictures and neuro-physiological sensing enhanced by graphic processor units. Comput Methods Programs Biomed 107: 16–27.
- 63. Arefin AS, Riveros C, Berretta R, Moscato P (2012) GPU-FS-kNN: A software tool for fast and scalable kNN computation using GPUs. PLoS One 7: e44000.
- 64. Wilson JA, Williams JC (2009) Massively parallel signal processing using the graphics processing unit for real-time brain-computer interface feature extraction. Front Neuroeng 2: 11.
- 65. Chen D, Wang L, Ouyang G, Li X (2011) Massively parallel neural signal processing on a manycore platform. Comput Sci Eng 13: 42–51.
- 66. Liu Y, Schmidt B, Liu W, Maskell DL (2010) CUDA-MEME: Accelerating motif discovery in biological sequences using CUDA-enabled graphics processing units. Pattern Recognit Lett 31: 2170–2177.
- 67. Merkwirth C, Parlitz U, Lauterborn W (2000) Fast nearest-neighbor searching for nonlinear signal processing. Phys Rev E Stat Nonlin Soft Matter Phys 62: 2089–2097.
- 68. Gardner WA, Napolitano A, Paura L (2006) Cyclostationarity: Half a century of research. Signal Process 86: 639–697.
- 69. Williams PL, Beer RD (2011) Generalized measures of information transfer. arXiv preprint arXiv:1102.1507.
- 70. Takens F (1981) Dynamical Systems and Turbulence, Warwick 1980, Springer, volume 898 of *Lecture Notes in Mathematics*, chapter Detecting Strange Attractors in Turbulence. pp. 366–381.
- 71. Ragwitz M, Kantz H (2002) Markov models from data by simple nonlinear time series predictors in delay embedding spaces. Phys Rev E Stat Nonlin Soft Matter Phys 65: 056201.
- 72. Kozachenko L, Leonenko N (1987) Sample estimate of entropy of a random vector. Probl Inform Transm 23: 95–100.
- 73. Victor JD (2005) Binless strategies for estimation of information from neural data. Phys Rev E Stat Nonlin Soft Matter Phys 72: 051903.
- 74. Vicente R, Wibral M (2014) Efficient estimation of information transfer. In: Wibral M, Vicente R, Lizier JT, editors, Directed Information Measures in Neuroscience, Springer Berlin Heidelberg, Understanding Complex Systems. pp. 37–58.
- 75. NVIDIA Corporation (2013). CUDA toolkit documentation. Available: http://docs.nvidia.com/cuda. Accessed 7 November 2013.
- 76. Maris E, Oostenveld R (2007) Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods 164: 177–90.
- 77. Bentley JL, Friedman JH (1979) Data structures for range searching. ACM Comput Surv 11: 397–409.
- 78. Arya S, Mount DM, Netanyahu NS, Silverman R, Wu AY (1998) An optimal algorithm for approximate nearest neighbor searching fixed dimensions. J ACM 45: 891–923.
- 79. Muja M, Lowe DG (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP International Conference on Computer Vision Theory and Applications. pp. 331–340.
- 80. Garcia V, Debreuve E, Nielsen F, Barlaud M (2010) K-nearest neighbor search: Fast GPU-based implementations and application to high-dimensional feature matching. In: Image Processing (ICIP), 2010 17th IEEE International Conference on. pp. 3757–3760.
- 81. Sismanis N, Pitsianis N, Sun X (2012) Parallel search of k-nearest neighbors with synchronous operations. In: High Performance Extreme Computing (HPEC), 2012 IEEE Conference on. pp. 1–6.
- 82. Brown S, Snoeyink J. GPU nearest neighbor searches using a minimal kd-tree. Available: http://cs.unc.edu/~shawndb. Accessed 7 November 2013.
- 83. Li S, Simons LC, Pakaravoor JB, Abbasinejad F, Owens JD, et al. (2012) kANN on the GPU with shifted sorting. In: Dachsbacher C, Munkberg J, Pantaleoni J, editors, Proceedings of the Fourth ACM SIGGRAPH/Eurographics conference on High-Performance Graphics. High Performance Graphics 2012, The Eurographics Association, pp. 39–47.
- 84. Pan J, Manocha D (2012) Bi-level locality sensitive hashing for k-nearest neighbor computation. In: Data Engineering (ICDE), 2012 IEEE 28th International Conference on. pp. 378–389. doi: 10.1109/ICDE.2012.40.
- 85. Khronos OpenCL Working Group, Munshi A (2009). The OpenCL specification version: 1.0 document revision: 48. Available: http://www.khronos.org/registry/cl/specs/opencl-1.0.pdf. Accessed 30 May 2014.
- 86. Grützner C, Uhlhaas PJ, Genc E, Kohler A, Singer W, et al. (2010) Neuroelectromagnetic correlates of perceptual closure processes. J Neurosci 30: 8342–8352.
- 87. Kraskov A (2004) Synchronization and Interdependence measures and their application to the electroencephalogram of epilepsy patients and clustering of data. Ph.D. thesis, University of Wuppertal.
- 88. Mooney CM, Ferguson GA (1951) A new closure test. Can J Psychol 5: 129–133.
- 89. Oostenveld R, Fries P, Maris E, Schoffelen JM (2011) FieldTrip: Open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011: 156869.
- 90. Gross J, Kujala J, Hamalainen M, Timmermann L, Schnitzler A, et al. (2001) Dynamic imaging of coherent sources: studying neural interactions in the human brain. Proc Natl Acad Sci U S A 98: 694–699.
- 91. Brookes MJ, Vrba J, Robinson SE, Stevenson CM, Peters AM, et al. (2008) Optimising experimental design for MEG beamformer imaging. Neuroimage 39: 1788–1802.
- 92. Bar M, Kassam KS, Ghuman AS, Boshyan J, Schmid AM, et al. (2006) Top-down facilitation of visual recognition. P Natl Acad Sci USA 103: 449–454.
- 93. Cavanagh P (1991) What's up in top-down processing. In: Gorea A, editor, Representations of vision: Trends and tacit assumptions in vision research, Cambridge University Press. pp. 295–304.
- 94. Verdes PF (2005) Assessing causality from multivariate time series. Phys Rev E Stat Nonlin Soft Matter Phys 72: 026222.
- 95. Pompe B, Runge J (2011) Momentary information transfer as a coupling measure of time series. Phys Rev E Stat Nonlin Soft Matter Phys 83: 051122.
- 96. Marschinski R, Kantz H (2002) Analysing the information flow between financial time series. Eur Phys J B 30: 275–281.
- 97. Sauseng P, Klimesch W, Gruber WR, Hanslmayr S, Freunberger R, et al. (2007) Are event-related potential components generated by phase resetting of brain oscillations? A critical discussion. Neuroscience 146: 1435–1444.
- 98. Makeig S, Debener S, Onton J, Delorme A (2004) Mining event-related brain dynamics. Trends Cogn Sci 8: 204–210.
- 99. Shah AS, Bressler SL, Knuth KH, Ding M, Mehta AD, et al. (2004) Neural dynamics and the fundamental mechanisms of event-related brain potentials. Cereb Cortex 14: 476–483.
- 100. Jervis BW, Nichols MJ, Johnson TE, Allen E, Hudson NR (1983) A fundamental investigation of the composition of auditory evoked potentials. IEEE Trans Biomed Eng: 43–50.
- 101. Mangun GR (1992) Human brain potentials evoked by visual stimuli: induced rhythms or time-locked components? In: Basar E, Bullock TH, editors, Induced rhythms in the brain, Boston, MA: Birkhauser. pp. 217–231.
- 102. Schroeder CE, Steinschneider M, Javitt DC, Tenke CE, Givre SJ, et al. (1995) Localization of ERP generators and identification of underlying neural processes. Electroen Clin Neuro Suppl 44: 55.
- 103. Sayers BM, Beagley H, Henshall W (1974) The mechanism of auditory evoked EEG responses. Nature 247: 481–483.
- 104. Makeig S, Westerfield M, Jung TP, Enghoff S, Townsend J, et al. (2002) Dynamic brain sources of visual evoked responses. Science 295: 690–694.
- 105. Jansen BH, Agarwal G, Hegde A, Boutros NN (2003) Phase synchronization of the ongoing EEG and auditory EP generation. Clin Neurophysiol 114: 79–85.
- 106. Klimesch W, Schack B, Schabus M, Doppelmayr M, Gruber W, et al. (2004) Phase-locked alpha and theta oscillations generate the P1–N1 complex and are related to memory performance. Cognitive Brain Res 19: 302–316.
- 107. Turi G, Gotthardt S, Singer W, Vuong TA, Munk M, et al. (2012) Quantifying additive evoked contributions to the event-related potential. Neuroimage 59: 2607–2624.
- 108. Möller E, Schack B, Arnold M, Witte H (2001) Instantaneous multivariate EEG coherence analysis by means of adaptive high-dimensional autoregressive models. J Neurosci Meth 105: 143–158.
- 109. Ding M, Bressler SL, Yang W, Liang H (2000) Short-window spectral analysis of cortical event-related potentials by adaptive multivariate autoregressive modeling: data preprocessing, model validation, and variability assessment. Biol Cybern 83: 35–45.
- 110. Hesse W, Möller E, Arnold M, Schack B (2003) The use of time-variant EEG Granger causality for inspecting directed interdependencies of neural assemblies. J Neurosci Meth 124: 27–44.
- 111. Leistritz L, Hesse W, Arnold M, Witte H (2006) Development of interaction measures based on adaptive non-linear time series analysis of biomedical signals. Biomed Tech 51: 64–69.
- 112. Wibral M, Turi G, Linden DEJ, Kaiser J, Bledowski C (2008) Decomposition of working memory-related scalp ERPs: crossvalidation of fMRI-constrained source analysis and ICA. Int J Psychophysiol 67: 200–211.
- 113. Andrzejak RG, Ledberg A, Deco G (2006) Detecting event-related time-dependent directional couplings. New Journal of Physics 8: 6.
- 114. Strong SP, de Ruyter van Steveninck RR, Bialek W, Koberle R (1998) On the application of information theory to neural spike trains. Pac Symp Biocomput: 621–632.
- 115. Georgieva SS, Todd JT, Peeters R, Orban GA (2008) The extraction of 3D shape from texture and shading in the human brain. Cereb Cortex 18: 2416–2438.
- 116. Kanwisher N, Tong F, Nakayama K (1998) The effect of face inversion on the human fusiform face area. Cognition 68: B1–B11.
- 117. Andrews TJ, Schluppeck D (2004) Neural responses to Mooney images reveal a modular representation of faces in human visual cortex. Neuroimage 21: 91–98.
- 118. McKeeff TJ, Tong F (2007) The timing of perceptual decisions for ambiguous face stimuli in the human ventral visual cortex. Cerebral Cortex 17: 669–678.
- 119. Faes L, Nollo G, Porta A (2013) Compensated transfer entropy as a tool for reliably estimating information transfer in physiological time series. Entropy 15: 198–219.