Parametric Copula-GP model for analyzing multidimensional neuronal and behavioral relationships

One of the main goals of current systems neuroscience is to understand how neuronal populations integrate sensory information to inform behavior. However, estimating stimulus or behavioral information that is encoded in high-dimensional neuronal populations is challenging. We propose a method based on parametric copulas which allows modeling joint distributions of neuronal and behavioral variables characterized by different statistics and timescales. To account for temporal or spatial changes in dependencies between variables, we model varying copula parameters by means of Gaussian Processes (GP). We validate the resulting Copula-GP framework on synthetic data and on neuronal and behavioral recordings obtained in awake mice. We show that the use of a parametric description of the high-dimensional dependence structure in our method provides better accuracy in mutual information estimation in higher dimensions compared to other non-parametric methods. Moreover, by quantifying the redundancy between neuronal and behavioral variables, our model exposed the location of the reward zone in an unsupervised manner (i.e., without using any explicit cues about the task structure). These results demonstrate that the Copula-GP framework is particularly useful for the analysis of complex multidimensional relationships between neuronal, sensory and behavioral variables.

Computing infrastructure We developed our framework and ran the majority of our experiments (described both in the paper and Supplemental Material) on an Ubuntu 18.04 LTS machine with 2 x Intel Xeon Gold 6142 CPU @ 2.60GHz and 1x GeForce RTX 2080 + 1 x GeForce RTX 2080 Ti GPUs. For training C-vine models, we used another Scientific Linux 7.6 machine with 1 x Intel Xeon Silver 4114 CPU @ 2.20GHz and 8 x GeForce RTX 2080 Ti GPUs.

Model selection for bivariate copulas
Synthetic data We generate artificial data by sampling from a copula mixture, parametrized in two different ways: 1. mixing concentrations of all copulas were constant and equal to 1/N (N = number of copulas), but copula parameters θ were parametrized by the phase-shifted sinus functions: where i is the index of the copula in a mixture, m = 1. For Clayton and Gumbel copulas, the absolute value of the sinus was used. The amplitudes A i were chosen to cover most of the range of parameters, except for extremely low or high θs for which all copula families become indistinguishable (from independence or deterministic dependence, respectively).
2. copula parameters θ were constant, but mixing concentrations φ were parametrized by the phase-shifted sinus functions (same as Eq. 1, with A i = B i = 1/N and m = 2). Such parametrization ensures that the sum of all mixing concentrations remains equal to one ( N i=1 φ = 1). Yet, each φ turns to zero somewhere along this trajectory, allowing us to discriminate the models and infer the correct mixture.

Identifiability tests
We tested the ability of the model selection algorithms to select the correct mixture of copula models, the same as the one from which the data was generated. We generated 5000 samples with equally spaced unique inputs on [0,1]. Both model selection algorithms were able to correctly select all of the 1-component and most of the 2-component models on simulated data. For simulated data with larger numbers of components (or 2 very similar components), the WAIC of the selected model was either lower (which is possible given a limited number of samples) or close to the WAIC of the correct parametric model. In other words, the difference between the WAIC of the correct model and of the best selected model never exceeded the WAIC test tol = 0.05, which we set up as a criteria for passing the test: ∆WAIC < WAIC test tol . Since all the tests were passed successfully, we conclude that both algorithms are capable of finding optimal or close-to-optimal solutions for copula mixtures.
A more detailed report on the model identifiability tests Tables A-E below illustrate the search for the best model. The copula model names in these tables are shortened to the first two letters, e.g. Gumbel becomes 'Gu', Frank becomes 'Fr'. The information in these Tables provides some intuition on the model selection process and the range of WAICs for the correct or incorrect models. The final selected models are shown in bold. Table A demonstrates that both greedy and heuristic algorithms can identify the correct single copula model. Some key intermediate models (M in Alg 1-2 in S1 Text) with their WAICs are listed in the table, along with the total duration of simulations (T, in minutes) on RTX 2080Ti for both algorithms. Table B shows the identification of the mixtures with 2 components, where the copula parameters θ were constant (independent of x) and mixing concentrations φ were parameterized by the phase-shifted sinus functions (Eq. 1). All of these models were correctly identified with both algorithms. The mixtures with 2 components, where the copula parameters θ varied harmonically (as in Eq. 1) but the mixing concentrations φ were constant, were harder to identify. Table C shows that a few times, each of the algorithms selected a model that was better than the true model (WAIC best − WAIC true < 0). The greedy algorithm made one mistake, yet the model it selected was very close to optimal. Such misidentification happens due to the limited number of samples in a given synthetic dataset.
Tables D-E show the model selection for 3 component models. Again, as in Tables B-C, either θ or φ was constant. Here, the model selection algorithms could rarely identify the correct model (due to overcompleteness of the mixture models), but always selected the one that was very close to optimal: WAIC best − WAIC true WAIC test tol . Note, that WAIC test tol is different from waic tol. We have set waic tol for comparison against Independent model to such a small value (10x smaller than WAIC test tol ) because we want to avoid making false assumptions about conditional independences in the model. Also note, that the WAIC of the true model depends on the particular synthetic dataset generated in each test. Therefore, the final WAIC in the left and in the right columns of Tables A-E can be slightly different (yet, right within WAIC test tol ).