Efficient "Shotgun" Inference of Neural Connectivity from Highly Sub-sampled Activity Data

Inferring connectivity in neuronal networks remains a key challenge in statistical neuroscience. The "common input" problem presents a major roadblock: it is difficult to reliably distinguish causal connections between pairs of observed neurons from correlations induced by common input from unobserved neurons. Available techniques allow us to simultaneously record, with sufficient temporal resolution, only a small fraction of the network. Consequently, naive connectivity estimators that neglect these common input effects are highly biased. This work proposes a "shotgun" experimental design, in which we observe multiple sub-networks briefly, in a serial manner. Thus, while the full network cannot be observed simultaneously at any given time, we may be able to observe much larger subsets of the network over the course of the entire experiment, thus ameliorating the common input problem. Using a generalized linear model for a spiking recurrent neural network, we develop a scalable approximate expected-loglikelihood-based Bayesian method to perform network inference given this type of data, in which only a small fraction of the network is observed in each time bin. We demonstrate in simulation that the shotgun experimental design can eliminate the biases induced by common input effects. Networks with thousands of neurons, in which only a small fraction of the neurons is observed in each time bin, can be quickly and accurately estimated, achieving orders-of-magnitude speed-ups over previous approaches.


Supporting Information - Text S1

A Likelihood Properties
We examine the profile loglikelihood in Eq. 8, divided by T for convenience. Note that this expression is a sum over N separable components corresponding to rows of the W matrix (i.e., all the inputs a single neuron receives). Each component can be maximized separately, as long as we also use similarly separable log-priors (like the L1 penalty we use; see Eq. 46 below). We exploit this to reduce the number of indices in the following derivations, by examining only the profile loglikelihood of a single row.

A.1 Gradient, Hessian and Concavity
It is straightforward to find the gradient of the approximated profile loglikelihood L_i. Clearly, ∇_{W_{i,·}} L_j = 0 if i ≠ j. Next, to obtain the Hessian, we take another derivative. Finally, we show that L_i is concave. To do this, we need to prove that for any W and any vector z ∈ R^{N×1}, z^⊤ (∇_{W_{i,·}} ∇_{W_{i,·}} L_i) z ≤ 0.
Therefore, we need to determine the sign of this quadratic form. Since (given that T is large enough) Σ^(0) is positive definite (as the expectation of Σ^(0) is a covariance matrix), we can decompose Σ^(0) = ΛΛ^⊤. Denoting u ≜ Λ^⊤ W_{i,·}^⊤ and v ≜ Λ^⊤ z, we find that the last line is non-positive: in (1) we used the Cauchy-Schwarz inequality (u^⊤ v)² ≤ (u^⊤ u)(v^⊤ v), and in (2) we used the fact that v^⊤ v ≥ 0 for any vector v. Therefore, L_i is concave in every row separately, and so jointly in all rows.
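As a numerical sanity check on this concavity argument, the following Python sketch (not part of the paper's code base; all sizes and parameter values are arbitrary) builds the Hessian of one row's Bernoulli-GLM loglikelihood on surrogate spike data and verifies that it has no positive eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 8, 500                          # toy network size and number of time bins
S = rng.binomial(1, 0.1, size=(T, N))  # surrogate spike history
w = rng.normal(0, 0.3, size=N)         # one row W_{i,.} of the connectivity matrix
b = -2.0                               # bias

U = S @ w + b                          # input current to neuron i
p = 1.0 / (1.0 + np.exp(-U))           # logistic spike probability (Eq. 1)

# Hessian of the Bernoulli loglikelihood with respect to W_{i,.}:
# H = -sum_t p_t (1 - p_t) S_t S_t^T, a negatively weighted sum of outer
# products, hence negative semi-definite.
H = -(S * (p * (1 - p))[:, None]).T @ S

eigs = np.linalg.eigvalsh(H)           # all eigenvalues should be <= 0
```

The same structure (a negatively weighted sum of outer products of past spike vectors) underlies the concavity proof above.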

A.2 Maximum Likelihood Estimator
We wish to find W that solves 0 = ∇_{W_{i,·}} max_b ln P(S|W, b).
Using Eq. 21, we obtain the stationarity condition. We define a_i so that we can write Eq. 23 as 0 = Σ^(1)_{i,·} − a_i W_{i,·} Σ^(0), which is solved by Eq. 25. Substituting this into Eq. 24, we obtain Eq. 26. In conclusion, the ML estimator of W_{i,·} is given by Eq. 25, where a_i is given by Eq. 26. This estimate would coincide with any MAP estimate when T → ∞. Note that a finite solution exists only if the last expression has a real value.

Interestingly, this ML estimate is a re-scaled version of the ML estimate in a simple linear Gaussian neuron model (in that case a_i = 0.5). Though one can also obtain a Gaussian model by assuming that the neural input is small (e.g., due to weak weights), it will not have the same scale a_i. Specifically, if the input is small we can linearize the profile loglikelihood (Eq. 8) and reduce it to a simple quadratic form for (1/T) log P(S|W, b), where, in the last line, we used the CLT approximation (Eq. 30) and also approximated the input current variance as independent of the output spikes (Eq. 31). Differentiating and equating to zero, we obtain Eq. 32 and 0 = (1/T) ∇_{W_{i,·}} log P(S|W, b), where in the last line we used Eq. 32 to simplify the result. Defining the quantities in Eq. 34, we obtain Eq. 35, where in the last line we used a matrix identity. In conclusion, in Eq. 36 we get the same ML estimate as in the case of a logistic neuron model (Eq. 25), up to a change in the proportionality constant a_i. This constant can be found by substituting Eq. 34 into Eq. 35 and solving the resulting equation.

Note, however, the following:

1. The MAP estimate can change between neuron models, due to the addition of the log prior.
2. Here, we assumed that the input current is Gaussian given the output spikes (Eq. 30), and that the input current variance is independent of the output spikes (Eq. 31), while in the logistic model we assumed only that the input current is Gaussian (without conditioning on the output spikes).
3. This generic result for the ML estimate (Eq. 35) can be explained geometrically. This weight vector is normal to the decision hyperplane (a classification rule for whether or not an output spike happened given the input spikes) that would be obtained by Linear Discriminant Analysis (LDA, [3]). This LDA decision hyperplane is the Bayes optimal decision rule given the (wrong) assumption that the input spikes are Gaussian given the output spikes, and that the covariance of the input spikes S_{·,t−1} does not depend on the output spikes S_{i,t}. Here, we make the same assumptions (Eqs. 30 and 31), but on the input current U_{i,t}, not on the input spikes S_{·,t−1}. However, nothing in our derivation would change if we used LDA's stricter assumptions instead. Therefore, we get the same solution as LDA.
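To make the structure of the closed-form estimate concrete, the following Python sketch computes sufficient statistics (m, Σ^(0), Σ^(1)) from surrogate spikes and forms the re-scaled linear-Gaussian estimate W_{i,·} = a_i^{-1} Σ^(1)_{i,·} (Σ^(0))^{-1}. The statistic normalizations and the fixed a_i = 0.5 (the Gaussian case) are illustrative assumptions, not the paper's exact Eqs. 25-26:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 6, 2000
S = rng.binomial(1, 0.2, size=(T, N)).astype(float)   # surrogate spikes

m = S.mean(axis=0)                       # mean spike rates
X = S - m                                # centered spikes
Sigma0 = X[:-1].T @ X[:-1] / (T - 1)     # lag-0 covariance, Sigma^(0)
Sigma1 = X[1:].T @ X[:-1] / (T - 1)      # lag-1 cross-covariance, Sigma^(1)

# Re-scaled linear-Gaussian form of the estimate:
#   W_hat[i, :] = (1 / a_i) * Sigma^(1)[i, :] @ inv(Sigma^(0))
a = np.full(N, 0.5)                      # a_i = 0.5 recovers the Gaussian case
W_hat = (Sigma1 / a[:, None]) @ np.linalg.inv(Sigma0)
```

In the logistic model, a_i would instead be determined self-consistently (Eq. 26) rather than fixed at 0.5.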

B Simulation details
In this section we provide extensive details behind the numerical simulations. The code is available on GitHub: https://github.com/danielso/Shotgun. All simulations were done using Matlab 2013b on an Intel i7-4500 laptop running Windows 8.1.

B.1 Network model simulation
We test the shotgun scheme on simulated spike data from a recurrent GLM-based spiking neural network. In sections 5 and 6 we model the simulated network with biologically plausible parameters from the mouse visual cortex. Here we describe those parameters and how they were chosen.
• Each time bin corresponds to 10 ms, which is compatible with the timescales of neuronal integration [4].
• In the mouse visual cortex, neuronal density is about ρ = 10^−4 µm^−3 [5]. Consequently, if we image neural tissue from a volume of size V, we should expect to record, at most, from N = ρV neurons. We assume that all neurons from the volume are indeed captured (e.g., as in [6], where 411 neurons are recorded from a 200 × 200 × 100 µm volume) and that these neurons' locations (centers of mass) are randomly distributed in the imaged volume. Specifically, to construct a network of size N, we uniformly sample N neuronal locations in a cube [0, L]^3, where L^3 = N/ρ.
• Following [7, Figure 1E], we determine the connection probability (the probability that W_{i,j} is not zero) as a decaying function of the Euclidean distance d between the two neurons, where p_0 = 0.2 and d_0 = 200 µm is the typical length scale. Empirically, this gives an average connection probability (between any two neurons) near p_conn ≈ 0.1, which is in the range estimated by [5, 7].
• All outgoing weights from a neuron have the same sign, following Dale's law. 20% of the neurons are inhibitory, while 80% are excitatory.
• The mean firing rate was set to 5 Hz (i.e., 0.05 probability of a spike in each time bin), similarly to cells receiving sensory input (for simplicity, we assume a constant signal). Specifically, since inhibitory neurons typically fire faster, the mean firing rates of the excitatory and inhibitory neurons were set to 4.5 Hz and 7.5 Hz, respectively. Firing rates were tuned by adjusting the biases b automatically using the stochastic approximation method [8]. The resulting distribution of b_i typically had a mean near −2.9 and variance near 0.16. To interpret these values, recall the logistic spiking model (Eq. 1). Near these negative values, perturbing b_i by ∆ approximately multiplies the mean spiking probability of neuron i by exp(∆), up to some saturation level starting near 0.5. This exponential increase can also be observed experimentally, in [9, Fig 2C].
• The distribution of the excitatory synaptic strengths is known to be heavy-tailed, and is typically described as log-normal over approximately 2.5 decades, as in [10] and [9]. Therefore, as in [9], we sampled the excitatory weights from a log-normal distribution in which the mean is similar to the standard deviation. The distribution of the inhibitory weights is much harder to measure experimentally, and is therefore not known, to the best of our knowledge. Thus, as in [9], the inhibitory weights were sampled from a Gaussian distribution with the same variance as the excitatory weights. Inhibitory synapses are expected to have a greater effect on the spike generation process, since inhibitory synapses tend to be located on or near the soma of the nerve cell, whereas excitatory ones are most abundant on dendrites [11]. Thus, the mean of the inhibitory weights was set to be about four times stronger than that of the excitatory weights. This value promotes an excitation-inhibition balance [12, 13], ensuring that the conditions on the weight ratios described in [12] were met. Specifically, for inhibitory-to-excitatory weights we used a mean of 1; for inhibitory-to-inhibitory weights, a mean of 0.8; for excitatory-to-excitatory weights, a mean of 0.25; and for excitatory-to-inhibitory weights, a mean of 0.2. To avoid unrealistically low or high weight values (so that we obtain a range of 2.5 decades), we introduced a lower bound at 0.01 and an upper bound at 3. Any weights crossing these bounds were re-sampled. For self-connectivity, we used W_{i,i} ∼ N(−2, 0.2) to account for the refractory period (which is typically stronger than any synapse).
• In the resulting model, the distribution of the excitatory weights (Fig S1 B) has a mode near 0.11, so most synapses' activations contribute an input which is about 30 times smaller than the "spiking threshold" (−b_i ≈ 2.9). Therefore, for most synapses, a few dozen pre-synaptic spikes are typically required in order to generate a post-synaptic spike. This is similar to what is observed experimentally in [9, Fig 2C], in which most synapses (near the 0.1 nS conductance mode of [9, Fig 1A]) increase spike probability by approximately 0.03 above chance level. This indicates that the weight magnitudes are in a physiologically reasonable range. Furthermore, the resulting model exhibits firing patterns similar to those observed in biological neural networks (Fig S1 A). Specifically, the distribution of spike correlations between neurons (Fig S1 C) and the distribution of population firing rates (Fig S1 D) are comparable to those observed in experimental data [14]. Note that we are being somewhat conservative here (assuming a worst-case scenario), because the firing rates in [14] ranged from 0.3 Hz to 4.5 Hz, while the excitatory cells in our model fire at 4.5 Hz. Therefore, at the same firing rates, our model will probably have somewhat lower correlations than realistic data (which makes inference harder in our model). For example, using a firing rate of 1 Hz instead somewhat decreases the correlation magnitudes (Fig S1 C) to a smaller range ([−0.03, 0.11], not shown).
• We used a double-serial observation scheme: the network was scanned serially using two devices in contiguous blocks with incommensurate periods, as described in section 3.5 (and Fig 1, I,J). The dwell times of the two devices were 1 sec and π sec, respectively. These specific values were chosen to allow spike inference from the calcium traces.
• The remaining parameters are set individually for each figure: N, the number of observed neurons; T, the number of observation time bins; and p_obs, the mean fraction of neurons observed in each time bin, i.e., the empirical observation probability in the shotgun scheme.
• To demonstrate robustness, in every simulation we always add another 0.2N neurons which are never observed (as in [15]). Though this number is rather arbitrary, our numerical results hold even if it is somewhat increased (not shown). However, if the number of unobserved neurons were significantly larger than the number of observed neurons, we would be required to incorporate latent variables in the model, as in [16].
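The network construction described in the bullets above can be sketched as follows. This is an illustrative Python stand-in, not the paper's Matlab code: the exponential distance profile, the log-normal parameters, and the use of clipping instead of re-sampling at the weight bounds are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 200
rho = 1e-4                   # neuronal density, neurons per um^3
p0, d0 = 0.2, 200.0          # connection-probability parameters

L = (N / rho) ** (1 / 3)     # cube side so that density matches rho
pos = rng.uniform(0, L, size=(N, 3))
d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)

# Distance-dependent connection probability; the exponential profile is an
# illustrative stand-in for the distance-decaying profile of [7, Figure 1E].
conn = rng.random((N, N)) < p0 * np.exp(-d / d0)
np.fill_diagonal(conn, False)

inhibitory = rng.random(N) < 0.2            # Dale's law: 20% inhibitory columns
# Excitatory weights: log-normal, bounded to [0.01, 3] (clipping stands in
# for the re-sampling used in the paper); inhibitory weights: Gaussian, negative.
exc = np.clip(rng.lognormal(mean=np.log(0.25), sigma=1.0, size=(N, N)), 0.01, 3.0)
inh = -np.abs(rng.normal(1.0, 0.5, size=(N, N)))
W = np.zeros((N, N))
W[conn] = np.where(inhibitory[None, :], inh, exc)[conn]      # columns = presynaptic
np.fill_diagonal(W, rng.normal(-2.0, np.sqrt(0.2), size=N))  # refractory self-weights
```

All outgoing (column) weights of a neuron share one sign, as Dale's law requires.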

B.2 Single neuron simulation
In the network simulations, we examine networks with up to N = 2048 observed neurons, which are relatively easy to simulate on a single machine. Such network sizes are often used in theoretical studies. However, in these networks, each neuron has up to a few hundred inputs with non-zero weights, which is at least an order of magnitude less than the real numbers. To verify that our results hold for a more realistic number of inputs, we also examine a simulation of a single neuron with many inputs. A single neuron can be much easier to simulate than a whole network, if we exploit the fact that our estimation method depends only on the approximate sufficient statistics (m, Σ^(k)).
Specifically, in our model an "output neuron" is observed together with N = 10626 input neurons, 968 of which have non-zero weights to the output neuron. An additional 152 inputs to the neuron (with non-zero weights) are never observed. In order to make sure that the input spikes have the correct statistics, we still need to run a network simulation. However, in order to avoid simulating the full network with N = 10626 neurons, we only simulate the spikes for the inputs with the non-zero weights, together with an additional 200 inputs that have zero weights. In total, the simulated network has only N = 1321 neurons: 1320 input neurons, which are connected to each other as in our network model (section B.1), and one output neuron connected to all the other neurons (using the same weight distribution), except the 200 zero-weight inputs.
These 200 (zero-weight) input neurons are then used to imitate an additional 9306 (zero-weight) input neurons that were not simulated. To do this, we sample (uniformly, with replacement) from the spike statistics (m, Σ^(k)) of these 200 neurons (together with the output neuron) to generate the same statistics for the zero-weight input neurons that were not simulated. Thus, with all the spike statistics (m, Σ^(k)) in hand, we estimate the connectivity matrix W.

Figure S1. (C) Distribution of spike correlations ρ between neurons, as in [14], using 20 ms timebins. The distribution of ρ is comparable to that of real spike data [14, Fig 1d], where most correlations are weak, i.e., |ρ| < 0.1. (D) Distribution of the mean population firing rates (blue circles), i.e., the fraction of the cells spiking in a time bin, divided by the time bin duration (10 msec). For comparison, we plot the same distribution after the time bins have been randomly shuffled for each neuron (red asterisks). Population "bursts" are significantly more common than chance level, similarly to real data [14, Fig 1e].
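The statistics-resampling trick described above can be sketched in a few lines. This Python illustration uses surrogate spikes and simplified (lag-1, output-only) cross-covariance entries; the variable names and sizes mirror the text, but the exact statistics (m, Σ^(k)) in the paper are richer:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 5000
n_sim, n_extra = 200, 9306          # simulated vs. imitated zero-weight inputs

# Surrogate spikes for the 200 simulated zero-weight inputs and the output neuron
S_zero = rng.binomial(1, 0.05, size=(T, n_sim)).astype(float)
s_out = rng.binomial(1, 0.05, size=T).astype(float)

m = S_zero.mean(axis=0)             # mean statistics of the simulated inputs
# Lag-1 cross-covariance of each zero-weight input with the output neuron
c = ((S_zero[:-1] - m) * (s_out[1:] - s_out.mean())[:, None]).mean(axis=0)

# Imitate the statistics of the 9306 non-simulated inputs by sampling
# uniformly, with replacement, from the 200 simulated zero-weight neurons
idx = rng.integers(0, n_sim, size=n_extra)
m_full = np.concatenate([m, m[idx]])
c_full = np.concatenate([c, c[idx]])
```

The full statistics vectors then have length 200 + 9306 = 9506, matching the intended number of zero-weight inputs.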

B.3 Fluorescence observations and spike inference -implementation details
In section 7 we add another layer of complexity and infer connectivity from fluorescence traces. Here we describe the fluorescence model and how the spikes are inferred from fluorescence. We use the same network model as in section B.1, with N = 50 observed neurons and an experiment length of T = 5.5 hours. The frame rate here is 100 Hz (i.e., the time bin is dt = 10 msec). The only difference in the network is that we decreased the mean spontaneous firing rates (to 3 Hz for excitatory neurons and 4.5 Hz for inhibitory neurons), simplifying spike inference. However, the decrease in the firing rate also effectively reduces the amount of available data and, consequently, the quality of the connectivity inference (see Fig 7).
We assume that the observed fluorescence traces are a noisy, low-pass filtered version of the spikes (similarly to [17] and references therein), i.e., for each neuron i,

C_{t,i} = γ C_{t−1,i} + S_{t,i} ,    Y_{t,i} = C_{t,i} + ε_{t,i} ,

where C_{t,i} is the internal calcium concentration at time bin t, Y_{t,i} is the observed fluorescence, ε_{t,i} is Gaussian white observation noise with zero mean and standard deviation 0.2 or 0.4 (these values were chosen to obtain typically noisy traces in Fig 10A), and γ = 0.97, which determines the time constant. For a time bin of dt = 10 ms, as we use here, the timescale of fluorescence decay is τ = dt/(1 − γ) ≈ 333 msec. This is similar to the decay timescale of τ = 285.4 msec of the GCaMP6f calcium fluorescence indicator, which we estimated from publicly available data [18] using the code in [17].
From the fluorescence traces, we infer the underlying spikes using a simple greedy "peeling" algorithm, similar to [19]. Briefly, we first convolve the trace with a low-pass filter matched to the filter of the slow calcium dynamics (Eq. 38). The filter parameters are estimated from the simulated fluorescence traces using the code in [17]. Then we iteratively repeat the following steps: (1) identify a spike as the maximal peak of this trace; (2) subtract from the trace the identified spike, convolved with the matched filter. We stop this process when there are no more peaks that, when subtracted, reduce the overall variance.
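The generative model and the peeling loop can be sketched as below. This is a minimal Python illustration, not the paper's Matlab implementation: the kernel length, iteration cap, and integer spike amplitudes are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
T, gamma, sigma = 2000, 0.97, 0.2
S = rng.binomial(1, 0.05, size=T).astype(float)        # spikes (dt = 10 ms bins)

# AR(1) calcium dynamics with additive Gaussian observation noise (Eq. 38)
C = np.zeros(T)
for t in range(1, T):
    C[t] = gamma * C[t - 1] + S[t]
Y = C + sigma * rng.standard_normal(T)

# Greedy "peeling": matched filtering, then iterative subtraction of spikes
K = 50
kernel = gamma ** np.arange(K)                         # calcium impulse response
resid = Y.copy()
spikes_hat = np.zeros(T)
for _ in range(1000):                                  # safety cap on iterations
    score = np.correlate(resid, kernel, mode='valid')  # matched-filter score per onset
    t_peak = int(np.argmax(score))
    peel = np.zeros(T)
    peel[t_peak:t_peak + K] = kernel
    if np.sum((resid - peel) ** 2) >= np.sum(resid ** 2):
        break                                          # no variance reduction: stop
    resid -= peel                                      # "peel off" the identified spike
    spikes_hat[t_peak] += 1
```

Each accepted subtraction strictly reduces the residual variance, so the loop terminates once only noise-level peaks remain.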

B.4 Quality Measures
We use four main measures to assess the quality of the estimated matrix Ŵ in comparison to the "ground truth" matrix W. First, we define the normalized quantities entering the measures below. The measures we use are:

• The square root of the coefficient of determination (R²)
• Correlation
• Zero matching
• Sign matching

High values indicate a high quality of estimation, with 1 indicating zero error. To simplify the presentation scale, if a measure becomes negative or imaginary, we set it to zero. The sign and zero matching measures depend on the choice of the regularization parameter λ in the L1 penalty, which was adjusted to obtain the expected level of sparsity. Also, the sign measure captures only those connections in which both the truth and the estimate are non-zero, the number of which can be arbitrarily low.
To obtain a clear understanding of the specificity-sensitivity tradeoff in our estimation procedure, for each type of weight (excitatory/inhibitory, denoted by ±), we additionally define the true positive rate (sensitivity) and the false positive rate (one minus the specificity). Using these measures on the inferred weights, we draw a Receiver Operating Characteristic (ROC) curve, with one point for each value of λ. The Area Under the Curve (AUC) of the ROC is a popular measure of the quality of the estimator.
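Since the measure definitions did not survive extraction here, the following Python sketch gives plausible implementations of the four measures; the exact normalizations in the paper may differ.

```python
import numpy as np

def quality_measures(W, W_hat):
    """Hedged implementations of the four quality measures.
    Returns (R, correlation, zero matching, sign matching), each in [0, 1]."""
    w, wh = W.ravel(), W_hat.ravel()
    # Square root of the coefficient of determination, clipped at zero so the
    # result is never negative or imaginary
    r2 = 1.0 - np.sum((w - wh) ** 2) / np.sum((w - w.mean()) ** 2)
    R = np.sqrt(max(r2, 0.0))
    C = max(np.corrcoef(w, wh)[0, 1], 0.0)         # correlation
    Z = np.mean((w == 0) == (wh == 0))             # zero matching
    both = (w != 0) & (wh != 0)                    # sign matching: only entries
    Sg = np.mean(np.sign(w[both]) == np.sign(wh[both])) if both.any() else 0.0
    return R, C, Z, Sg
```

A perfect estimate yields 1 on all four measures, e.g. `quality_measures(W, W)` returns (1, 1, 1, 1) for any W with at least one non-zero entry.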

C Sparsity inducing L1 prior
As most neurons are not connected, most of the off-diagonal terms in W are equal to zero, making W sparse. Taking this prior knowledge into account, we incorporate it into our estimates by placing a Laplace prior on the weights, effectively adding an L1-norm penalty, for some set of sparsity parameters λ_{i,j}. However, the diagonal elements W_{i,i} of the connectivity matrix are typically negative, corresponding to the cell's own (refractory) post-spike effects. Therefore, we set the diagonal penalties accordingly, where λ > 0. This prior has a number of advantages: the resulting log-posterior of W (given the full spike train S and the other system parameters) is a concave function of W, and the maximizer of this posterior, Ŵ_MAP, is often sparse, i.e., many values of Ŵ_MAP are zero. Plugging this prior into the posterior (Eq. 4), together with the simplified profile loglikelihood (Eq. 8), we obtain a LASSO-type problem for W [20]. By solving this objective we obtain a sparse Maximum A Posteriori (MAP) estimate Ŵ_MAP.
There are many algorithms that can be used to solve such an objective. Usually, the most efficient [21] is the FISTA algorithm [22]; details are given in section C.1. In order to set the value of λ, we assume some rough prior knowledge (e.g., statistics from anatomical data) about the sparsity of W, i.e., the average number of connections of each neuron in the population. Using this knowledge, it is straightforward to set the value of λ using an exponential search algorithm, as explained in section C.2.
C.1 The FISTA algorithm
Algorithm 1 (FISTA): initialize, then repeat: for k ≥ 1, compute a proximal-gradient (soft-thresholding) step on the current iterate, followed by a Nesterov momentum extrapolation step, until convergence.
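A generic FISTA loop (Beck and Teboulle [22]) for a LASSO-type objective can be sketched as follows; this Python illustration uses a toy quadratic in place of the negative profile loglikelihood, and is not the paper's exact Algorithm 1.

```python
import numpy as np

def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1 (elementwise soft-thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_lasso(grad, lam, x0, lipschitz, n_iter=200):
    """FISTA for min_x f(x) + lam * ||x||_1, given grad(x) = grad f(x)."""
    x = x0.copy()
    y = x0.copy()
    t = 1.0
    for _ in range(n_iter):
        x_new = soft_threshold(y - grad(y) / lipschitz, lam / lipschitz)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x

# Toy smooth part: f(x) = 0.5 * ||A x - b||^2
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 0.05])
L = np.linalg.eigvalsh(A.T @ A).max()       # Lipschitz constant of the gradient
x_hat = fista_lasso(lambda x: A.T @ (A @ x - b), 0.5, np.zeros(2), L)
```

Note how the soft-thresholding step sets small coordinates exactly to zero, which is what produces the sparse MAP estimate.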

C.2 Setting λ using the sparsity constraint
As explained in section C, we use a sparsity-promoting prior (Eqs. 46-47), which depends on a regularization constant λ, to generate an estimate Ŵ(λ) of the connectivity matrix. Though this constant is unknown in advance, we can set it using the sparsity level of Ŵ(λ), which we aim to make approximately equal to some target sparsity level θ. This is done using a fast exponential search algorithm (a generalization of the binary search algorithm for unbounded lists; see Algorithm 2), which exploits the fact that spar(λ) is non-increasing in λ. This monotonic behavior can be observed from the fixed point of the FISTA algorithm: as we increase λ, the max operation on the right-hand side produces zero for more and more components. Therefore, the number of zero components is clearly non-decreasing with λ.
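The exponential search over λ can be sketched as below. This Python illustration treats spar(λ), here taken to be the fraction of non-zero weights (non-increasing in λ, consistent with the monotonicity argument above), as a black-box toy curve rather than an actual LASSO solve:

```python
import numpy as np

def exponential_search(spar, theta, lam0=1e-3, tol=0.005, max_iter=100):
    """Find lam such that spar(lam) is within tol of the target theta,
    exploiting that spar (fraction of non-zeros) is non-increasing in lam."""
    lo, hi = lam0, lam0
    while spar(hi) > theta:          # exponential growth phase: bracket theta
        lo, hi = hi, 2.0 * hi
    for _ in range(max_iter):        # then plain bisection inside [lo, hi]
        mid = 0.5 * (lo + hi)
        if abs(spar(mid) - theta) < tol:
            return mid
        if spar(mid) > theta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Toy monotone curve standing in for spar(lambda) of the LASSO solution
spar = lambda lam: float(np.exp(-lam))
lam_star = exponential_search(spar, 0.2)
```

In practice each evaluation of spar(λ) requires one FISTA solve, so the logarithmic number of evaluations is what makes this search cheap.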

C.3 Lipschitz constant
In the FISTA algorithm (Algorithm 1), we are required to calculate a Lipschitz constant of ∇L(W). A Lipschitz constant l of a function f is defined through the bound ‖f(x) − f(y)‖ ≤ l ‖x − y‖. We obtain that, in our case, Eq. 49 gives a Lipschitz constant, where λ_max[X] is the maximal eigenvalue of X. Begin by observing the bound for each row of the profile loglikelihood (Eq. 20), where in (1) we used the fact that Σ^(0) is positive semi-definite, and in (2) we used the definition of the maximal eigenvalue. If we want a single Lipschitz constant for all the rows, then we simply maximize over all the rows and obtain Eq. 49.

D The integral approximation
Recall that

D.1 Accuracy of the approximation
Our derivation of the simplified profile loglikelihood in section 2.1 used an integral approximation from [23] (Eq. 14), which we write here again for convenience:

∫_{−∞}^{∞} log(1 + e^x) N(x|µ, σ²) dx ≈ √(1 + πσ²/8) log(1 + exp(µ/√(1 + πσ²/8))) .   (50)

In Fig S2, left, we see that the approximation becomes relatively less accurate at low values of µ. Recall that in our derivations µ was the empirical average of the input current to the neuron. Therefore, for low firing rates, the weights estimated using the simplified profile loglikelihood will become inaccurate. This can be partially corrected by adjusting the amplitudes of the weights after the estimation (section D.3). However, in section D.2 we present a more accurate (and less heuristic) approach, which approximates the gradient of the loglikelihood (instead of the likelihood itself). The advantage of this approach is that instead of Eq. 50, we use

∫_{−∞}^{∞} (1 + e^{−x})^{−1} N(x|µ, σ²) dx ≈ (1 + exp(−µ/√(1 + πσ²/8)))^{−1} ,   (51)

which is a more accurate approximation (Fig S2, right) that also appears in [23]. To derive this equation, simply differentiate Eq. 50 by µ after changing the integration variable to t = x − µ (and then changing back). Note that [23] also reported better results using the expected gradient (calculated using Eq. 51) instead of the expected loglikelihood (calculated using Eq. 50), but it was not completely clear why that is the case.
For the parameter ranges we tested in our simulations, the approximation in Eq. 51 worked quite well. However, if µ is sufficiently low, then even the approximation in Eq. 51 becomes too inaccurate. In that case, one can find more accurate ways to approximate either of these integrals. For example, we can add higher-order corrections or divide the integral into regions; calculate the integral numerically (e.g., using Gaussian quadratures or sampling); or save the output of the integral, for different values of µ and σ, in a 2D look-up table. However, such methods typically slow down the algorithm. Another option is to replace the logistic neuron model (Eq. 1) with a more tractable one, in which the Gaussian integral of the expected loglikelihood (or its gradient) can be calculated exactly.

Figure S2. Visualization of the approximation in Eq. 50 (left) and Eq. 51 (right). For different values of µ and σ, we plot the ratio of the left-hand side (calculated by sampling) divided by the right-hand side. The approximation on the right is more accurate, since the ratio is usually closer to one, over a larger range.
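The accuracy of the expected-loglikelihood approximation (Eq. 50) can be checked directly by Monte Carlo, mirroring the ratio plotted in Fig S2. A minimal Python sketch, with arbitrarily chosen (µ, σ):

```python
import numpy as np

rng = np.random.default_rng(5)

def lhs(mu, sigma, n=200_000):
    """Monte Carlo estimate of E[log(1 + e^x)] for x ~ N(mu, sigma^2)."""
    x = rng.normal(mu, sigma, size=n)
    return np.logaddexp(0.0, x).mean()   # logaddexp(0, x) = log(1 + e^x), stably

def rhs(mu, sigma):
    """Closed-form approximation of Eq. 50."""
    s = np.sqrt(1.0 + np.pi * sigma ** 2 / 8.0)
    return s * np.log1p(np.exp(mu / s))

ratio = lhs(1.0, 2.0) / rhs(1.0, 2.0)    # should be close to 1
```

Sweeping µ and σ over a grid of values reproduces the qualitative picture of Fig S2: the ratio drifts away from one as µ becomes very negative.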

D.2 Gradient adjustment
Next, we calculate the gradient of the exact loglikelihood and then approximate this gradient using approximations similar to those in section 2.1. In all derivations we again assume G = 0. The end results (derived in section D.2.1) are Eqs. 52-53, with the quantities defined in Eqs. 54-57. A few comments:

1. Note that the bias gradient (Eq. 52) is identical to the bias gradient of the approximate loglikelihood, since if the gradient of the bias is equated to zero, we just obtain Eq. 15 again. However, the gradient of the weights is not equal to the corresponding gradient of the approximate loglikelihood.
2. We can plug the maximizer of b into the gradient of W (Eq. 53), similarly to what we did to obtain the profile loglikelihood, but the result does not simplify as nicely as before.
3. Using these expressions we again calculate analytically the adjusted versions of the Hessian (not shown) and the maximum likelihood (section D.2.2).

4. The computational efficiency of calculating the adjusted gradients is O(N³), similar to that of the unadjusted gradients.

5. When using these algorithms in the context of the FISTA algorithm (section C.1), we heuristically use the same Lipschitz constant as before, multiplied by 100 (as a "safety margin", since this is only an approximation of the actual constant).

D.2.1 Derivation of adjusted gradient
Recall the original definitions. The bias gradient (Eq. 52) is then obtained using the following:

1. The central limit approximation [24].

2. Eqs. 54-55, where µ_i = ⟨U_{i,t}⟩_T and σ²_i = Var_T(U_{i,t}), and the approximation in Eq. 51.

Next, we calculate the weights gradient (Eq. 53), where we used the following:

1. We defined ⟨X_t | Y_t = y⟩_T as the conditional empirical mean of X_t given that Y_t = y, and used the fact that ⟨S_{j,t−1}⟩_T is also the empirical frequency of the event S_{j,t−1} = 1.

2. The central limit theorem [24], where we defined Var_T(X_t | Y_t = y) as the conditional empirical variance.
3. The approximation in Eq. 51, and Eqs. 56-57, where we defined the required conditional statistics. These last two equations give Eqs. 56-57, where in (1) we made a purely heuristic approximation in which we calculate the conditional covariance Cov_T(S_{r,t−1}, S_{k,t−1} | S_{j,t−1} = 1) as if S were Gaussian. As S is far from Gaussian, this approximation is inaccurate. However, using the accurate result would require the third-order moments of the spikes, which would increase the computational complexity. Empirically, this seems unnecessary, since in our simulations the performance of the algorithm hardly changes if we simply ignore any dependence of the covariance on S_{i,t−1} and approximate it accordingly. As a middle ground between accuracy and computational complexity, we used the Gaussian approximation. This is a reasonable approximation, since the resulting covariance is never negative, and equals zero if k = j or r = j, similar to Eq. 58. Note also that the expression we used for ⟨S_{k,t−1} | S_{j,t−1} = 1⟩_T is the same as the one we would have obtained if we assumed that S was Gaussian. However, it is possible that under some circumstances (e.g., highly synchronous firing modes), the accurate expression (Eq. 58) will be required.

D.2.2 Maximum Likelihood
In this section, we analytically calculate the maximum likelihood solution after the gradient adjustment. The resulting solution (Eq. 73 below, with constants defined in Eqs. 67-69, 71 and 72) differs from the solution before the gradient adjustment (Eq. 25), as expected.
To find the maximum likelihood solution, we require that both gradients vanish (Eqs. 59 and 60). The solution of Eq. 59 has already been calculated, and is given by Eq. 15, where f^{−1}(x) = ln(x/(1 − x)). Using Eqs. 53-57, we obtain the solution of Eq. 60. Substituting Eq. 61 into this equation, and denoting the quantities c_i and A_{i,j}, we obtain Eq. 63. With further notation, Eq. 63 simplifies to

or, in vector notation
Substituting this into the definitions of c_i and A_{i,j}, then shifting terms and squaring Eq. 66, we obtain a quadratic equation. Defining its coefficients (Eqs. 67-69), the quadratic equation has the solution in Eq. 70. We chose the root of the quadratic equation (±) by requiring that the sign of D_{i,j} be equal to the sign of Σ^(1)_{i,j} (though this choice is somewhat heuristic). Substituting this into Eq. 65, we obtain Eq. 71, and substituting that into Eq. 70, we obtain Eq. 72. Therefore, the maximum likelihood estimate of the weights is given by Eq. 73, where A can be calculated using Eqs. 67-69, 71 and 72.

D.3 Alternative method -amplitude adjustment
In some cases, we might want to use the original gradient in Eq. 21, rather than the adjusted gradient. This tends to cause some inaccuracy in our estimates, mostly in the weight gains and biases. To partially correct for this error, we can re-estimate the gains and biases in the following way. Suppose we obtain a MAP estimate Ŵ using the profile likelihood (Eq. 8). Next, we examine again the original likelihood (without any approximation). We assume that the MAP estimate is accurate up to a scaling constant in each row, so W_{i,j} = a_i Ŵ_{i,j}, and we obtain Eq. 74, where in the last line we used the expected loglikelihood approximation with the CLT again, and used the notation of Eq. 76 (recall Eqs. 9-10). Sampling z_{i,t} from Eq. 76 (we found that about 10³ samples were usually enough), we can calculate the expectation in Eq. 74. We can then maximize the likelihood in Eq. 74 for each gain a_i and bias b_i separately, ∀i, by solving an easy 2D unconstrained optimization problem. The new gains can then be used to adjust our estimate of W.

E Alternative inference methods -unobserved spikes as latent variables
Conceptually, having missing spike observations in the data should make it much harder to estimate W. This is because, in this case, we need to modify the posterior (Eq. 4) and replace the complete likelihood P(S|W, b) with the incomplete likelihood P(S_obs|W, b), where S_obs denotes the set of all observed spikes from S. However, in order to obtain P(S_obs|W, b) from P(S|W, b), we need to perform an intractable marginalization over an exponential number of unobserved spike configurations. The method we derived here sidesteps this problem. We infer the connectivity matrix W using a MAP estimate, derived using an approximation of the complete loglikelihood ln P(S|W, b) (Eq. 4). Though the complete loglikelihood includes all the spikes (even the unobserved ones), in its approximated form we can also handle missing observations. This is done by simply ignoring missing observations and renormalizing the sufficient statistics (Eqs. 16-17) of the approximated loglikelihood (Eq. 8) accordingly. This procedure does not affect the asymptotic value of the sufficient statistics, since the observations and spikes are uncorrelated.
In contrast, classical inference methods [25] treat the unobserved spikes as latent variables, and attempt to infer them using one of the following standard approaches:

• Markov chain Monte Carlo (MCMC) - which samples the latent variables and weights from the complete posterior.
• Expectation Maximization (EM) -which provides an estimator that locally optimizes the posterior of W given the observed data.
• Variational Bayes (VB) -which approximates the complete posterior using a factorized form.
We experimented with all these approaches on the posterior, without using any of the simplifying approximations used in our main method (section 2). In order to clarify the motivation for our main method, we briefly describe these alternative approaches below, including their advantages and disadvantages.

E.1 Markov chain Monte Carlo (MCMC)
We assume a sparsity-promoting spike-and-slab prior on the weights (here, and everywhere in this section, δ(x) is Dirac's delta):

P(W_{i,j}) = (1 − p_0) δ(W_{i,j}) + p_0 N(W_{i,j} | µ_0, σ²_0) ,

i.e., each weight is either zero, with probability 1 − p_0, or normally distributed with mean µ_0 and variance σ²_0, with probability p_0. Note that the standard L1 prior on W (Eq. 46) would not promote sparsity in this MCMC setting, since then every sample of W would be non-zero with probability 1. We use a Gibbs approach to sample jointly from P(S, W|S_obs, b): first we sample S given the observed data and the current sample of W (section E.1.1), and then sample W given S (section E.1.2). In general, computational speed is the main disadvantage of the MCMC method, both for sampling the spikes and for sampling the weights. For large networks, this approach can only be used if the sampling scheme is completely parallelized over many processors. However, MCMC performs rather well on small networks, similarly to our main method (section 2). An important advantage of the MCMC method over our main method is that it can be used even if some "neuron pairs" are never observed, i.e., if ⟨O_{i,t} O_{j,t−k}⟩_T = 0 for some i, j and k = 0 or 1. Therefore, the MCMC method might be used to complement our main method to infer the weights in this case. It should also be noted that the MCMC approach offers somewhat simpler methods for hyperparameter selection (via sampling from the hyperparameter posterior), and easier incorporation into larger hierarchical models that can include richer prior information about the network and enable proper sharing of information across networks; these are important directions for future research.

E.1.1 Sampling the spikes
In the first step we draw one sample from the posterior of S given the observed data and an estimated W, using the Metropolized Gibbs sampler [26,27]. This sampler is quite simple and is highly parallelizable in this setting, because the graphical model representing the posterior p(S|S obs, W, b) is local: S_{·,t} is conditionally independent of S_{·,t+2} given S_{·,t+1}, since spiking at time t only directly affects spiking in the next time step (i.e., model 1-2 can be considered a Markov chain in S_{·,t}). Therefore we can alternately sample the spiking vectors S_{·,t} at all odd times t completely in parallel, and then do the same for the even times. In more general GLM models, where the neuronal input depends on the previous k spike vectors, i.e., {S_{·,t−l}}_{l=1}^{k}, the algorithm could be executed in k + 1 serial stages of T/(k + 1) parallel computations each. We note that more sophisticated sampling methods have been developed for this type of problem [28]; these specialized methods are significantly more efficient if implemented serially, but do not parallelize as well as the simple Gibbs-based approach we use here. The performance of the sampler is exemplified in Fig S3, bottom. As can be seen in this figure, usually (unless connectivity weights are very high) the spike sampler can predict the spikes only in a "small neighborhood" of the visible spikes.
We denote by S_{/(i,t)} all the components of S except S_{i,t}. In order to do Gibbs sampling, we need to calculate
\[
\ln P\left(S_{i,t} \mid S_{/(i,t)}, W, b\right) = \ln P\left(S_{i,t} \mid S_{\cdot,t-1}, W, b\right) + \sum_{j} \ln P\left(S_{j,t+1} \mid S_{\cdot,t}, W, b\right) + \mathrm{const},
\]
where we can neglect any additive constant that does not depend on S_{i,t}. On the right hand side, the first term is the single-timestep GLM likelihood,
\[
\ln P\left(S_{i,t} \mid S_{\cdot,t-1}, W, b\right) = S_{i,t} \ln f\left(U_{i,t}\right) + \left(1 - S_{i,t}\right) \ln\left(1 - f\left(U_{i,t}\right)\right), \qquad U_{i,t} = \sum_{j} W_{i,j} S_{j,t-1} + b_{i}.
\]
Therefore, we can sample the spikes from
\[
P\left(S_{i,t} = 1 \mid S_{/(i,t)}, W, b\right) = \frac{1}{1 + e^{-\eta_{i,t}}},
\]
where \eta_{i,t} is the log-odds obtained by collecting all of the terms above that depend on S_{i,t}. Note that, for a given i, \eta_{i,t} depends only on spikes from times t − 1 and t + 1. Therefore, S_{i,t} samples generated at odd times t are independent of samples S_{i,t'} generated at even times t', so we can sample S_{i,t} simultaneously for all odd times t, and then simultaneously for all even times t'. Such a simple block-wise Gibbs sampling scheme can be further accelerated using the Metropolized Gibbs method [26], in which we propose a "flip" of our previous sample: if S_{i,t} is our previous sample and S'_{i,t} is our new sample, we propose S'_{i,t} = 1 − S_{i,t} and then accept this proposal with probability
\[
\min\left(1, \frac{P\left(S'_{i,t} \mid S_{/(i,t)}, W, b\right)}{P\left(S_{i,t} \mid S_{/(i,t)}, W, b\right)}\right).
\]
If the proposal is not accepted, we keep our previous sample S_{i,t}.
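The Metropolized Gibbs flip update for a single binary variable can be sketched as follows. Here `p1` stands in for the conditional spike probability P(S_{i,t} = 1|S_{/(i,t)}, W, b), which in practice would be computed from the model; the function name is ours:

```python
import numpy as np

def metropolized_gibbs_flip(s, p1, rng):
    """One Metropolized-Gibbs update of a binary variable s in {0, 1},
    given p1 = P(s = 1 | everything else).  The proposal is always the
    flip s' = 1 - s, accepted with probability min(1, P(s') / P(s)).
    For binary variables this update dominates ordinary Gibbs sampling
    (it flips more often, reducing autocorrelation)."""
    p = (1.0 - p1, p1)                 # conditional probabilities of 0 and 1
    s_new = 1 - s                      # deterministic flip proposal
    accept = min(1.0, p[s_new] / max(p[s], 1e-300))
    return s_new if rng.random() < accept else s
```

Because the flip is always proposed (rather than resampling the current value with probability p[s], as ordinary Gibbs does), the chain moves at least as often per sweep.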

E.1.2 Sampling the weights
In the second step, we sample W given S. We first note that, with the spike-and-slab prior, the posterior again factorizes over the rows i:
\[
P\left(W \mid S, b\right) = \prod_{i} P\left(W_{i,\cdot} \mid S, b\right).
\]
In order to do Gibbs sampling, we need to calculate
\[
\ln P\left(W_{i,j} \mid W_{/(i,j)}, S, b\right) = \sum_{t} \ln P\left(S_{i,t+1} \mid S_{\cdot,t}, W, b\right) + \ln P\left(W_{i,j}\right) + \mathrm{const},
\]
where, as before, we can neglect on the right hand side any additive constant that does not depend on W_{i,j}. The first term on the right hand side can be Taylor-expanded to second order around W_{i,j} = 0, assuming that a single weight is typically small, i.e., W_{i,j} ≪ 1. Therefore, denoting by a_{i,j} the linear coefficient and by λ_{i,j} the curvature of this expansion, we can write
\[
\sum_{t} \ln P\left(S_{i,t+1} \mid S_{\cdot,t}, W, b\right) \approx a_{i,j} W_{i,j} - \frac{1}{2} \lambda_{i,j} W_{i,j}^{2} + \mathrm{const}.
\]
Assuming a spike-and-slab prior (Eq. 77), we can use standard Gaussian completion, re-normalizing to obtain a proper spike-and-slab posterior. This gives
\[
P\left(W_{i,j} \mid W_{/(i,j)}, S, b\right) = \left(1 - \tilde{p}_{i,j}\right) \delta\left(W_{i,j}\right) + \tilde{p}_{i,j}\, \mathcal{N}\left(W_{i,j};\, \tilde{\mu}_{i,j}, \tilde{\sigma}_{i,j}^{2}\right),
\]
with
\[
\tilde{\sigma}_{i,j}^{2} = \left(\lambda_{i,j} + \sigma_{0}^{-2}\right)^{-1}, \qquad \tilde{\mu}_{i,j} = \tilde{\sigma}_{i,j}^{2} \left(a_{i,j} + \mu_{0} \sigma_{0}^{-2}\right),
\]
and \tilde{p}_{i,j} obtained by normalizing the relative masses of the spike and the slab. We can then proceed and sample W_{·,j} from this spike-and-slab distribution (Eq. 82), sampling W_{i,j} simultaneously for all i. However, since we used an approximation that assumes the weights are weak, this sampling is not exact; even if the approximation is very good, we cannot use direct sampling, or the error will accumulate catastrophically over time. To correct this, we use this approximation as a proposal distribution in a Metropolis-Hastings scheme [27]. Alternatively, if we do not want to assume weak weights, we can replace the Taylor expansion around zero (Eq. 80) with a Taylor expansion around the mode of W_{i,j}. Such a Laplace approximation approach, similar to the approach discussed in [29], is slower, yet can be more accurate (the acceptance rate was close to one in simulations), especially when the weights are not necessarily small.

E.2 Expectation Maximization (EM)
The EM approach is similar to the methods discussed in [30,31]. The E step requires the computation of an integral over p(S|S obs, W). Since this integral is high-dimensional and not analytically tractable, we again resort to MCMC methods to sample the spikes, using the same sampler as before. We found that in most cases it was not necessary to take many samples from the posterior; for large enough network sizes N (and for correspondingly long experimental times T), S is large enough that a single sample contains enough information to adequately estimate the necessary sufficient statistics in the E step, after an initial burn-in period (30 steps was typically enough). See [32] for further discussion. In the M step we perform maximization over the log-posterior averaged over p(S|S obs, W), with an L1 prior (Eq. 46). This can be done using FISTA (section C.1), or any other tool for L1-penalized GLM estimation, as in [30,31]. The EM approach is still rather slow, since we need to sample the spikes. Moreover, the EM approach typically suffered from shrinkage (a decreased magnitude of the non-zero weights) and exhibited worse performance than the MCMC approach (data not shown).
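To illustrate why a single posterior sample can suffice in the E step, here is a toy stochastic-EM (single-sample Monte-Carlo EM) example on a Gaussian mean with missing data. This is entirely our own illustration, not the paper's spiking model:

```python
import numpy as np

def stochastic_em(y_obs, n_miss, n_iters=300, rng=None):
    """Toy stochastic EM: model x_i ~ N(theta, 1), with n_miss of the
    x_i unobserved.  E step: impute the missing x_i with ONE draw from
    their posterior N(theta, 1).  M step: maximize the completed-data
    loglikelihood, which here is just the sample mean.  The iterates
    fluctuate around the fixed point theta* = mean(y_obs), with noise
    that shrinks as the data size grows."""
    rng = np.random.default_rng(rng)
    theta = 0.0
    for _ in range(n_iters):
        x_miss = rng.normal(theta, 1.0, n_miss)          # E step: one sample
        theta = np.concatenate([y_obs, x_miss]).mean()   # M step
    return theta
```

With many observed points, the single-sample noise in the E step averages out across the dataset, mirroring the situation in the text where a large S makes one posterior sample sufficient.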

E.3 Variational Bayes (VB)
Using the standard VB approach [25] and the spike-and-slab prior on the weights (Eq. 77), we can approximate the posterior P(S, W|S obs, b) using a fully factorized distribution
\[
Q\left(S, W\right) = \prod_{i,t} q_{i,t}\left(S_{i,t}\right) \prod_{k,r} q_{k,r}\left(W_{k,r}\right),
\]
in which the factors are calculated using
\[
\ln q_{i,t}\left(S_{i,t}\right) = \mathrm{E}_{/S_{i,t}} \ln P\left(S, W \mid S^{\mathrm{obs}}, b\right) + \mathrm{const},
\]
\[
\ln q_{k,r}\left(W_{k,r}\right) = \mathrm{E}_{/W_{k,r}} \ln P\left(S, W \mid S^{\mathrm{obs}}, b\right) + \mathrm{const},
\]
where, for any variable X, E_{/X} denotes an expectation over all variables except X, performed using the factorized distribution Q. This calculation proceeds almost identically to the derivation of the Gibbs samplers (section E.1), except that we need to perform the expectation E_{/X} over the results. These integrals can be calculated using the CLT and the approximations given in [23, sections 2.4 and 4.1].
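The fully factorized update has the same coordinate-ascent structure in any binary model. As an illustration (a small Boltzmann machine stands in for the spike/weight posterior; the function and variable names are ours), here is the mean-field fixed-point iteration:

```python
import numpy as np

def mean_field_update(J, h, q, n_sweeps=50):
    """Coordinate-ascent mean-field (VB) for binary variables s_i in
    {0, 1} with ln P(s) = 0.5 * s'Js + h's + const (J symmetric, zero
    diagonal).  Under the fully factorized q(s) = prod_i q_i(s_i), the
    update ln q_i(1) - ln q_i(0) = E_{/s_i}[ln P(s_i=1) - ln P(s_i=0)]
    reduces to the "mean field" h_i + sum_j J_ij q_j, so each q_i is a
    sigmoid of the expected input from the other factors."""
    for _ in range(n_sweeps):
        for i in range(len(h)):
            field = h[i] + J[i] @ q - J[i, i] * q[i]   # exclude self-term
            q[i] = 1.0 / (1.0 + np.exp(-field))        # sigmoid update
    return q
```

Note that the update uses only the means q_j of the other factors, which is exactly why this approximation discards the spike correlations discussed below.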
The factorized form in Eq. 86 assumes that, approximately, all the weights and all the spikes are independent. With this method, which is somewhat faster than MCMC, the mean firing rates of the neurons could be estimated reasonably well, as exemplified in Fig S3, middle. Unfortunately, the weight matrix W cannot be estimated well if fewer than 30% of the neurons are observed in each timestep. This is in contrast to the other methods, for which the observed fraction can be arbitrarily low. We believe this happens because the VB approximation for the spikes ignores spike correlations, which determine the sufficient statistics in this case (as suggested by Eq. 8). However, it is possible that more sophisticated factorized forms (which do not assume spike independence) would still work.

Figure S3. Estimating the spikes using a latent-variable approach with known network connectivity (W, b): visualization. Each raster plot shows the spiking activity in a circular network with local connectivity (spikes are in white). First we show the true spiking activity (top). For neurons 10−20 and times 18−33 the spiking activity is unobserved. We then estimate this activity using the Variational Bayes method (VB, middle) or Gibbs sampling (bottom). In the unobserved rectangle, the shade indicates the probability of a spike, between 0 (black) and 1 (white). For visibility, we use saturated shades in the two bottom panels (i.e., values above 0.4 are shown in white). Each neuron in the network has excitatory connections to its neighbours and self-inhibition. Given this connectivity, both methods give reasonable estimates near the edges of the unobserved rectangle. Further from the edges, the estimate becomes less certain and converges to the mean spike probability for that neuron.