Abstract
The ability to record from increasingly large numbers of neurons, and the increasing attention being paid to large-scale neural network simulations, demand computationally fast algorithms for computing relevant statistical measures. We present an O(n) algorithm for calculating the Kendall correlation of spike trains, a correlation measure that is increasingly recognized as an important tool in neuroscience. We show that our method is around 50 times faster than the O(n ln n) method that is the current standard for quickly computing the Kendall correlation. In addition to providing a faster algorithm, we emphasize the role that taking the specific nature of spike trains into account had in reducing the run time. We imagine that many other useful algorithms can be sped up even more significantly when this is taken into consideration. A MATLAB function implementing the method described here has been made freely available online.
Citation: Redman W (2019) An O(n) method of calculating Kendall correlations of spike trains. PLoS ONE 14(2): e0212190. https://doi.org/10.1371/journal.pone.0212190
Editor: Bryan C. Daniels, Arizona State University & Santa Fe Institute, UNITED STATES
Received: June 12, 2018; Accepted: January 29, 2019; Published: February 14, 2019
Copyright: © 2019 William Redman. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: There are no data for this paper. The code used to evaluate the two principal methods discussed in the manuscript has been made available on GitHub (https://github.com/william-redman/Kendall-Correlation-for-Large-Spike-Trains) as mentioned in S1 Code.
Funding: The author received no specific funding for this work.
Competing interests: The author has declared that no competing interests exist.
Introduction
The Kendall correlation was first introduced by Maurice Kendall in 1938 [1]. As a rank correlation, it takes into account the specific ordering of the elements of the sets it is correlating. A Kendall correlation, τ, equal to 1 is interpreted as the elements in the two sets being ordered in the same way. τ = −1 is interpreted as the elements in the two sets being ordered exactly oppositely. And τ = 0 is interpreted as the ordering of the two sets having no relation to one another.
Despite being used in a number of other scientific fields [2–4], it is only recently that the Kendall correlation has started to become appreciated, and implemented, in neuroscience. In particular, due to the usual sparseness of spike trains (i.e. the large number of zeros), the Kendall correlation has been shown to be particularly appropriate for computing pairwise correlations between spike trains, especially as compared to Pearson’s correlation [5–7]. Recently, it was used to explore the place field structure of place cells in the hippocampus [7], and generally pairwise correlations can be useful for revealing aspects of the behavior of the recorded, or constructed (in the case of computational/theoretical studies), networks. We note that for the remainder of the paper, by spike train we mean specifically a vector of length n whose ith element is a 1 if the corresponding neuron fired at least once during the ith time bin of the recorded interval and 0 otherwise. This is a frequently used way to talk about spike trains and is appropriate if firing is particularly sparse or if the time bin size is sufficiently small.
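As an illustration of the binary spike train representation just described, the following Python sketch converts a list of spike times into a 0/1 vector. The function name `bin_spike_times`, its parameters, and the example spike times are hypothetical and are not part of the published MATLAB code; spikes outside the recorded interval are simply ignored here, which is an assumption.

```python
def bin_spike_times(spike_times, t_start, t_stop, bin_size):
    """Return a list of 0/1 values, one per time bin.

    Element i is 1 if at least one spike falls in the half-open bin
    [t_start + i*bin_size, t_start + (i+1)*bin_size), and 0 otherwise.
    """
    n_bins = int(round((t_stop - t_start) / bin_size))
    train = [0] * n_bins
    for t in spike_times:
        i = int((t - t_start) // bin_size)
        if 0 <= i < n_bins:  # assumption: drop spikes outside the interval
            train[i] = 1     # several spikes in one bin still give a single 1
    return train


# Hypothetical spike times (in seconds), binned at 0.5 s over a 3 s interval.
example_train = bin_spike_times([0.1, 0.2, 1.7, 2.9], 0.0, 3.0, 0.5)
```

Note that the two spikes at 0.1 s and 0.2 s fall in the same bin and produce a single 1, which is why this representation is most appropriate for sparse firing or small bins.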
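A simple, non-optimized way of computing the Kendall correlation of two row vectors, X and Y, is MATLAB’s function corr(X, Y, 'Type', 'Kendall'). The MATLAB documentation [8] defines the Kendall correlation as
$$\tau = \frac{2K}{n(n-1)} \tag{1}$$
where
$$K = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \xi^{*}(X_i, X_j, Y_i, Y_j),$$
and
$$\xi^{*}(X_i, X_j, Y_i, Y_j) = \begin{cases} 1 & \text{if } (X_i - X_j)(Y_i - Y_j) > 0 \\ 0 & \text{if } (X_i - X_j)(Y_i - Y_j) = 0 \\ -1 & \text{if } (X_i - X_j)(Y_i - Y_j) < 0 \end{cases} \tag{2}$$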
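The pairwise definition can be sketched directly in Python. This is a minimal illustration and not MATLAB's implementation: it assumes τ = 2K / (n(n − 1)), with K the sum of ξ* over all pairs i < j and ξ* the sign of (Xi − Xj)(Yi − Yj). The function names are hypothetical.

```python
def xi_star(xi, xj, yi, yj):
    """Sign of (xi - xj) * (yi - yj): 1, 0, or -1."""
    prod = (xi - xj) * (yi - yj)
    return (prod > 0) - (prod < 0)


def kendall_tau_naive(x, y):
    """Kendall correlation without tie correction, by visiting every pair."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):  # n(n - 1)/2 pairs in total
            k += xi_star(x[i], x[j], y[i], y[j])
    return 2.0 * k / (n * (n - 1))
```

The double loop makes the quadratic cost of this definition explicit.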
However, as additionally stated, MATLAB’s function also has a normalization constant in the calculation of τ that adjusts for ties [8]. A Kendall correlation that takes this additional consideration into account is often referred to as τb in the literature [9]. Therefore, the true way in which MATLAB calculates the Kendall correlation of the row vectors X and Y is
$$\tau_b = \frac{K}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \tag{3}$$
where n0 = n(n − 1)/2, n1 = ∑i ti(ti − 1)/2, and n2 = ∑j uj(uj − 1)/2. The sums of n1 and n2 are over all the distinct values X and Y take (respectively), and ti is the number of elements in X equal to the ith distinct value of X (uj is the same, but for Y).
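A brute-force Python sketch of the tie-corrected correlation may help make the roles of n0, n1, and n2 concrete. It assumes τb = K / sqrt((n0 − n1)(n0 − n2)) with the tie terms defined as in the text; the function names are hypothetical, and returning NaN when one vector is constant is a choice made for this sketch.

```python
from collections import Counter
from math import sqrt


def tie_term(values):
    """Sum of t*(t - 1)/2 over the groups of equal values in `values`."""
    return sum(t * (t - 1) // 2 for t in Counter(values).values())


def kendall_tau_b_naive(x, y):
    """Tie-corrected Kendall correlation by visiting every pair, O(n^2)."""
    n = len(x)
    if n != len(y):
        raise ValueError("x and y must have the same length")
    k = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            k += (prod > 0) - (prod < 0)
    n0 = n * (n - 1) // 2
    n1 = tie_term(x)
    n2 = tie_term(y)
    denom = (n0 - n1) * (n0 - n2)
    if denom == 0:
        return float("nan")  # assumption: undefined when a vector is constant
    return k / sqrt(denom)
```

For the binary vectors X = [1, 0, 0, 0] and Y = [1, 0, 0, 1], this gives K = 2, n0 = 6, n1 = 3, and n2 = 2, so τb = 2/√12 ≈ 0.577.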
As can be seen from the definition of K, calculating τ requires summing over every pair of positions in X and Y (n(n − 1)/2 pairs in total), which means that the run time is O(n²). For long spike trains, this results in a long computation time. For this reason, a faster, O(n ln n) method was developed [10], which makes use of a mapping between sorting and the Kendall correlation. Additional work has used sorting and balanced tree structures to decrease the run time of other O(n ln n) methods [11]. These methods (below we consider specifically Knight’s method [10]) are powerful because, like the O(n²) method implemented by MATLAB, they are valid for arbitrary vectors, but this generality is unnecessary for computing the Kendall correlation of spike trains. Below, we take the inherent structure of spike trains (that is, that their elements take values only from {0, 1}) into consideration to derive a faster method of calculating Kendall correlations specific to spike trains. We show that our new method is O(n) and then examine how much faster it is than Knight’s method under various conditions.
Materials and methods
As mentioned above, the motivating idea for the following method is that, because spike trains take values only in {0, 1}, we might be able to speed up the calculation of the Kendall correlation. In particular, we show that we can write an explicit formula for K (from Eq (1)) that can be evaluated very quickly: in fact, in O(n) time.
Considering Eq (2), we see that there are two principal cases we need to consider when calculating K: the case where Xi and Xj are in the same order as Yi and Yj (i.e. where ξ*(Xi, Xj, Yi, Yj) = 1), and the case where they are in the opposite order (i.e. where ξ*(Xi, Xj, Yi, Yj) = −1). The third case, ξ*(Xi, Xj, Yi, Yj) = 0, does not contribute to the value of K. We now consider the two contributing cases separately.
Same order case
This case happens only when Xi = Yi = 1 and Xj = Yj = 0, or when Xi = Yi = 0 and Xj = Yj = 1 (for i < j).
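We define the active set of X to be
$$A_X = \{\, i \mid X_i = 1 \,\} \tag{4}$$
where 1 ≤ i ≤ n. We similarly define the active set of Y, AY.
We now define the combined active set, or the set of positions in the spike trains such that Xi = Yi = 1, as
$$A_{XY} = A_X \cap A_Y \tag{5}$$
Now let N = {1, 2, …, n}. We define the silent set of X as
$$S_X = N \setminus A_X \tag{6}$$
where ⋅\⋅ is the set minus operator. We similarly define the silent set of Y, SY.
We now define the combined silent set, or the set of positions in the spike trains such that Xj = Yj = 0, as
$$S_{XY} = S_X \cap S_Y \tag{7}$$
With Eqs (5) and (7), we can find the contribution to K from this case. The number of ways ξ*(Xi, Xj, Yi, Yj) = 1, K+, is
$$K_{+} = \sum_{i \in A_{XY}} \left|\{\, j \in S_{XY} \mid j > i \,\}\right| + \sum_{i \in S_{XY}} \left|\{\, j \in A_{XY} \mid j > i \,\}\right| \tag{8}$$
where |⋅| is the function that returns the number of elements of the set. We see clearly that the first sum in K+ is the number of ways Xi = Yi = 1 and Xj = Yj = 0, and the second sum in K+ is the number of ways Xi = Yi = 0 and Xj = Yj = 1.
Every pair made up of one position from the combined active set and one position from the combined silent set is counted exactly once in Eq (8): in the first sum if the active position comes first, and in the second sum otherwise. By this relationship between the two sums, we can simplify K+ to be
$$K_{+} = |A_{XY}|\,|S_{XY}| \tag{9}$$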
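The same order case can be checked numerically with a short Python sketch. It assumes that K+ equals the size of the combined active set times the size of the combined silent set, and compares that product with a direct count of concordant pairs. All names and the example vectors are hypothetical, and positions are 0-based here rather than 1-based as in the text.

```python
def combined_sets(x, y):
    """Return the combined active and combined silent sets of positions."""
    positions = set(range(len(x)))                  # N, using 0-based positions
    active_x = {i for i in positions if x[i] == 1}  # active set of X
    active_y = {i for i in positions if y[i] == 1}  # active set of Y
    silent_x = positions - active_x                 # silent set of X
    silent_y = positions - active_y                 # silent set of Y
    return active_x & active_y, silent_x & silent_y


def k_plus(x, y):
    """Concordant-pair count from set sizes alone, in O(n)."""
    combined_active, combined_silent = combined_sets(x, y)
    return len(combined_active) * len(combined_silent)


def k_plus_pairwise(x, y):
    """Concordant-pair count by visiting every pair, in O(n^2)."""
    n = len(x)
    return sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n)
        if (x[i] - x[j]) * (y[i] - y[j]) > 0
    )


x_example = [1, 0, 0, 1, 0, 1]
y_example = [1, 0, 1, 1, 0, 0]
# Both trains are active at positions {0, 3} and silent at positions {1, 4}.
```

For these vectors both functions return 4, the product of the two set sizes.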
Opposite order case
This case happens only when Xi = Yj = 1 and Xj = Yi = 0, or Xi = Yj = 0 and Xj = Yi = 1 (for i < j).
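We define the difference of X as
$$\Delta_X = A_X \setminus A_Y \tag{10}$$
We similarly define the difference of Y, ΔY. ΔX is the set of positions in the spike trains where Xi = 1 and Yi = 0 (vice versa for ΔY).
With these we can now find the contribution to K from this case. The number of ways ξ*(Xi, Xj, Yi, Yj) = −1, K−, is
$$K_{-} = \sum_{i \in \Delta_X} \left|\{\, j \in \Delta_Y \mid j > i \,\}\right| + \sum_{i \in \Delta_Y} \left|\{\, j \in \Delta_X \mid j > i \,\}\right| \tag{11}$$
where the first sum in K− is the number of pairs (i, j) (where i < j) such that Xi = Yj = 1 and Xj = Yi = 0, and the second sum in K− is the number of pairs (i, j) such that Xi = Yj = 0 and Xj = Yi = 1.
Again, the sums are related (as they were in Eq (8)): every pair made up of one position from ΔX and one from ΔY is counted exactly once. We can therefore re-write K− as
$$K_{-} = |\Delta_X|\,|\Delta_Y| \tag{12}$$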
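The opposite order case can be checked in the same way. The Python sketch below assumes that K− equals the number of positions where only X is active times the number of positions where only Y is active, and compares this with a direct count of discordant pairs. The function names and example vectors are hypothetical.

```python
def k_minus(x, y):
    """Discordant-pair count from the two difference sets, in O(n)."""
    pairs = list(zip(x, y))
    delta_x = {i for i, (xi, yi) in enumerate(pairs) if xi == 1 and yi == 0}
    delta_y = {i for i, (xi, yi) in enumerate(pairs) if xi == 0 and yi == 1}
    return len(delta_x) * len(delta_y)


def k_minus_pairwise(x, y):
    """Discordant-pair count by visiting every pair, in O(n^2)."""
    n = len(x)
    return sum(
        1
        for i in range(n - 1)
        for j in range(i + 1, n)
        if (x[i] - x[j]) * (y[i] - y[j]) < 0
    )


x_example = [1, 1, 0, 0, 1, 0]
y_example = [0, 1, 1, 0, 0, 1]
# Only X is active at positions {0, 4}; only Y is active at positions {2, 5}.
```

Both functions return 4 for these vectors. K then follows as the same order count minus this opposite order count.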
Ties
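The final thing needed in order to calculate τ is the number of tied pairs in X and Y, n1 and n2. This is easy in the case of spike trains, as the number of ties for the value 1 is just the sum of all the elements in the train, and the number of ties for the value 0 is just n minus that sum. Therefore, using the equation given for n1, we have
$$n_1 = \frac{1}{2}\left[\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} X_i - 1\right) + \left(n - \sum_{i=1}^{n} X_i\right)\left(n - \sum_{i=1}^{n} X_i - 1\right)\right] \tag{13}$$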
The same is true for n2 (with Y in place of X).
Therefore, with Eqs (9), (12) and (13), we can write the Kendall correlation, Eq (3), of two neural spike trains as
$$\tau_b = \frac{K_{+} - K_{-}}{\sqrt{(n_0 - n_1)(n_0 - n_2)}} \tag{14}$$
where K+, K−, n0, n1, and n2 can be found with the formulas we have given for them. Note that Eqs (5), (7), (9), (10), (12) and (13) are all linear in n, i.e. O(n). Therefore, Eq (14) is O(n).
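Putting the pieces together, a single pass over the two spike trains is enough to obtain every count the formula needs. The Python sketch below is one possible arrangement and is not the author's MATLAB function: it assumes K = K+ − K−, with K+ and K− given by products of counts as described above, and it uses hypothetical names throughout. Returning NaN for a constant spike train is a choice made for this sketch.

```python
from math import sqrt


def kendall_tau_b_binary(x, y):
    """Tie-corrected Kendall correlation of two 0/1 vectors in O(n)."""
    n = len(x)
    if n != len(y):
        raise ValueError("spike trains must have the same length")

    # One pass: classify each position by which trains are active there.
    both_active = only_x = only_y = 0
    for xi, yi in zip(x, y):
        if xi and yi:
            both_active += 1
        elif xi:
            only_x += 1
        elif yi:
            only_y += 1
    both_silent = n - both_active - only_x - only_y

    k_plus = both_active * both_silent   # same order pairs
    k_minus = only_x * only_y            # opposite order pairs

    ones_x = both_active + only_x        # sum of the elements of x
    ones_y = both_active + only_y        # sum of the elements of y
    n0 = n * (n - 1) // 2
    n1 = (ones_x * (ones_x - 1) + (n - ones_x) * (n - ones_x - 1)) // 2
    n2 = (ones_y * (ones_y - 1) + (n - ones_y) * (n - ones_y - 1)) // 2

    denom = (n0 - n1) * (n0 - n2)
    if denom == 0:
        return float("nan")  # assumption: undefined for a constant train
    return (k_plus - k_minus) / sqrt(denom)
```

As a worked check, X = [1, 0, 0, 1, 0, 1] and Y = [1, 0, 1, 1, 0, 0] give K+ = 4, K− = 1, n0 = 15, and n1 = n2 = 6, so τb = 3/√81 = 1/3.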
Comparison
To compare the presented method, Eq (14), with Knight’s method and MATLAB’s method, we created random binary vectors with a specified “sparseness”. Here sparseness refers to the expected fraction of 1s present in the vectors (or, in the neural context, the expected activity over a given time interval). We generated these vectors by using MATLAB’s rand function, with which we generated 1 × n vectors with elements uniformly drawn from (0, 1) [12]. We then set every element in each vector that had a value less than the sparseness we specified to 1, and all other elements to 0. Put another way, if Xrand was our random 1 × n vector with elements drawn from (0, 1), then we used the transform
$$X_i = \begin{cases} 1 & \text{if } X_{\mathrm{rand},i} < \text{sparseness} \\ 0 & \text{otherwise} \end{cases} \tag{15}$$
We then used MATLAB’s method, Knight’s method, and our method to calculate the Kendall correlation of X and Y (where Y was similarly generated). To record the time it took for each method, we used MATLAB’s built-in tic and toc functions [13]. We did all of the calculations on a 2014 MacBook Air (1.4 GHz Intel Core i5) running MATLAB 2015a.
For details of how we implemented Knight’s method, see the S1 Text.
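The test vectors and timing procedure can be sketched in Python as follows. This is an analogue of the MATLAB procedure described above, not the code used for the paper: `random.Random.random` draws from [0, 1) rather than (0, 1), `time.perf_counter` stands in for tic and toc, and the function names, seed, and stand-in workload are hypothetical.

```python
import random
import time


def random_spike_train(n, sparseness, rng):
    """Binary vector of length n whose expected fraction of 1s is `sparseness`."""
    return [1 if rng.random() < sparseness else 0 for _ in range(n)]


def time_call(func, *args):
    """Return (result, elapsed seconds) for a single call of func(*args)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def count_coactive(x, y):
    """Stand-in workload: number of bins in which both trains are active."""
    return sum(1 for xi, yi in zip(x, y) if xi and yi)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
x = random_spike_train(10_000, 0.05, rng)
y = random_spike_train(10_000, 0.05, rng)
coactive, elapsed = time_call(count_coactive, x, y)
```

Any correlation function taking two vectors could be passed to `time_call` in place of the stand-in, and repeating the call over many random pairs gives a mean and standard deviation of the run time.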
Results
The results of comparing our method to Knight’s and MATLAB’s methods are shown in Fig 1. Unsurprisingly, both our method and Knight’s method show a considerable advantage over the O(n²) method that is implemented by MATLAB [8] (Fig 1a). However, our method is clearly faster than Knight’s. Importantly, this holds true for a range of sparseness values (Fig 1b), although our method slows down slightly for larger sparseness values, while Knight’s method does not. Our method is on average ≈ 35 times faster for a sparseness of 25% and ≈ 60 times faster for a sparseness of 1%. Because a sparseness of 25%, the maximum we tested, is already unrealistically high for any neural simulation or recording, our method is faster than Knight’s throughout the neurally plausible regime.
Fig 1. (a) The run time as a function of spike train length using Knight’s method (black), our method (red), and the standard MATLAB method (green) for a sparseness of 5%. N = 10 and error bars are standard deviation. (b) The run time as a function of spike train length for different sparseness values: dotted line (1%), dashed line (5%), solid line (25%). N = 100, error bars are standard deviation, and colors are the same as in (a).
Finally, for all the correlations between spike trains we computed, we checked that the values given by our method and by Knight’s method were within 10⁻¹² of the value given by MATLAB’s Kendall correlation function (see Table 1). Therefore, we feel confident that our method is correct and equivalent (up to machine error) to MATLAB’s method.
Table 1. Kendall correlation of the spike trains listed at the top of the table (both with length 10⁴) for the three methods.
Discussion
We have presented a novel method to calculate Kendall correlations of large spike trains, and have demonstrated its advantage (in terms of computation time) over the standard for fast Kendall correlation computation [10]. We achieved this by specifically taking the structure of spike trains (the fact that they are made up of 1s and 0s) into consideration, and deriving explicit formulas for the components of the Kendall correlation (Eqs (9), (12) and (13)). These formulas are all linear in n, meaning our method is O(n), unlike Knight’s method, which is O(n ln n). We have also, by way of computation, provided evidence that our method is correct and equivalent (up to machine error) to MATLAB’s standard method.
With a significantly faster method to compute the Kendall correlation between large spike trains, we hope that the Kendall correlation will become a more accessible tool for neuroscience. Although algorithms similar to Knight’s can be implemented in ways (as explored in [11]) that may be faster than the method provided here, the simplicity of our method (a few linear equations) makes it much more appealing to neuroscientists who have limited technical knowledge of and/or interest in computer science. We imagine it will be especially useful in computational/theoretical studies where large, sparse spike trains are frequently generated and where pairwise correlations provide insight into the complex properties of the network. We hope that, because pairwise correlations over significantly longer time intervals (or equivalently, between spike trains of longer lengths) can now be calculated quickly, more in-depth analysis of generated networks (in addition to analysis of observed/recorded networks) will be achieved.
Finally, we hope that our results make clear the usefulness of considering specifically the structure of spike trains when calculating certain quantities. We’re sure many other measures can be significantly sped up when taking this into consideration.
Acknowledgments
We thank Eliott Levy for fruitful discussion and mentorship. We thank the reviewers for their helpful and constructive comments that pushed us towards making our algorithm O(n). We dedicate this paper to Prof. David Cai, who was among the first to inspire us towards research in neural science.
References
- 1. Kendall M. A New Measure of Rank Correlation. Biometrika 1938; 30 (1–2): 81–89.
- 2. Slamon DJ, Godolphin W, Jones LA, Holt JA, Wong SG, Keith DE, et al. Studies of the HER-2/neu proto-oncogene in human breast and ovarian cancer. Science 1989; 244: 707–712. pmid:2470152
- 3. Giraudet AL, Al Ghulzan A, Auperin A, Leboulleux S, Chehboun A, Troalen F, et al. Progression of medullary thyroid carcinoma: assessment with calcitonin and carcinoembryonic antigen doubling times. Eur J Endocrinol 2008; 158:239–246.
- 4. Kelder T, Stroeve JH, Bijlsma S, Radonjic M, Roeselers G. Correlation network analysis reveals relationships between diet-induced changes in human gut microbiota and metabolic health. Nutrition & Diabetes 2014 Jun 30;4:e122.
- 5. Press WH, Teukolsky SA, Vetterling WT, Flannery BP. Numerical Recipes: The Art of Scientific Computing. Cambridge: Cambridge Univ. Press 2007.
- 6. Soletta JH, Farfán FD, Felice CJ. Measuring Spike Train Correlation with Non-Parametric Statistics Coefficient. IEEE Latin America Transactions 2015 Dec
- 7. Neymotin SA, Talbot ZN, Jung JQ, Fenton AA, Lytton WW. Tracking recurrence of correlation structure in neuronal recordings. J. Neurosci. Methods 2017; 275: 1–9. pmid:27746231
- 8. https://www.mathworks.com/help/stats/corr.html
- 9. Agresti A. Analysis of Ordinal Categorical Data. 2nd ed. New York: John Wiley & Sons. 2010. ISBN 978-0-470-08289-8. https://doi.org/10.1002/9780470594001
- 10. Knight WR. A computer method for calculating Kendall’s tau with ungrouped data. J. Am. Stat. Assoc. 1966; 61 (314): 436–439.
- 11. Christensen D. Fast algorithms for the calculation of Kendall’s τ. Computational Statistics 2005; 20: 51–62.
- 12. https://www.mathworks.com/help/matlab/ref/rand.html
- 13. https://www.mathworks.com/help/matlab/ref/tic.html