Cluster tendency assessment in neuronal spike data

Sorting spikes from extracellular recordings into clusters associated with distinct single units (putative neurons) is a fundamental step in analyzing neuronal populations. Such spike sorting is intrinsically unsupervised, as the number of neurons is not known a priori. Therefore, spike sorting is an unsupervised learning problem that requires one of two approaches: specification of a fixed value c for the number of clusters to seek, or generation of candidate partitions for several possible values of c, followed by selection of a best candidate based on various post-clustering validation criteria. In this paper, we investigate the first approach and evaluate the utility of several methods for providing lower-dimensional visualization of cluster structure and for subsequent spike clustering. We also introduce a visualization technique called improved visual assessment of cluster tendency (iVAT) to estimate possible cluster structures in data without the need for dimensionality reduction. Experiments are conducted on two datasets with ground-truth labels. In data with a relatively small number of clusters, iVAT is beneficial for estimating the number of clusters to inform the initialization of clustering algorithms. With larger numbers of clusters, iVAT gives a useful estimate of the coarse cluster structure but sometimes fails to indicate the presumptive number of clusters. We show that noise associated with recording extracellular neuronal potentials can disrupt computational clustering schemes, highlighting the benefit of probabilistic clustering models. Our results show that t-Distributed Stochastic Neighbor Embedding (t-SNE) provides representations of the data that yield more accurate visualization of potential cluster structure to inform the clustering stage. Moreover, the clusters obtained using t-SNE features were more reliable than the clusters obtained using the other methods, which indicates that t-SNE can potentially be used both for visualization and to extract features to be used by any clustering algorithm.

1 Introduction

Let $X = \{x_1, \dots, x_n\} \subset \mathbb{R}^p$ denote a set of vector data representing $n$ spikes generated by one or multiple neurons; in our experiments the waveforms lie in either an 80- or 48-dimensional input space. The coordinates of $x_i$ are voltage samples that describe a spike event (they are always voltage samples in this article). The non-degenerate crisp $c$-partitions of the $n$ objects in a set $X$ can be represented by a $c \times n$ matrix $U \in M_{hcn}$, written in terms of the $c$ crisp subsets of $X$ (the clusters $X_i$). Finding clusters in $X$ comprises three steps: deciding how many clusters ($c$) to look for; constructing a set of candidate partitions $\{U \in M_{hcn}\}$ of $X$; and selecting a "best" partition from $CP$ (cf. equation (2) below) using a cluster validity index (CVI).

A classical linear method for dimensionality reduction is principal component analysis (PCA) [Theodoridis, 2009]. In order to find a low-dimensional subspace suitable for projecting the input data, PCA projects the data along the directions given by the leading eigenvectors of the data covariance, i.e., the directions associated with the largest eigenvalues of the sample covariance matrix. In other words, PCA seeks a low-dimensional representation that explains as much variance of the input data as possible. In neuroscience research, another common approach is to extract features of the waveforms that have a physical meaning: e.g., peak (P), valley (V), and energy (E) [Hattori et al., 2015, Truccolo et al., 2011], hereafter called PV-E. Another method, based on wavelet transforms, enables visualizing the data in the wavelet coefficient subspace and has been successfully implemented in clustering packages such as Waveclus and Combinato [Niediek et al., 2016, Quiroga et al., 2004].

Here, we will also consider two nonlinear dimensionality reduction techniques. The first of these is t-Distributed Stochastic Neighbor Embedding (t-SNE), developed by van der Maaten and Hinton [2008]. It works by converting Euclidean distances between high-dimensional input data into conditional probabilities. In doing so, t-SNE converts the geometric notion of similarity into a statistical concept: if $x_j$ is a neighbor of $x_i$, then the conditional probability $p_{j|i}$ is high. Then, t-SNE finds low-dimensional representations $y_i$ and $y_j$ of $x_i$ and $x_j$ by minimizing the discrepancy between the upspace conditional probabilities $p_{j|i}$ and the downspace conditional probabilities $q_{j|i}$, technically achieved by minimizing the Kullback-Leibler divergence between them. The objective of t-SNE is to minimize the sum of these divergences over all the data points.

Two features of t-SNE should be noted. First, it is not a linear projection like PCA but rather has a non-convex cost function, so its output may be different for different initializations. Second, the perplexity parameter limits the effective number of neighbors that each point is compared against; this limitation affects each data point separately based on the local density of the data. This is the feature that enables t-SNE to avoid crowding points in the center of the map, so that the cluster structure of the upspace data is often seen in the t-SNE downspace projection of neuronal waveform data. The optimal choice of perplexity depends on the number of points (spikes) in the dataset. We found that for neuronal datasets with thousands of spikes (data points), as long as the extreme values in the parameter ranges are not selected, the t-SNE algorithm is not very sensitive to changes in perplexity. On the other hand, the reliability of t-SNE visualizations seems to decrease as the number of samples decreases.
See [Mahallati et al., 2018] for an example.

We also consider another traditional nonlinear dimensionality reduction technique, the Sammon mapping [Sammon, 1969], which is one form of multidimensional scaling. Multidimensional scaling (MDS) seeks a low-dimensional embedding of the input data while preserving all pairwise Euclidean distances (in a more general setting, t-SNE can be interpreted as a form of probabilistic MDS). However, high-dimensional data usually lie on a low-dimensional curved manifold, as in the case of the Swiss roll [Tenenbaum et al., 2000]. In such cases, preserving pairwise Euclidean distances will not capture the actual neighboring relationships: the actual distance between two points over the manifold might be much larger than the distance measured by the length of a straight line connecting them, i.e., their Euclidean distance. Sammon mapping improves upon classical multidimensional scaling by directly modifying its original cost function, i.e., the distortion measure to be minimized. In particular, the Sammon mapping cost function weights the contribution of each pair of data points to the overall cost by the inverse of their pairwise distance in the original high-dimensional input space. In this way, Sammon mapping often preserves the local structure of the data better than classical multidimensional scaling.
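For concreteness, the conditional probabilities and the cost function that t-SNE inherits from SNE can be written as
$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad C = \sum_i \mathrm{KL}(P_i \,\|\, Q_i) = \sum_i \sum_j p_{j|i} \log \frac{p_{j|i}}{q_{j|i}},$$
where each bandwidth $\sigma_i$ is set so that $P_i$ attains a user-specified perplexity, and $q_{j|i}$ is the analogous downspace probability (t-SNE itself symmetrizes these probabilities and uses a Student-t kernel in the downspace) [van der Maaten and Hinton, 2008]. As an illustration of how such projections might be computed in practice, a minimal scikit-learn sketch follows; the array name `spikes` and the parameter values are our assumptions, not choices fixed by our pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# spikes: (n, p) array, one voltage waveform per row (hypothetical input file).
spikes = np.load("spikes.npy")

# Linear baseline: project onto the two leading principal components.
pca_2d = PCA(n_components=2).fit_transform(spikes)

# Nonlinear projection: perplexity is the main tuning knob; mid-range values
# are usually safe for datasets with thousands of spikes (see discussion above).
tsne_2d = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(spikes)
```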

While these five methods do not all produce lower-dimensional data with an analytic projection function, we will call all downspace data sets projections.

There are a number of imaging techniques that can be applied directly to the upspace data before clustering it. Here we describe the iVAT method of [Havens and Bezdek, 2012], which is a generalization of the original VAT algorithm given by [Bezdek and Hathaway, 2002]. Improved Visual Assessment of Tendency (iVAT) is a visualization tool that uses the dissimilarity matrix $D$ of the data to display potential cluster structure. The steps of the iVAT method are the following. The vectors in the dataset are represented as vertices in a complete graph, with the distances between them serving as the edge weights of the graph.

The algorithm first finds the longest edge in the graph. Then, starting at either end of that edge, it finds a minimal spanning tree (MST) of $D$ based on Prim's algorithm. Then, it reorders the rows and columns of $D$ according to the order in which the MST vertices were found; iVAT additionally replaces each dissimilarity with the corresponding path-based distance, i.e., the largest edge weight on the MST path connecting the two vertices. The reordered matrix is displayed as a grayscale image, in which dark blocks along the diagonal suggest clusters. The most important points about this display technique are that it is applied directly to (a distance matrix of) the upspace data, so there is no distortion of the structural information introduced by a feature extraction function from the upspace to a chosen downspace, and that iVAT preserves the physical meaning of the measured features.
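The following is a minimal NumPy sketch of the VAT reordering and the iVAT path-based transform in the spirit of [Havens and Bezdek, 2012]; it is our own rendering of the published recursions, not the authors' code:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def vat(D):
    """Reorder a square dissimilarity matrix D by a Prim-like MST vertex order."""
    n = D.shape[0]
    # Start from one endpoint of the largest dissimilarity in D.
    i, _ = np.unravel_index(np.argmax(D), D.shape)
    order, remaining = [i], set(range(n)) - {i}
    d = D[i].copy()                      # current distances from the growing tree
    while remaining:
        k = min(remaining, key=lambda v: d[v])   # Prim's next vertex
        order.append(k)
        remaining.remove(k)
        d = np.minimum(d, D[k])
    order = np.array(order)
    return D[np.ix_(order, order)], order

def ivat(D):
    """Transform a VAT-ordered matrix into iVAT path-based (min-max) distances."""
    R, order = vat(D)
    n = R.shape[0]
    Rp = np.zeros_like(R)
    for r in range(1, n):
        j = np.argmin(R[r, :r])          # nearest previously inserted vertex
        Rp[r, j] = R[r, j]
        for k in range(r):
            if k != j:
                Rp[r, k] = max(R[r, j], Rp[j, k])
        Rp[:r, r] = Rp[r, :r]            # keep the matrix symmetric
    return Rp, order

# Usage: Rp, order = ivat(squareform(pdist(X)))
```

Displaying `Rp` as a grayscale image (e.g., with matplotlib's `imshow`) yields the iVAT image; the number of dark diagonal blocks suggests the number of clusters.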

While any vector norm can be used to build an input matrix $D(X)$ from a set $X$ of feature vectors, the only distance used in this article is Euclidean distance. It is very important to understand that an iVAT image merely suggests that the input data has a certain number of clusters. Since iVAT can produce images from data of arbitrary dimensions, we can use it (or its scalable relative siVAT, Kumar et al. [2017]) to make a visual estimate of possible cluster structure in any upspace. While the iVAT algorithm is occasionally "wrong" (misleading), iVAT images usually provide some idea about the cluster structure of the input data [Bezdek, 2017], whether that data is labeled or unlabeled. This suggests that iVAT might be regarded as a tool for "taking a peek" at a specific type of data structure in the input space.

Cluster validity comprises computational models and algorithms that identify a "best" member amongst a set of candidate partitions
$$CP = \{U \in M_{hcn} : c_m \le c \le c_M\}, \tag{2}$$
where $c_m$ and $c_M$ are the minimum and maximum specified values of the numbers of clusters sought.
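As a sketch of this three-step recipe, one might generate candidate partitions over the range $c_m \le c \le c_M$ and keep the partition that maximizes a CVI; here we use the classical single-linkage/max-diameter form of Dunn's index, anticipating the definitions in the next paragraph and in the appendix. The helper names and the use of scikit-learn's KMeans are illustrative assumptions:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import KMeans

def dunn_index(X, labels):
    """Dunn's index: minimum inter-cluster (single-linkage) distance
    divided by the maximum cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    diameters = [pdist(c).max() if len(c) > 1 else 0.0 for c in clusters]
    separations = [cdist(ci, cj).min()
                   for i, ci in enumerate(clusters)
                   for cj in clusters[i + 1:]]
    return min(separations) / max(diameters)

def best_partition(X, c_m=2, c_M=10):
    """Generate candidate partitions for c = c_m..c_M; keep the best by the CVI."""
    candidates = {c: KMeans(n_clusters=c, n_init=10).fit_predict(X)
                  for c in range(c_m, c_M + 1)}
    return max(candidates.items(), key=lambda kv: dunn_index(X, kv[1]))
```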

The approach to identifying a "best" partition $U$ (and concomitant value of $c$) in $CP$ can be illustrated with Dunn's index, which calls the diameter of a subset $X_k$ the maximum distance between any two points in that subset ($\Delta(X_k)$), and the distance between subsets $X_i$ and $X_j$ the minimum distance between any two points of the two subsets ($\delta(X_i, X_j)$).

Figure 1 shows two cases from the dataset with c=3. The colors in Figure 1 correspond to the three data labels. Bear in mind, as you view this and subsequent figures, that in the real case the data are always unlabeled, so the projected data will be just one color, and the apparent structure will be much less evident than it seems to be in these figures. Figure 1(a) is a 'good' case in which all the algorithms map the spikes to projections with visually separable structure.

For the ten-subset mixture Z, the iVAT and t-SNE views suggest that Z contains 5 clusters. Subset Z_4 is isolated in view 7(b), but not more isolated than subset Z_2, so t-SNE is less assertive about the anomalous nature of Z_4 than iVAT is.

If the labels are available, reclustering might be applied to subsets 9, 7, 5, 8 and/or 3, 6, 10 to make a finer distinction between spike subsets. If the labels are unavailable, it is hard to see what can be inferred from the t-SNE projection about Z beyond the suggestion provided by view 7(c) that we should seek 5 clusters in Z.

We conclude this example with some general observations. First, the iVAT image is unique, whereas the t-SNE projection can vary with its initialization and parameter settings.

(Figure: (b) Z_10, mean profiles of the 10 subsets in Z.)

We now turn to dataset-2 to further investigate the limits of discernible spike subsets since, as mentioned previously, sets derived from dataset-2 are combinations of real spikes originating from pyramidal cells in rat hippocampus [Henze et al., 2000]. Mixtures can be built from them at c=2, c=3, and c=4. Figure 8 shows the four representative subsets (all nine subsets of the waveforms are shown in Figure 15).

For the c=4 mixture X1234, most of the projections do not display four separable groups, whereas t-SNE maps the four subsets with arguably enough clarity to declare that X1234 has four clusters. It can be argued that while the input has c=4 labelled subsets, the primary visual evidence does not support c=4, nor will there be a "best" set of clusters in the upspace at this value of c. In other words, just because the subsets have 4 labels does not guarantee that a cluster analysis of the data will agree. When you imagine the scatterplots in Figure 11 without colors, there are not four distinguishable clusters present.

(Figure 11: X1234 mixture at c=4; panels show the Waveforms, PV-E, Wavelet, Sammon, and t-SNE views (a) and the iVAT image (b).)
In summary, we note again that although the subsets of spikes were obtained from independently recorded units, the labeled subsets need not form distinguishable clusters in dataset-2. (Recall that many authors refer to c-means as "k-means," where k is the notation chosen for the number of clusters: either notation is correct. We prefer to reserve k for the universally accepted description of the k-nearest-neighbor rule.)

The c-means clustering algorithm is executed on each low-dimensional representation obtained with the five methods. The number of clusters to be generated (i.e., c) is set equal to the number of labeled subsets in the ground-truth partition (i.e., 2, 3, 4, ..., 9 for cases in dataset-2 and 3, 4, 5, ..., 21 for cases in dataset-1). For each computed partition, we calculate the agreement, measured by the ARI, between the computed waveform memberships and the memberships given by the ground-truth partition (recall that our data is labeled). The ARI maximizes at 1, and clustering the 2D t-SNE downspace data provides c-means clusters that, on average, match the ground-truth partitions slightly better than c-means clusters in the input space.

In this section we compare the performance of c-means on t-SNE projections of the input data, which achieved the best overall ARI ranking among the six methods discussed in Section 3.2.1, to partitions of the input data obtained with a well-known spike sorting algorithm called Osort.
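A hedged sketch of this evaluation loop, with scikit-learn's KMeans standing in as the c-means implementation and hypothetical array names:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score

# spikes: (n, p) waveforms; true_labels: ground-truth subset labels (assumed files).
spikes = np.load("spikes.npy")
true_labels = np.load("labels.npy")

c = len(np.unique(true_labels))   # c is fixed to the number of labeled subsets

# c-means on the 2D t-SNE projection ...
tsne_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(spikes)
ari_tsne = adjusted_rand_score(true_labels,
                               KMeans(n_clusters=c, n_init=10).fit_predict(tsne_2d))

# ... versus c-means directly in the upspace (input space).
ari_up = adjusted_rand_score(true_labels,
                             KMeans(n_clusters=c, n_init=10).fit_predict(spikes))
print(f"ARI (t-SNE 2D): {ari_tsne:.3f}   ARI (upspace): {ari_up:.3f}")
```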

Osort is a greedy algorithm that clusters data in the upspace (input space). The idea is that the limit in discerning two waveforms from each other is the noise in the data. Therefore, the algorithm compares the squared Euclidean distance between spikes with the square of the standard deviation of the raw signal as a measure of noise. If the distance is lower than the noise, the two waveforms are paired in the same cluster; if the distance is higher than the noise threshold, a new cluster is created.

This provides further evidence that the space in which the clustering is performed plays an important role in the quality of the unit sorting. Figure 13 shows that clustering with c-means in the 2D t-SNE data works fairly well with both real and simulated data. Figure 13 also shows that c-means clustering in the input space is second best amongst the methods tested, and we think that upspace clustering should be a default option in all cases. We remind readers that other clustering algorithms might yield different results, and that almost every clustering algorithm will deliver clusters at any value of c. The dependence of parameter choices such as perplexity on the data is a downside of using t-SNE to provide projected data for clustering.
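Returning to the thresholding rule described at the top of this section, a schematic rendering is given below. This is a simplification for illustration, not the original Osort implementation; in particular, `raw_std` and the running-mean update are our own choices:

```python
import numpy as np

def greedy_noise_sort(spikes, raw_std):
    """Assign each spike to the nearest cluster mean if their squared Euclidean
    distance falls below a noise-derived threshold; otherwise open a new cluster."""
    threshold = raw_std ** 2                    # squared noise level of the raw signal
    means, counts, labels = [], [], []
    for s in spikes:
        d2 = [np.sum((s - m) ** 2) for m in means]
        k = int(np.argmin(d2)) if d2 else -1
        if k >= 0 and d2[k] < threshold:
            counts[k] += 1                      # merge: update the running mean
            means[k] += (s - means[k]) / counts[k]
            labels.append(k)
        else:
            means.append(s.astype(float).copy())  # distance exceeds noise: new cluster
            counts.append(1)
            labels.append(len(means) - 1)
    return np.array(labels)
```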

In the first dataset, simulations were generated using average waveforms obtained from extracellular recordings.

Since cluster visualization in PCA or other feature spaces usually exhibits overlapping or mixed clusters, it is common to leave many non-outlier waveforms unsorted in order to obtain well-separated clusters. Our results show that the 2D t-SNE projection is the most reliable feature extraction scheme we tested. We believe that t-SNE works well because it is a probabilistic approach that is appropriate for neuronal data. In a nutshell, the variability caused by noisy spikes can often be circumvented by converting the deterministic dissimilarity measure between two waveforms into a probability of dissimilarity.

In this paper, we also demonstrated that visual assessment of c from iVAT images is often possible, highlighting that if clustering in the upspace is preferred, a visualization tool such as iVAT can be integrated into the package to inform the manual curation process.

However, there were cases, in particular when c was high (> 10), in which the iVAT image could not clearly indicate the presumptive number of clusters.

Let $X_i$ and $X_j$ be non-empty subsets of $\mathbb{R}^p$, and let $d : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^+$ be any metric on $\mathbb{R}^p \times \mathbb{R}^p$. Define the diameter $\Delta$ of $X_k$ and the set distance $\delta$ between $X_i$ and $X_j$ as:
$$\Delta(X_k|d) = \max_{x, y \in X_k} \{d(x, y)\},$$
$$\delta(X_i, X_j|d) = \min_{x \in X_i,\, y \in X_j} \{d(x, y)\} = \delta_{SL}(X_i, X_j|d).$$
$$\mathrm{GDI}_{33}(U|d) = \frac{\displaystyle\min_{1 \le i \le c}\ \min_{1 \le j \ne i \le c} \{\delta_3(X_i, X_j|d)\}}{\displaystyle\max_{1 \le k \le c} \{\Delta_3(X_k|d)\}},$$
where $v_k = \sum_{x \in X_k} x / |X_k|$ is the mean or centroid of the cluster. The notation $|d$ in equations 3, 4, 5, 6, and 7 for $\Delta$ and $\delta$ indicates that these formulas are valid for any metric $d$ on the input space.

To compare a computed partition U with a reference partition V, the ARI is defined in terms of counts of pairs of objects, where:

• a = Number of pairs of data objects belonging to the same subset in U and V.

• b = Number of pairs belonging to the same subset in V but to different subsets in U.

• c = Number of pairs belonging to the same subset in U but to different subsets in V.
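For reference, these pair counts, together with d, the number of pairs placed in different subsets in both U and V (the fourth count, completing the 2x2 pair table), determine the ARI. A small self-contained check against scikit-learn, with example labels of our own choosing:

```python
import numpy as np
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def pair_counts(u, v):
    """Count object pairs by co-membership in two flat partitions u and v."""
    a = b = c = d = 0
    for i, j in combinations(range(len(u)), 2):
        same_u, same_v = u[i] == u[j], v[i] == v[j]
        if same_u and same_v: a += 1   # together in both U and V
        elif same_v:          b += 1   # together in V only
        elif same_u:          c += 1   # together in U only
        else:                 d += 1   # apart in both
    return a, b, c, d

u = np.array([0, 0, 1, 1, 2, 2])       # example "computed" labels
v = np.array([0, 0, 1, 1, 1, 2])       # example "ground-truth" labels
a, b, c, d = pair_counts(u, v)

# Closed-form ARI from the four pair counts (Hubert-Arabie adjustment).
ari = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))
assert np.isclose(ari, adjusted_rand_score(u, v))
```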