Simultaneous coherent structure coloring facilitates interpretable clustering of scientific data by amplifying dissimilarity

The clustering of data into physically meaningful subsets often requires assumptions regarding the number, size, or shape of the subgroups. Here, we present a new method, simultaneous coherent structure coloring (sCSC), which accomplishes the task of unsupervised clustering without a priori guidance regarding the underlying structure of the data. sCSC performs a sequence of binary splittings on the dataset such that the most dissimilar data points are required to be in separate clusters. To achieve this, we obtain a set of orthogonal coordinates along which dissimilarity in the dataset is maximized from a generalized eigenvalue problem based on the pairwise dissimilarity between the data points to be clustered. This sequence of bifurcations produces a binary tree representation of the system, from which the number of clusters in the data and their interrelationships naturally emerge. To illustrate the effectiveness of the method in the absence of a priori assumptions, we apply it to three exemplary problems in fluid dynamics. Then, we illustrate its capacity for interpretability using a high-dimensional protein folding simulation dataset. While we restrict our examples to dynamical physical systems in this work, we anticipate straightforward translation to other fields where existing analysis tools require ad hoc assumptions on the data structure, lack the interpretability of the present method, or in which the underlying processes are less accessible, such as genomics and neuroscience.


INTRODUCTION
Modern science increasingly leverages machine learning on large datasets in the sciences, from electronic structure 1 to whole genome sequences 2 to distributed ocean sensor measurements 3 . Many of these datasets capture the dynamics of a system evolving in time, encoding trends with predictive power. Analyzing these datasets using a statistically robust and interpretable framework is a longstanding challenge that often involves clustering, or the unsupervised learning of coherent groups within the dataset.
Clustering is a notoriously challenging problem which, unlike supervised learning, features no direct measure of model success or validity and often requires heuristic assessments of effectiveness 4 . Thus, many classes of clustering algorithms have been developed for different problems. Some commonly used techniques include partitionbased methods such as k-means 5 , or their fuzzy counterparts 6 ; density-based methods such as DBSCAN 7 ; and connectivity-based methods such as divisive and agglomerative hierarchical clustering 8,9 .
Each of the aforementioned methods exhibits drawbacks with respect to a priori assumptions and algorithmic limitations. For example, partition-based clustering such as k-means requires the modeler to prescribe the number of partitions in a dataset before constructing the model. If multiple results are obtained from different values of k, these results are not interrelated; similarly, the a) Electronic mail: bhusic@stanford.edu b) Electronic mail: jodabiri@stanford.edu model cannot be used to determine relationships between the k clusters of a single model. While connectivitybased methods feature interrelated clusters, these also require the determination of where to cut the corresponding dendrogram to obtain the clustering result. Although density-based methods do not require a priori or a posteriori determination of the number of clusters to use, these methods are generally not robust to datasets containing a range of cluster densities 10 .
Here, we present a new method, simultaneous coherent structure coloring (sCSC), which minimizes the assumptions required in an unsupervised clustering task. sCSC focuses solely on the efficient separation of the most dissimilar states in the system, resulting in a quantitative structure that automatically captures the clusters in the dataset and their interrelationships without a priori knowledge of the system. The method is demonstrated for simulated and empirical systems of fluid and molecular dynamics, and its straightforward extension to other types of data is discussed.

BACKGROUND
The use of clustering for data analysis is ubiquitous. However, our motivation emerged from research on the identification of coherent structures from fluid dynamics. A variety of mathematical frameworks have been developed to identify coherent structures. The broad class of Lagrangian methods has been developed to describe flows that are unsteady (i.e., not well-summarized by instantaneous snapshots) in a way that is not dependent on their frame of reference (i.e., may not contain velocity or acceleration data). Two recent reviews of Lagrangian methods for the detection of coherent structures in fluids can be found in Refs. 11 and 12. Existing algorithms for coherent structure analysis that involve clustering exhibit various limitations. The fuzzy c-means approach presented in Ref. 3, for example, introduces a dynamic distance between particle trajectories, but ultimately requires the choice of c, i.e. how many clusters to use. To avoid explicitly choosing the number of coherent structures, a spectral clustering method was introduced in Ref. 13, which utilizes the spectral gap in the graph Laplacian to determine the number of coherent structures. However, it was subsequently shown in Ref. 14 that such a gap is only robust when the number of trajectories used exceeds 10 3 .
The method of coherent structure coloring (CSC), introduced in Refs. 14 and 15, was designed to address these and other limitations of clustering algorithms for coherent structure determination based on trajectories of particle spatial coordinates. In this work, we extend CSC in the context of its own limitations, as described below.
While we restrict the focus of the rest of the paper to clustering, this is not the only way to identify coherent structures from frame-independent particle trajectories. Over the past two decades, both the fluid and molecular dynamics communities have developed methods to identify "almost-invariant" sets through data-driven approximations to the Perron-Frobenius operator and its adjoint, the Koopman operator. The former, also referred to as the transfer or transition operator, propagates probability densities forward in time, whereas the latter propagates observables 16 .
For fluid systems, Dellnitz and Junge 17 , Froyland and Dellnitz 18 , and Mezić 19 used finite approximations to the Perron-Frobenius eigenfunctions to divide the space occupied by a dynamical system into almost-invariant sets and almost-cycles. At the same time, Schütte et al. 20 and Deuflhard et al. 21 introduced the use of approximations to the same operator, under a reversibility constraint, to determine the "metastable" states of molecular systems simulated on the atomic level.
A decade later, researchers in their respective fields independently determined equivalent algorithms for optimizing the estimation of the approximated eigenfunctions of the Perron-Frobenius and Koopman operators [22][23][24] . In both cases, linear models are generated using a datadriven, objective protocol to model highly nonlinear dynamics, where the eigenvectors and eigenvalues can be used to identify coherent sets.
In fact, in our final example in the current study, we utilize both types of methods on a molecular dynamics simulation dataset. We first create an optimized model that approximates the Perron-Frobenius operator 22 , which is difficult to visualize due to its high dimensionality. Then, to reduce our model to a visualizable and interpretable coarse-grained model, we use the clustering method presented in this work to identify the major coherent structures.

SIMULTANEOUS COHERENT STRUCTURE COLORING (sCSC)
Coherent structure coloring theory Many datasets we wish to explore in the physical sciences are generated by complex dynamical systems that exhibit instabilities and chaos. A key consequence of these processes is that states of the system (e.g. fluid particle trajectories or protein conformations) that are proximal but belonging to different coherent sets will separate exponentially faster as the system evolves than states belonging to the same cluster 25,26 .
On this basis, we previously hypothesized that these complex datasets can be clustered more robustly and effectively by amplifying state differences rather than state similarity 14,15 . The rationale for this approach is that the exponential separation of dissimilar states can provide more sensitive detection of clusters than a focus on state similarity, the latter requiring longer observation to become apparent 25,26 . In other words, we aim to identify coherent clusters indirectly, by prioritizing the separation of states with greatest dissimilarity and confidently ruling out the possibility of their membership in the same cluster. Those states that remain together after the separation process will subsequently emerge as belonging to the same cluster.
To amplify the dissimilarity between states, we solve an optimization problem to maximize a figure of merit z that quantifies total state dissimilarity in the dataset. Specifically, this figure of merit depends on a scalar value x i assigned to each state i in the system, where the squared difference in the scalar value assigned to each of pair states (e.g. (x 1 − x 2 ) 2 for states 1 and 2) is weighted by a measure of their dissimilarity. Formally, the clustering parameter z is given by where the summations of i and j are each taken over the full set of n states to be clustered, and a ij is an element of the adjacency matrix A containing the pairwise dissimilarity between states i and j. The construction of this matrix requires the calculation of n 2 = (n − 1)n/2 adjacency values. Example definitions of the (symmetric) pairwise dissimilarity can include the standard deviation for comparison of time-dependent signals, or the Jensen-Shannon divergence for comparison of probability distributions 27,28 . Both definitions represent measures of dissimilarity, where identical data points receive a ij = 0. Thus, assuming all data points are unique, the matrix A will be dense.
Given the adjacency matrix A, we seek to find state assignments x i that will maximize z, subject to the constraint that the magnitude of the n × 1 vector X containing the n scalar values x i must remain finite (e.g. to avoid the trivial case that maximizes z for x 1 = ∞ and x 2 = −∞). It is straightforward to show that the constrained optimization of equation 1 with finite X can be written as the generalized eigenvalue problem 29 : where D is a diagonal matrix with entries equal to the row-sums of the adjacency matrix, i.e. j a ij for each row i, and L = D − A is the graph Laplacian. This maximization is expressed using the Lagrangian form; see 14 for more details. Each of the n eigenvectors X n of equation 2 represents a solution that assigns to each state a scalar value x i based on its dissimilarity to the other states in the system. Those states with scalar assignments in each X that are most dissimilar can be presumed to belong to different clusters of the data when the data is partitioned according to that particular solution of equation 1. The eigenvector X 1 associated with the maximum eigenvalue λ 1 contains the scalar assignments x i that maximize the figure of merit z. This can be considered the single most effective partitioning of the dataset.
Given the analogy between this approach and the problem of fuzzy graph coloring 30 , wherein the connected nodes of a graph with large weights are assigned the most disparate values, we call this method coherent structure coloring (CSC) 14 . The technique has recently been demonstrated to successfully identify coherent eddies and jets associated with individual fluid particle trajectories in model geophysical flows 15 .

Simultaneous inclusion of multiple CSC solutions
A key limitation of the original CSC method 14 is that it relies on only a single eigenvector associated with the largest eigenvalue of equation 2. Hence, although multiple dimensions of dissimilarity are almost always present in real data, the method cannot simultaneously distinguish between multiple types of dissimilarity in a dataset. Moreover, the method applied to individual fluid particle trajectories in a subsequent study required a subjectively defined threshold to calculate eigenspace distances 15 , and it was shown to produce degenerate results for fluid particles in chaotic regions of the flow (cf. Importantly, because the adjacency matrix A introduced in the previous section is real and symmetric, the remaining eigenvectors associated with lesser eigenvalues provide additional, linearly independent (i.e. orthogonal) solutions for partitioning the data, albeit less effectively 31 . The key innovation of the present work is to use all of the eigenvectors in a top-down fashion to simultaneously cluster the system states.
To perform sCSC, we begin with the most effective partition given by the eigenvector associated with the maximum eigenvalue, and proceed through the set of orthogonal eigenvectors in order of decreasing eigenvalue. This approach simultaneously reveals the coherent sets of the system, and eliminates the subjective user intervention required in the previous method 15 .
Given a dissimilarity measure and resulting eigenvector solutions, the simultaneous coherent structure coloring (sCSC) algorithm begins by assigning to each state in the system a binary membership based on its corresponding scalar value along each orthogonal coordinate direction. A bifurcation is appropriate given that each one-dimensional coordinate has two extreme ends toward which the optimization of equation 1 pulls dissimilar states.
The states are bifurcated along each coordinate dimension by using agglomerative clustering with average linkage (although other linkages or splitting methods could be used for this step; see e.g. Ref. 32, Table 1.) and assigning to each state a value of 0 or 1 based on its membership within either of the two largest clusters of the resulting dendrogram. Each eigenvector contributes a separate bit to the binary code associated with each of the states in the system, with the leading bit corresponding to the largest eigenvalue and the remaining bits concatenated in decreasing order of their corresponding eigenvalues. Though we suggest using a bifurcation in general, the method does not prohibit the division of each eigenvector coordinate into three or more discrete bins, thus creating a k-way splitting and associated basek codes.
For each subsequent eigenvector, the bifurcation is performed for all data points (i.e. states), and each is assigned a 0 or a 1. For the kth eigenvector bifurcation, this enables 2 k numerically possible clusters (Fig 1). For example, the first splitting produces branches 0 and 1, and the second splitting enables the population of 2 2 unique clusters by appending 0 or 1 to each branch of the existing binary code ({00, 01, 10, 11}). However, it may be the case that the numerically possible branch 01 is not occupied because there is no data point that receives both a label of 0 in the first bifurcation and a label of 1 in the second bifurcation. Thus, we hypothesize that branch 0 (and its only occupied split, branch 00) evidences a coherent region of the data. In this way, a natural stopping criterion emerges from unoccupied bit codes during the binary splitting.
The binary codes generated by the aforementioned process can be visualized in a dendrogram, with each branch pair connecting those states that differ only at the least significant bit of their binary code. The length of each branch pair is a measure of the dissimilarity between the groups connected by the branches, and it corresponds to the value of the summation in equation (1) computed only over those states connected by the branches. Bits for progressively smaller eigenvalues are included at progressively lower levels of the dendrogram. The dissimilarity  (2)) that maximizes the dissimilarity measure. The solutions to the eigenvalue problem identify orthogonal processes in the system in order of their ability to separate the system; in this case, we have stars ↔ no stars, bright ↔ dark, and green ↔ blue, which we have asserted are decreasingly effective in explaining dissimilarity in this notional system. These three processes are bifurcated into two extremes (middle panel). Then, each state is encoded according to each bifurcation. For the first orthogonal process (stars ↔ no stars), we bifurcate the entire system. For the next orthogonal process (bright ↔ dark), we bifurcate the system separately and illustrate only states which become bifurcated along this division. For example, there is no state that contains stars and is dark, so branch 0 of the corresponding dendrogram is not further bifurcated. Finally, we bifurcate both branches 10 and 11 according to green or blue.
between the groups connected at lower levels therefore generally becomes smaller as well. While in principle the sCSC dendrogram should naturally truncate when no further splits occur, large amounts of data points or statistical noise may lead to insignificant (i.e., low-z) clusters or explore a combinatorially unfavorable number of splits.
In that case, one may choose to truncate the dendrogram after a certain number of eigenvectors according to visual inspection, or determine a cutoff based on the magnitude of z or the eigenvalue.
As in standard divisive and hierarchical clustering methods, the clustering models produced with sCSC are dependent on the adjacency definition supplied by the user. Because the adjacency matrix summarizes pairwise dissimilarities only, this has the benefit of not requiring adherence to the triangle inequality 33 -in fact, the data points need not exist in a well-defined space at all. However, with this flexibility comes the drawback that a poor dissimilarity metric may obscure patterns in the data. The dissimilarity measures used in this study have been shown to be effective in previous studies 14,28 , and in general may require domain-specific knowledge to determine for a given dataset.
In the next two sections, we apply sCSC to benchmark problems in fluid dynamics in order to demonstrate its effectiveness in identifying coherent structures in the absence of a priori assumptions. Then, we demonstrate the use of sCSC to determine the number and shape of flow structures involved in vortex ring entrainment using data obtained from empirical measurements the laboratory. Finally, to highlight the interpretability of the sCSC dendrogram for high-dimensional datasets, we use sCSC to visualize an interpretable representation of an atomistic protein folding simulation. Finally, we discuss the relationship of this method to other unsupervised clustering methods, and the possibility of extending sCSC beyond physical dynamical systems.

Quadruple-eddy simulated ocean flow
A key challenge in geophysical fluid dynamics is to extract and characterize coherent fluid motions from sparsely sampled turbulent flows of air or water. The coherent structures, often manifested as eddies and jets, can dominate the transport of heat, salt, nutrients, and pollutants 34,35 . Therefore, they can serve as the basis for low-order models that capture the salient physics 36 , or as a template for data assimilation into large-scale weather forecasting models 37 . Turbulent flow structures in the ocean also impact the behavior and ecology of marine life 38 .
Distributed sensor networks such as the Argo collection of 3800 ocean drifters 39 sample the flow field in a Lagrangian sense, recording the properties of the water as each drifter is carried by the prevailing currents. Here we demonstrate the capability of the sCSC method to extract coherent fluid structures from such collections of Lagrangian measurements.
To do so, we first apply sCSC to a common model of Lagrangian ocean drifters in a simplified flow field comprising only four eddies, the unsteady quadruple-eddy flow 3,11 . While this model represents a simplification of the full physics, it is valuable due to its common use for These drifters illustrate qualitatively different trajectories in the flow, including those that switch quadrants (dots), those that remain in a single eddy core (triangles), and those that spiral radially between the center and the boundary of a quadrant (squares). In panel (B), the initial positions of 3000 randomly initialized drifters are colored according to their initial quadrant in the flow. The drifters maintain their color assignment in panel (C), showing how the unsteady eddy motion leads to mixing of the drifters after the 4 periods of horizontal oscillation. The east-west oscillation of the eddy field leads horizontal mixing of the flow. The resulting sCSC dendrogram is shown in panel (D), with every position occupied by all 3000 drifters plotted in black dots in the corresponding inset branch plot (note that drifter positions often appear as continuous black patches due to the high density of overlapping positions occupied by the drifters.) The width of each branch is proportional to the fraction of the states that it contains. The corresponding binary code of each branch is labeled in black text, and the number of trajectories associated with each node is labeled in red text. The dendrogram is plotted to the seventh eigenvector, although labels below the fourth eigenvector are omitted for clarity. The horizontal and vertical axes are measured in units of the parameter z, and the branches are plotted at 45-degree angles. We have visualized the first 7 eigenvectors for brevity of presentation.
the evaluation and comparison with existing methods to identify coherent structures 3,11,14,15 .
As shown in Fig 2A, drifter trajectories within the two eddies at the upper-left and lower-right rotate clockwise, whereas trajectories within the other two eddies rotate counter-clockwise. Simultaneous with this rotation, an east-west oscillation of the eddy field occurs, which causes exchange of drifters between the eastern and western eddies. This exchange, which depends on the location and timing of the drifter release relative to the east-west oscillation cycle, is illustrated in the transition from initial drifter positions in Fig 2B to  show the incoherent region. The latter two branches also display an imbalance between the top and bottom regions of the flow. (B) Three selected branches from the Bickley jet analysis are visualized by plotting the particle trajectories belonging to a designated coherent structure (gray) and their starting positions (black). Branch 0 shows the eddy cores, which do not mix. Branch 10 contains the meandering jet, and branch 11 accounts for the incoherent surroundings. For both sets of plots, the starting positions show that the particles belonging to coherent structures at a given time point (in this case, the starting point) are more compactly located than the total space explored by their trajectories over time. A particle found at a given instant in the space that is common among different coherent structures can therefore not be attributed to a coherent structure based on that time point alone. Visually equivalent results can be produced from other time points.
dynamic system, and the pairwise dissimilarity between each of the states is given by the standard deviation of the instantaneous distance between drifter positions at time t k , r ij (t k ), divided by the average distance between each pair of drifters, r ij , for T total time points 14 : This measure anticipates that coherent structures will comprise drifters whose relative positions do not vary as the flow evolves, leading to a small values of the pairwise dissimilarity measure (i.e. a small standard deviation) within each cluster. By contrast, pairs of drifters that straddle the boundary between coherent structures can experience exponential separation over time and a correspondingly large standard deviation of their instantaneous separation 26 .
Without requiring the specification of the number of eddies, the sCSC method reveals a clear, physically interpretable structure for this complex flow (Fig 2D). The primary bifurcation of the flow is between trajectories that remain in the eddy cores of their original quadrant (branch 0) and trajectories that do not (branch 1). The trajectories of branch 0 are then further subdivided into trajectories that remain within eddy cores in the northern half of the flow (branch 00) and those that remain within eddy cores in the southern half of the flow (branch 01), reflecting the absence of north-south drifter exchange. Finally, the trajectories associated with the individual quadrants are identified at the level of the third bifurcation (e.g. branch 000 shown in Fig 2D inset, as well as branches 001, 010, and 011 for the other three individual quadrants, not shown in inset). An additional visualization of the major coherent structures identified-namely, branches 00, 01, 10, and 11 in Fig 2D-is presented in Fig 3A. Whereas the application of k-means clustering or other conventional tools would require a priori guidance to determine that four independent structures exist in branch 0 (i.e. one eddy per quadrant) 12 , this result is revealed naturally by the sCSC dendrogram, as further bifurcations after branch 000 do not produce additional coherent states; all of the trajectories that remain together after the third bifurcation remain together after subsequent bifurcation.
To be sure, the presence of the four eddy cores can also be revealed by a contour map of the largest finite-time Lyapunov exponent (FTLE) corresponding to the quadruple-eddy velocity field (see Figs. 2 and 4 in Ref. 14). The key advantage of the sCSC approach is that a similar result can be achieved with two orders-ofmagnitude less data: Schlueter-Kuck and Dabiri showed in Ref. 14 that the FTLE gradient calculation is wellposed when the number of drifters is on the order of 10 5 By contrast, the same cores can be identified by as few as 300 drifters using the present method, and the cores can be identified as long as drifters are present in the cores over timescales longer than the eddy turnover time.
The structure of branch 1 is less well organized and reflects the chaotic advection of trajectories that spiral radially within a quadrant and/or switch quadrants in the unsteady flow. Nonetheless, the dendrogram structure does indicate geometric symmetries within the chaotic motions, such as a preference for three quadrants among the trajectories in branches 110 and 111; and a more constrained preference for two quadrants exists at branch 1110. A general observation is that geometric symmetries appear as balanced dendrogram bifurcations. This is in contrast to the structure of random noise, which is characterized by a trivial sCSC dendrogram with a single branch that contains nearly all of the states and a splintering of a small number of fully-converged states at each level of the dendrogram (see Fig. S1).

Bickley jet simulated atmospheric flow
A more complex geophysical flow model is the Bickley jet, which serves as a common model for zonal jets in the atmosphere 40 . This flow is composed of a central meandering jet as well as flanking eddies that are periodic along the east-west axis (Fig 4A). The sCSC dendrogram corresponding to this flow (for the same dissimilarity measure as the quadruple-eddy flow, equation (3)) is similarly effective in extracting the salient coherent features (Fig 4D). The flanking eddies are identified in branch 0.
However, a key difference from the previous quadrupleeddy example is that the individual eddies are largely indistinguishable from one another. This result reflects the homogeneity of fluid dynamics within the flanking eddies, which was not present among the trajectories in the quadruple-eddy flow (contrast e.g. Figs. 2C and 4C). A notable exception is the eddy located at the meridional axis of symmetry, i.e. branch 011. An additional analysis tracking the distance of particles from the eddy cores showed that fewer particles belonging to this center eddy travel past a given contour threshold during the simulated time-series than particles from other eddies. Branch 1 of the Bickley jet dendrogram collects those trajectories that are not associated with the flanking eddies. A subset of those trajectories, namely branch 10, is the meandering zonal jet. The remaining trajectories (branch 11) form a chaotic background flow that is robust to further bifurcation. These three coherent structures are further visualized in Fig 3B. The sCSC structure of both of these simulated geophysical flows can be exploited to create low-order models of the governing fluid transport processes, without the need for ad hoc assumptions regarding the number of coherent structures present. Because similar results can be achieved despite significant missing or noisy data (see Ref. 14), the inherently limited data collection that can be achieved in the ocean and atmosphere can be more effectively leveraged to potentially improve the accuracy of weather forecasting, for example 37 . Hence, the sCSC method can be a powerful tool for both very large and very sparse datasets.

Empirical measurement of vortex ring formation and entrainment
Vortex ring formation is a prominent phenomenon in engineered and biological systems as diverse as aerodynamic flow control, animal swimming, and the human cardiovascular system [41][42][43] . The growth and dynamics of vortex rings are dictated by the extent to which they entrain surrounding fluid 44 . Moreover, knowledge of the precise region of the flow that is ultimately entrained by a forming vortex ring can be used to predict how a vortex delivers mass, momentum, and energy to the surrounding flow. For example, pathological vortex ring formation in the human left ventricle has been shown to provide an effective diagnostic of heart failure 43 . Despite the importance of vortex ring entrainment, methods to quantify the region of the flow impacted by vortex rings have shown limited success, particularly in cases for which the FTLE field cannot be calculated due to the sparsity of measurements. Here, we demonstrate the ability of the sCSC technique to precisely identify the region of a flow that is entrained by a forming vortex ring-knowledge that has been previously inaccessible in cases where measurement data is sparse, such as when the flow is interrogated using non-invasive clinical methods such as ultrasound or magnetic resonance imaging.
Vortex rings were formed in the laboratory using a piston-cylinder apparatus described in previous work 45 . A motor-driven piston pushes water through a vertical hollow cylinder of diameter D = 2.49 cm that is submerged in a tank with cross-sectional area of 61 cm by 61 cm and height of 91 cm. As the flow exits the cylinder at a nominal speed of 7 cm s −1 , the fluid boundary layer at the inner surface of the cylinder rolls up into a toroidal vortex ring, which propagates away from the cylinder via self-induction.
A set of 1174 fluid particle trajectories in the domain encountered by the vortex ring were analyzed using the present sCSC method and the dissimilarity measure in equation (3) to identify regions of the ambient flow that , with every position occupied by all 3000 particles plotted in black dots in the corresponding inset branch plot (note that particle positions often appear as continuous black patches due to the high density of overlapping positions occupied by the particles). The width of each branch is proportional to the fraction of the states that it contains. The corresponding binary code of each branch is labeled in black text, and the number of trajectories associated with each node is labeled in red text. The dendrogram is plotted to the seventh eigenvector, although many of the labels are omitted for clarity. The horizontal and vertical axes are measured in units of the parameter z, and the branches are plotted at 45-degree angles. We have visualized the first 7 eigenvectors for brevity of presentation.
were entrained by the vortex ring. As illustrated in Fig. 5A, it is impossible to determine which fluid particles have been entrained by the vortex ring based on visual inspection of the trajectories alone. A comparison FTLE analysis performed by Schlueter-Kuck and Dabiri on 30,500 advected particles (see Ref. 14, Fig. 9) showed that 1174 trajectories are not sufficiently close to one another to compute the FTLE field, because the required gradient calculations are not well-posed for sparse trajectories. The alternative use of existing techniques based on heuristics, such as k-means or the spectral eigengap, rely on knowledge of the number of eddies to guide clustering; in the present case, it is not known a priori how many structures comprise the flow.
The sCSC dendrogram (Fig. 5D)  ambiguously identifies the fluid particles entrained by the vortex ring as those belonging to branch 0. Branch 1 identifies all other particles and further bifurcations of that branch reveal underlying geometric symmetries, as in Branch 1 of the quadruple-eddy flow in Fig. 2D.
Plots of the initial and final positions of the fluid particles in Fig. 5B and C show that the fluid entrained by the vortex ring occupies a well-defined region in the immediate path of the vortex ring, a result that is consistent with intuition but that can now be characterized quantitatively for the first time. The void created by the evolution of the blue particles from Fig. 5B to Fig. 5C is filled by the fluid ejected from the cylinder. The entrained blue particles ultimately occupy positions around the vortex ring that are consistent with the FTLE analysis in Ref. 14. This provides another demonstration of the interpretability of the sCSC results: notably, these result have been achieved without any of the ad hoc assumptions required by existing methods of entrainment quantification 44,46 .

VISUALIZING MACROSTATE MODELING OF MOLECULAR DYNAMICS
In this section we highlight the interpretability of the sCSC dendrogram for a high-dimensional dynamical dataset. Specifically, we focus on an atomistic simula-tion of protein folding. Whereas fluid dynamics datasets typical represent only a few spatiotemporal coordinates, atomistic molecular dynamics (MD) datasets can contain thousands of degrees of freedom with complex interrelationships.
While MD is resource-intensive, advances in simulation parameters 47 , bespoke hardware 48 , and distributed computing frameworks 49 , have enabled MD analyses to yield insight into complex biophysical systems at biologically meaningful timescales 50 . Thus, these simulations have the potential to uncover biophysical phenomena such as the misfolding mechanisms involved in a variety of diseases, stable configurations yet undiscovered by crystallography, and small molecule binding sites and kinetics for drug discovery.
However, without complementary analysis methods designed to communicate statistically rigorous and understandable conclusions resulting from such computational experiments, the benefits of advances in MD cannot be fully realized. While many methods have been developed to perform these analyses 50 , it remains a challenge to display their results in a meaningful way. sCSC can be used to augment already-existing methods for analyzing MD simulations such that the results can be visualized and interpreted.
To demonstrate the use of sCSC to visualize an MD analysis, we use an ultralong MD simulation performed by Lindorff-Larsen et al. 51 of the folding and unfolding of Protein G, a 56-residue protein expressed in streptococcal bacteria. The simulation details are described in the Supporting Materials of Ref. 51. We use a Markov state model (MSM) analysis to define the states of the system, which is a discrete approximation to the Perron-Frobenius operator 20 . This established mathematical framework codifies the system using a kinetic master equation 50 . The master equation takes the form of a stochastic transition probability matrix, in which each state of the system is identified by a probability distribution of transitioning to every other state.
After constructing a quantitatively accurate and optimized MSM 22 , we are interested in clustering these state into a smaller number of interpretable "macrostates", since it is conceptually difficult to describe hundreds unique states in a physically interpretable way 52 .
For our MSM, we found that 175 states optimally describes the system according to a variational evaluation (the MSM construction protocol is consistent with current best practices and is described in detail in the Methods). Minimum variance clustering analysis (MVCA), an effective coarse-graining method for MSMs, has recently been developed by one of the authors and uses a pairwise information theoretic dissimilarity metric in order to group states into a smaller number of macrostates, namely, the Jensen-Shannon divergence between the probability distributions characterizing the rows of the MSM transition probability matrix 28 (see also Methods equation (4)).
By using the same pairwise dissimilarity metric as MVCA, the 175 2 state adjacencies can be input into the sCSC algorithm to produce a visualization of a set of macrostates in the protein folding dataset, which is displayed in Fig. 6. Nine branches of the sCSC dendrogram are depicted by sampling one structure from each original MSM state contained in that branch. Since the nine depicted branches contain all 175 original MSM states, these branches can be interpreted as a possible set of system macrostates. By superimposing a representative conformation from each MSM state and coloring the protein according to its secondary structure, we can visualize the MD trajectory by interpreting the sCSC groupings.
First, we note that the folded structure (branch 0) is identified in the first sCSC solution and is separated from the denatured, unfolded ensemble, which comprises the rest of the dendrogram (branch 1). We see that the folded branch isolates a well-defined conformation with low variance across sampled conformations. The incorporation of subsequent sCSC eigenvectors identifies groups of structures unified by their protein secondary structure features. Various branches contain similar secondary structure elements (similar colors in the structure visualization in Fig. 6), elucidating substructures exhibited during the folding of Protein G. For example, branch 1110 contains β-sheet secondary structure (yellow), whereas branch 11110 contains noticeable α-helical secondary structure (pink). Branch 1111111 is the least coherent, containing the most unstructured states. Summary statistics for each macrostate can be found in Table S1.
In this example, we have chosen to highlight secondary structure changes so we can understand which secondary structure elements characterize different subprocesses within folding. We see that the yellow β-sheet secondary structure appears in several macrostatesoften along with the blue 3 10 -helix, thought to be an intermediate structure during α-helix formation 53 -which might indicate that the formation of the pink α-helix is a rate-limiting step in the folding process. However, we could also choose to quantify and visualize macrostate contact maps, radii of gyration, or distance to various structures in order to gain complementary insight into the folding system.
For other dynamical processes characteristic of proteins, such as conformational change, allostery, and drug binding, we might choose to visualize parameters related to specific sites of interest or observables that can be probed experimentally. The choice of how to describe the macrostates is independent of the clustering process, but the depictions or statistics that enable the best interpretation of the system will depend on the dynamics of interest.
The sCSC dendrogram analysis offers advantages in the interpretation of high-dimensional MD datasets after an adequate kinetic model has been constructed. Utilizing the pairwise state adjacency from this kinetic model for an sCSC analysis produces a hierarchical representation of structural motifs according to the extent of their dissimilarity, which provides insight into the protein conformations that characterize subprocesses within folding. As in the fluid dynamics examples in the previous section, truncating the dendrogram when bifurcations are unoccupied produces an objective way to visualize the converged clusters. Finally, the generation of orthogonal sCSC solutions enables orthogonal dynamical processes in the protein folding simulation to be incorporated in analogy to the simple model in Fig 1. We anticipate that this type of interpretable visualization will be useful for communicating the results obtained from highdimensional datasets.

DISCUSSION
The present approach addresses the previously stated challenges with common clustering algorithms: it does not require a choice of cluster number or dendrogram cutting, it leverages the concept of dissimilarity in a computationally tractable way, and it maintains an interpretable hierarchical relationship among splittings.
Perhaps the most important advantage of this approach relative to commonly used tools is that the number, shape, and size of clusters in the data emerges naturally from the sCSC dendrogram rather than being specified a priori. As the set of eigenvectors that is included in the analysis is increased to include those associated with lesser eigenvalues, the number of unoccupied binary codes generally increases. This is because progressively fewer groups of states that survived the preceding orthogonal partitions will be subsequently separated at lower levels of the dendrogram. In this way, the set of clusters in the dataset is revealed to be those branches of the dendrogram structure whose shape converges as the number of eigenvectors included in the analysis increases. The sCSC dendrogram indicates not only the number of these converged clusters but also the relative strength of the partitions between clusters, via the length of the connecting branches in z-space.
While sCSC conceptually resembles divisive hierarchical clustering, the number of possible divisions in the latter scales as 2 c−1 − 1 with the number of clusters c, which is generally not feasible for large c unless the initial dataset is sparse 54 . sCSC scales in the same way as agglomerative hierarchical clustering, requiring a computationally nontrivial but tractable calculation of n 2 dissimilarity values for n initial data points. However, unlike agglomerative clustering-which also requires the calculation of n 2 dissimilarities-small differences between states at lower levels of the sCSC dendrogram have no impact on the clusters that form at higher levels, as the top-down approach begins by using the most significant partitions indicated by the eigenvectors associated with the largest eigenvalues.
When applying sCSC, domain knowledge should inform selection of an appropriate dissimilarity measure, but ad hoc and a priori assumptions about the structure of the data itself are not needed. While we have demonstrated sCSC only for simulated physical systems, we anticipate that these features will make sCSC a powerful tool for interrogating both new and longstanding research problems, including those in fields where the underlying processes are less accessible, such as genomics and neuroscience. For example, genetic ancestries can potentially be clustered on the basis of the sCSC structure that emerges from the dissimilarity of single-nucleotide polymorphisms (SNPs) among individuals within a population. In the latter case, differences in neuronal activation can be amplified using sCSC to identify emergent functions that involve coordination of spatially distant neurons. These and other applications can be pursued immediately given the tools developed here.

Markov state models
Markov state models (MSMs) are a kinetic master equation framework for describing and analyzing timeseries data such as molecular dynamics (MD) simulations by approximating the continuous Perron-Frobenius operator using a discrete transition probability matrix 20 . A MSM requires partitioning the phase space explored by a system into discrete states (henceforth "microstates"), and is represented by a transition probability matrix defined for a Markovian lag time τ at which transitions between the microstates are independent of the history of the system. For protein folding analyses, phase space (positions and velocities) is conventionally approximated by conformation space (positions), and states are chosen according to an objective optimization protocol, in this case a variational principle 22 . The Markovian lag time chosen to analyze a system must be long enough for memoryless inter-state transitions, but short enough to resolve dynamics; for protein folding dynamics, lag times on the order of tens of nanoseconds are typical.
The MSM transition probability matrix is constrained to be stochastic, symmetric with respect to a stationary distribution, ergodic, and aperiodic. It can thus be decomposed into eigenvalues and eigenvectors, T (τ )λ = ψλ, where the eigenvalues are on the unit interval |λ i | ≤ 1 and the highest eigenvalue λ 1 = 1 is unique. The variational principle states that the sum of estimated eigenvalues is bounded from above by the sum of true eigenvalues; thus many state decompositions can be tested according to the sum of a set number of eigenvalues for a set Markovian lag time and the state decomposition resulting in the highest sum of approximated eigenvalues can be chosen for further analysis.
The MSM for the simulation analyzed in this work was constructed according to the protocol used in Ref. 55 for a set lag time of 50 ns according to a previous analysis for the same system performed in Ref. 56. First, the Cartesian coordinates from the raw simulation data are transformed into the sines and cosines of the φ and ψ side chain dihedral angles for each amino acid of the protein. Next, the vector of dihedrals is again transformed using time structure-based independent component analysis (tICA) 57 with a tICA lag time of 50 ns wherein each tICA solution vector was weighted according to its associated eigenvalue 58 . Then, mini-batch k-means was used to cluster the simulation frames according to their weighted tICA representations for 265 different numbers of cluster centers randomly chosen between 10 and 5000. Finally, a MSM was constructed on each k-means state decomposition in which the transition probability matrix is obtained using a maximum likelihood estimator of the data such that detailed balance is achieved. For each model, five MSMs were fit to a randomly chosen half of the data and then applied to the other half of the data, and the latter was used to sum the first 50 MSM eigenvalues as that model's score. The winning model was chosen to be the one that achieved the single maximum score from parameter sets with mean scores within one standard deviation of the maximum mean score. For our analysis of 265 different microstate numbers, the best model according to this variational analysis had 175 microstates and was used for analysis in the main text.

Coarse-graining MSMs with MVCA
Minimum variance clustering analysis (MVCA) was recently published by one of the authors as an algorithm for coarse-graining an MSM transition probability matrix into a smaller number of macrostates by grouping the original microstates 28 . MVCA achieves a coarsegrained model by using agglomerative hierarchical clustering with Ward's minimum variance method 59 to cluster the microstates, where the pairwise dissimilarity between microstates is quantified using an information theoretic measure between the probability distribution char-acterized by the corresponding row of the MSM transition matrix.
If two microstates are defined by transition probability distributions P and Q, their pairwise dissimilarity can be written using the Jensen-Shannon divergence 27 , div JS (P ||Q) = 1 2 where M is the elementwise mean of P and Q, and each term is the Kullback-Leibler divergence to the mean. We quantify the dissimilarity between microstates using the square root of equation 4 28,60 . From this set of pairwise similarities, MVCA goes on to cluster the microstates using agglomerative hierarchical clustering with Ward's method. In the analysis presented in this work, the set of pairwise dissimilarities is instead used to construct the adjacency matrix for sCSC.

DATA AVAILABILITY
The three fluid mechanics datasets and all adjacency matrices used to create the models in this work are available on github at https://github.com/ brookehus/sCSC. This repository also contains example MATLAB and Python codes, including Jupyter notebook tutorials. The all-atom molecular dynamics simulations of Protein G were previously published in Ref. 51, and the trajectories are available at no cost for non-commercial use through contacting trajectories@deshawresearch.com.
FIG. S1. sCSC dendrogram applied to random data. To evaluate the sCSC dendrogram structure resulting from random noise, an adjacency matrix was constructed based on 3000 two-dimensional trajectories whose instantaneous positions over 2000 time steps were selected randomly from uniform distributions over the spatial coordinate intervals x = (0, 1), y = (0, 1). These states were analyzed using pairwise dissimilarity based on the normalized standard deviation. The result is a single main branch with a small splintering of trajectories at each eigenvector level. The splintering at each level converges throughout the seven eigenvectors included in the analysis. The width of each branch is proportional to the fraction of the states that it contains. The corresponding binary code of each branch is labeled in black text, and the number of trajectories associated with each node is labeled in red text.