Topological Data Analysis of Biological Aggregation Models

We apply tools from topological data analysis to two mathematical models inspired by biological aggregations such as bird flocks, fish schools, and insect swarms. Our data consists of numerical simulation output from the models of Vicsek and D'Orsogna. These models are dynamical systems describing the movement of agents who interact via alignment, attraction, and/or repulsion. Each simulation time frame is a point cloud in position-velocity space. We analyze the topological structure of these point clouds, interpreting the persistent homology by calculating the first few Betti numbers. These Betti numbers count connected components, topological circles, and trapped volumes present in the data. To interpret our results, we introduce a visualization that displays Betti numbers over simulation time and topological persistence scale. We compare our topological results to order parameters typically used to quantify the global behavior of aggregations, such as polarization and angular momentum. The topological calculations reveal events and structure not captured by the order parameters.

forces in the model are social forces between two fish-namely attraction and repulsion-and are described by a simple functional form dependent on the distance between individuals, akin to gravitational or intermolecular forces in physics. Since then, hundreds of aggregation models have been created; some of the most well-studied include [6,11,12]. The common approach is to envision organisms as point particles with motion laws that are first or second order in time, and with behavioral rules that are some combination of self-propulsion and/or social alignment, attraction, and repulsion. An alternative modeling approach, also the subject of a rich literature, is to treat a sufficiently large population as a continuum and describe its dynamics with a partial integrodifferential equation as in, e.g., [13][14][15][16][17]. The discrete and continuum models that we have mentioned here are largely phenomenological. They are minimal models inspired to greater or lesser degree by biological observation, and their minimalism allows one to understand the basic behaviors necessary or sufficient for a particular type of aggregation phenomenon.
Quantitative understanding of aggregations has also been developed through the exploration and modeling of rich data sets measured in the field or in experiment. This type of study has a much more recent history, as the technology necessary to gather and process large, accurate data sets did not exist several decades ago. Notable examples include data-based modeling of starlings [18], ducks [19], aphids [20], golden shiner fish [21], and desert locusts [22]. In data based studies, one typically collects time series of organisms' positions, and possibly, velocities. One may use this data in two different ways: to infer the rules for motion that each individual follows and/or to characterize the collective dynamics of the group.
In a classical approach to characterizing collective dynamics, one begins with N organisms' positions x i and velocities v i , either from biological observation or numerical simulation of a model. One then calculates some global metric hoped to give insight into macroscopic dynamics. For instance, [6] simulates discrete swarmers who interact via attraction, repulsion, and alignment, and measures the group polarization P and angular momentum M ang , where r i = x i − x cm and x cm is the center of mass of the group. By varying parameters that control the social interactions and plotting P and M ang , [6] identifies different regimes of behavior, including swarming, motion on a torus, and a highly parallel (polarized) group. A different aggregation model, from [12], can produce a rotating annulus of individuals (a "single mill") or superposed, counterrotating annuli (a "double mill"). Metrics P and M ang cannot distinguish a single from a double mill, so [12] introduces the absolute angular momentum whose numerator differs from M ang . Together, M ang and M abs can distinguish single from double mills. Other examples of metrics include the average number of neighbors with whom an individual interacts (which requires knowledge of interaction rules) and the mean distance to nearest neighbor [23].
Our discussion here provides a sampling of metrics in the literature. Most are inspired by order parameters from physics, and many have been constructed a posteriori, based on knowledge of the dynamic whose detection was desired. In our present work, we explore whether topology offers a natural way to characterize collective behavior. In brief, we use the methods of topological data analysis to compute the persistent homology of spatiotemporal aggregation data sets arising from numerical simulation of models. We introduce a new visualization to explore homological persistence over spatial scales and over time. As we will explain at length, this visualization displays a data point cloud's Betti numbers in a contour diagram. We refer to this plot as the Contour Realization Of Computed k-dimensional hole Evolution in the Rips complex (CROCKER). We show that our topological analysis reveals dynamical events not captured in time series of the classically studied order parameters. Our primary goal is to demonstrate the utility of topological data analysis for biological aggregations and similar applications.
We note that the term topology has been invoked in the aggregation literature in a different sense than we use it here. In both the biological modeling literature and the robotics literature, "topology" sometimes refers to the coupling scheme between agents, that is, by which members of the group a given individual is influenced [18,24]. For instance, in some models or algorithms, a given agent might be influenced by the closest agent, or the closest two agents, or the closest few agents within an agent's field of view. In this sense, topology is an input of the model. In a different sense, we are using topological tools to analyze the outputs of a model. The rest of this paper is organized as follows. We begin with an overview of persistent homology. This discussion aims to present some of the key concepts to a mathematical reader unfamiliar with algebraic topology. Brief descriptions of computational methods and our data visualization follow. Then, we proceed to topological analyses of the models of Vicsek, et al. [11] and D'Orsogna, et al. [25] before concluding.

Topological Data Analysis and Persistent Homology
Homology is a tool from algebraic topology that measures the features of a topological space such as an annulus, sphere, torus, or more complicated surface or manifold. In particular, homology can distinguish these spaces from one another by quantifying their connected components, topological circles, trapped volumes, and so forth. A finite set of data points can be viewed as a (noisy) sampling from an underlying topological space. One can measure the homology of the data by creating connections between proximate data points, varying the scale over which these connections are made, and looking for features that persist across scales. This is called persistent homology. Persistent homology has been used in a wide array of applications to uncover the topological structure of data, including neuroscience, language processing, natural images, signal analysis, bioinformatics, computer vision, and sensor networks [26][27][28][29][30][31][32].
We explain these ideas in greater detail for the remainder of this section. Our discussions in the first two subsections below recapitulate presentations in texts such as [33,34].

Forming a Simplicial Complex
To build a global object from a discrete set of N data points, we construct a simplicial complex S. A simplicial complex is a set consisting of a finite collection of k-simplices (simple pieces), where a 0-simplex is a vertex, a 1-simplex is an edge, a 2-simplex is a triangle, a 3-simplex is a tetrahedon, and so on; see Fig 1. These simplices satisfy two properties. First, for every set σ in S, every non-empty subset τ σ also belongs in S. For instance, if tetrahedron abcd is in S, then the triangles abc, abd, acd, bcd, the edges ab, ac, ab and the vertices a, b, c, d are also in S. Second, two k-simplices are either disjoint or they intersect in a lower dimensional simplex.
To form k-simplices, we use the Vietoris-Rips complex, sometimes simply called the Rips complex. To build this complex, one first defines a distance metric which can be realized as a symmetric N × N matrix of pairwise distances between points. For each ε > 0, called the proximity parameter, we construct a simplicial complex S ε in the following way. In S ε , every collection of k+1 data points is a k-simplex if the pairwise distance between points is less than ε.
Thus, the 0-simplices are the data points themselves. A 1-simplex (an edge) is formed whenever two points are within ε of one another. A 2-simplex (a triangle) is formed whenever three points are pairwise within ε of one another; this occurs when there is a 3-cycle in the underlying graph formed by the vertices and edges in S ε (this graph is sometimes called the 1-skeleton of S ε ). A 3-simplex (a tetrahedron) is formed whenever four points are pairwise within ε of one another. See Fig 2, in which the yellow circles represent ε/2 balls so that two vertices are connected by an edge if their ε/2-balls intersect.
For purposes of homology, it is necessary to impose an orientation on the vertices of each ksimplex. A k-simplex [v 0 , v 1 , . . ., v k ] of k+1 data points is ordered such that if elements are permuted with an odd permutation, the order is negated. Thus, There are other methods one could use to form a simplicial complex. The Rips complex is the flag or clique complex that is the maximal simplicial complex built from the underlying  graph. Another commonly used method is the Cech complex, where k-simplices are formed if the ε/2-balls around k+1 vertices have a nonempty intersection. We use the Rips complex since it is more computationally tractable than the Cech complex; see the discussion in Section 1.3 of [35]. The witness complex is another method to form a simplicial complex. It is particularly useful with large data sets as it subsamples the data in a way that uses information from the entire data set; see [36].

Homology
Homology is a way to uncover k-dimensional "holes" in a simplicial complex. This requires imposing an algebraic structure on the simplicial complex S ε . For each k ! 0, we create an abstract vector space C k with basis consisting of the set of k-simplices in S ε , so that the dimension of C k equals the number of k-simplices. The elements of C k are called k-chains. In practical computations, the coefficients of this vector space come from a finite field Z p (integers modulo p) for a small prime p. These vector spaces consist of all formal linear combinations c = ∑ i a i σ i , where a i 2 Z p and the sum is over all k-simplices σ i in S ε .
To compute homology, one must be able to describe the boundary of a k-simplex algebraically. The boundary of a k-simplex σ is the union of the (k − 1)-subsimplices τ σ. For each k ! 1, the boundary map @ k : where ½v 0 ; . . . ;v i ; . . . ; v k is the (k − 1)-simplex obtained from [v 0 , . . ., v k ] by removing the vertex v i . For example, Observe that for a k-simplex σ, @ k (σ) is an algebraic representation of its boundary. For instance, in the second example above, [v 0 , v 1 , v 2 ] represents a triangle and [v 1 , are the oriented edges that form its boundary. Boundary operators connect the vector spaces C k into a chain complex, Consider the following two subspaces of C k , which are determined by the kernel and the image of the boundary operators, kÀcycles : The boundary operator satisfies the following fundamental property: @ k @ k+1 = 0. That is, "a boundary has no boundary". For example, It then follows that B k is a subspace of Z k . Thus, C k is the vector space of all k-chains in the simplicial complex S ε , Z k is the subspace of C k consisting of k-chains that are also k-cycles, and B k is the subspace of Z k consisting of k-cycles that are also k-boundaries.
The goal of homology is to "discard" cycles that are also boundaries. To this end, we put an equivalence relation on Z k as follows. Two cycles z 1 ; z 2 2 Z k are homologous (equivalent), written z 1 * z 2 , if they differ by a boundary, i.e., z 1 À z 2 2 B k . Consider the example in  4 , v 1 ] are cycles because @ 1 (b) = @ 1 (r) = 0. These cycles are homologous because their difference is a (green) boundary, The equivalence relation * defined above partitions the k-cycles Z k into a union of disjoint subsets, called homology classes. We let [z] denote the homology class of z 2 Z k and define the kth homology of S ε as the set of homology classes Algebraically, H k ¼ Z k =B k , a quotient of vector spaces. The kth Betti number, b k , is defined as the dimension In terms of boundary operators, b k ¼ ½n k À rankð@ k Þ À rankð@ kþ1 Þ, where n k is the dimension of the vector space C k . In terms of the topological characteristics one might hope to measure, b k equals the number of independent holes of dimension k, and this is the key point for our analysis later in this paper. For instance, b 0 is the number of connected components, b 1 is the number of topological circles, b 2 is the number of trapped volumes, and so on. The topology of a simplicial complex may be described by the sequence of Betti numbers, b ¼ ðb 0 ; b 1 ; b 2 ; . . .Þ. For instance, a topological circle has b ¼ ð1; 1; 0; . . .Þ, a topological torus has b ¼ ð1; 2; 1; 0; . . .Þ, and a topological sphere has b ¼ ð1; 0; 1; 0; . . .Þ. Betti numbers are a topological invariant, meaning that topologically equivalent spaces have the same Betti number. See, e.g., [33,34] for these and other examples.

Persistence
Given a collection of N data points, the resulting Rips complex and its homology are highly dependent on the choice of proximity parameter ε. For small values of ε, the simplicial complex consists of isolated vertices. At the largest value of ε shown, the entire data set is a single connected component. As ε changes, other topological events occur which we will describe momentarily. A natural question is what is the optimal ε to use for any data set. This can hardly be selected without a priori knowledge of the underlying space.
To reconcile this ambiguity, one exploits the fact that as ε grows so do the Rips complexes, giving an inclusion of complexes for small ε into those for larger values. That is, if ε 1 ε 2 This sequence is called a filtration. "Persistent" homology, then, tracks topological features which persist across a range of values of ε. Those features which persist over a large range are considered signals of the underlying topology, while the short lived features are taken to be noise inherent in approximating a topological space with a finite sample. Foundational papers on this methodology include [35,37,38]. A convenient way to visualize persistent homology is through a graphical representation called a barcode. There is a distinct barcode for each homology space H k from which we infer the Betti number b k . As an example, see Fig 4. The horizontal axis corresponds to the proximity parameter ε, and the vertical axis is an (arbitrary) ordering of the homology generators, i.e., the distinct homology classes of dimension k. Each homology class is visualized by a bar that persists for a given range of ε. Its leftmost endpoint is at the ε value at which the homology class forms, and its rightmost endpoint is the ε value at which it disappears. At any given ε, the Betti number b k ðεÞ is the number of bars that intersect the vertical line through ε. Those bars which persist over longer intervals generally correspond to real topological features, whereas short bars are considered noise. In the figure, we see the following sequence of Betti numbers. For ε = 1.5, b ¼ ð18; 0; . . .Þ because there are no connections amongst the 18 vertices. For ε = 5, b ¼ ð11; 0; . . .Þ, reflecting the fact that some vertices have joined into larger connected components. For ε = 7, b ¼ ð4; 1; 0; . . .Þ, reflecting even further joining of components as well as the formation of one topological circle. Finally, for ε = 9.5, b ¼ ð1; 2; 0; . . .Þ, meaning that all of the data has joined into one connected component that contains two topological circles. Another method of displaying homological information is through a persistence diagram. Each bar in the barcode is represented in the Cartesian plane with the horizontal and vertical axes encoding the leftmost and rightmost ε values of the bar. Points near the diagonal are inferred to be noise while points further from the diagonal are considered topological signal. Fig  5(D) shows the persistence diagram corresponding to two barcodes in panels (B) and (C). This data is generated by a biological aggregation model which we discuss later.
Because our data sets are obtained from numerical simulation, they are noiseless. Still, one might wonder whether small perturbations of data would impact the topological features that are measured. For biological aggregations, this question would be especially relevant in analyzing experimental data, for which measurement error might introduce noise. As shown in [39], small perturbations of data result in small perturbations of persistence diagrams, indicating stability of topological features.

Computational Methods and Data Visualization
In the next two sections, we will present topological analyses of simulation output from the two aggregation models of [11] and [25]. We perform the simulations in MATLAB. For the Vicsek model [11], as we describe in more detail in the next section, the simulation output consists of the two dimensional position and the angular heading of each agent in a group of interacting agents. The physical domain is a square with sides of length ℓ and periodic boundary conditions. Heading is defined on [0, 2π). To avoid issues arising from disparities between the position coordinates and the heading coordinate, we rescale position coordinates, multiplying them by 2π/ℓ. For the D'Orsogna model [25], the simulation output consists of the two dimensional position and two dimensional velocity of each agent in an interacting group. The spatial domain is an unbounded plane and velocity is (theoretically) unbounded, so we perform no rescaling of coordinates.
The computational complexity of computing b k depends on the number of k-simplices. For n data points, the number of k-simplices is at most n kþ1 ¼ Oðn kþ1 Þ; of course, the actual number depends on the proximity parameter ε and the configuration of the data. Once the simplicial complex is constructed, computing homology over a field reduces to methods in linear algebra. The boundary operator @ k : C k ! C kÀ1 is the linear transformation realized as an integer matrix with entries {−1, 0, 1} over the basis elements of each vector space. The null space of this matrix corresponds to Z k , and the range space corresponds to B kÀ1 . The computational algorithm uses Gaussian elimination over a finite field, which is at worst case O(m 3 ), where m is the actual number of k-simplices. One needs efficient algorithms when computing over large data sets and recording homological information over a range of proximity parameter ε. See [40][41][42][43] for implementations and algorithms used in computing persistent homology. The development of faster algorithms is an active area of investigation.
To extract topological information, we process simulation data in the statistical computing environment R. Each time step of the simulation consists of a static point cloud of data in position-heading or position-velocity space. We use the phom package [44] to construct the Rips complex and calculate the topological barcodes of the point cloud. More specifically, we calculate the first two Betti numbers b 0 ðεÞ and b 1 ðεÞ where the proximity parameter ε takes on discrete but closely-spaced values. Calculating b k ðεÞ for k ! 2 is computationally costly, and we do this only for selected simulation snapshots.
Because we have a series of simulation time steps, we introduce a visualization that captures homological persistence over both scale ε and time t. That is, we now imagine the kth Betti number as b k ðε; tÞ, a function of the proximity parameter ε > 0 and simulation time t ! 0. A natural way to display b k ðε; tÞ is as a contour diagram, which we refer to as a Contour Realization Of Computed k-dimensional hole Evolution in the Rips complex (CROCKER). To facilitate visual interpretation of the contour diagram, we make two simplifications. First, we do not include every time step, but rather only every jth time step, where j is chosen to preserve at least several hundred snapshots per simulation. We found that this downsampling did not noticeably alter the appearance of our CROCKER plots. Second, focusing only on the most coarse and coherent topological structures, we only plot level curves for when b k ðε; tÞ < 5. All of the topological data for which b k ðε; tÞ ! 5 is lumped together in the contour diagram; that is, we do not draw distinct contours for b k ðε; tÞ ! 5. For the models we study, these regions of the contour diagram typically represent many topological structures that do not persist over scales.
As an example, consider Fig 6. Panels (B) and (C) show contour diagrams of b 0 ðε; tÞ and b 1 ðε; tÞ for a particular simulation of the Vicsek model (described and analyzed later). In these plots, simulation time t appears along the horizontal axis and the topological proximity parameter ε appears on the vertical axis. Focus first on panel (B), which shows b 0 ðε; tÞ. Below the purple (lowest) contour, b 0 ðε; tÞ ! 5, and thus in this region, the point cloud has many connected components; we interpret these as noise. Above the yellow (top) contour, b 0 ðε; tÞ ¼ 1, demonstrating that at large enough ε, the entire point cloud of data joins together into one connected component. Less trivially, there are regions between the intermediate contours that persist over scale and time. For example, centered near t = 2000 there is a somewhat triangular region between the yellow and green contours. In this region, b 0 ðε; tÞ ¼ 2, showing a strong, persistent signal of two connected components in the data. Panel (C) is similar, but shows b 1 ðε; tÞ. Towards the bottom of the diagram, there is an oblong region enclosed by a purple contour. In this region, b 1 ðε; tÞ ! 5, which we again interpret as topological noise. There are two large regions enclosed by a red contour in which b 1 ðε; tÞ ¼ 0, indicating an absence of topological circles. However, there is also a region between red and yellow contours in which b 1 ðε; tÞ ¼ 1, showing a strong signal of a persistent topological circle.
In summary, large regions in the contour diagram (excluding b k ! 5) represent topological features that persist over scale ε and simulation time t. When interpreting the contour diagrams, it is important to remember that the function b k ðε; tÞ inherently takes on only nonnegative integer values.

Analysis of the Vicsek Model
Using persistent homology, we now analyze data generated by aggregation models. One of the most referenced aggregation models is that of Vicsek and collaborators [11], cited thousands of times as of the writing of this manuscript. A complete discussion of results related to this model is beyond our present scope. The review paper [45] provides a broad look, and mentions some of the systems that have been described with Vicsek-like models, including cells, bacteria, insects, fish, and birds.
The Vicsek model is a dynamical system in discrete time and continuous space that describes the motion of interacting point particles in a square with periodic boundary conditions. The model appears in the literature written in different forms; we write it as v i ðt þ DtÞ ¼ v 0 ðcos y i ðt þ DtÞ; sin y i ðt þ DtÞÞ; ð9bÞ Here, x i (t) 2 R 2 is the position of particle i = 1, . . ., N at time t, and v i is velocity. We refer to the angle of the velocity vector v i as θ i , the heading. Additionally, v 0 is a constant, U is a uniform random variable on the specified interval, and Δt is the time step. For clarity, let us re-state the model in prose. To update the model, each particle must be given a new heading. This heading is the average of the previous headings of all other particles within a radius R, plus some added noise parameterized by η. With this new heading determined, each particle moves a fixed distance v 0 Δt, thus completing the time step. The model is posed on a square with sides of length ℓ and periodic boundary conditions. For all simulations performed, the initial particle positions and headings are random.
The parameters in the model are the number of particles N, the particle interaction radius R, the noise η, the fixed particle speed v 0 , the box size ℓ, and the time step Δt. One may nondimensionalize the problem to reduce the number of free parameters. We adopt the standard convention R = 1 and Δt = 1. This leaves the remaining parameters N, η, v 0 , and ℓ. Some studies refer to three effective parameters: η, v 0 , and ρ = N/ℓ 2 , a particle density.
Two preliminary matters will build understanding prior to a discussion of results. First, we analyze the topology of an initial condition. Second, we discuss the classic order parameter used in the physics literature to characterize the global dynamics of the system. Fig 5(A) shows a random initial condition for N = 300 particles and a square with sides of length ℓ = 25. This gives rise to a cloud of 300 points whose coordinates are (x, y, θ). However, since the simulation domain is periodic and since θ is an angle, this space is not R 3 but rather S 1 × S 1 × S 1 = T 3 , the three-torus, which has Betti numbers b ¼ ð1; 3; 3; 1; 0; . . .Þ. Panel (B) shows the topological barcode for b 0 ðε; 0Þ, which has one persistent bar. This topological signature indicates no clusters other than the trivial connected component formed by the entire point cloud on the longest scales. Panel (C) shows the barcode for b 1 ðε; 0Þ. There are three persistent bars (which terminate for larger values of ε not shown) representing three topological circles. This signature captures the fact that the point cloud is well-spread over the three-torus due to the randomness of the initial condition. We have also calculated the barcode (not shown) for b 2 ðε; 0Þ, which displays three persistent bars, representing the three trapped volumes of the three-torus. Panel (D) shows the persistence diagram, which combines the information captured in (B) and (C). The particular initial condition we have analyzed is used to seed our first simulation below. The initial conditions of the other two simulations that we will perform are topologically equivalent. Below, we will see that key dynamics involve the formation of nontrivial topological connected components and/or the destruction of topological circles.
As discussed in the introduction, a traditional approach is to characterize global behavior via an order parameter. For (9), the order parameter most often studied is the normalized average velocity of the group, The order parameter 0 φ(t) 1 measures global polarization. For a group of particles moving in approximately the same direction, φ(t) will be near one. If particle headings are spread out randomly, φ(t) will be near zero. As a simple additional example, two groups of particles that are highly polarized within each group but are traveling in opposite directions will also have φ(t) % 0. This simple example sheds some light on the limited information the order parameter carries. The model (9) can display three qualitatively different global behaviors, depending on parameters. We visualize snapshots of these states in Fig 7, which is analogous to Fig 1 of Vicsek's Topological Analysis of Biological Aggregations original paper [11]. When noise η is small and particle density ρ is small, the system tends to form clusters each of which moves in a different direction, as in (A). For higher η and ρ, particles are somewhat correlated but still move randomly, as in (B). Finally, for large ρ but small η, the motion becomes polarized, that is, all particles travel in the same direction, as in (C). For the simulations represented by these three panels, we will now view the global behavior through the lens of the order parameter φ(t) and through a topological lens. The topological analysis gives rich information which is detected neither by the order parameter nor, as we will show, by the eye.
Vicsek Simulation #1 Fig 6 displays the analysis of our first set of simulation results, the same simulation that began with the initial condition in Fig 5(A) and produced the state in Fig 7(A). Here, N = 300, ℓ = 25, and η = 0.1. The order parameter φ(t) increases steeply (with two small plateaus) before leveling off to a value very close to one, signaling a high degree of alignment. After time t = 1000, φ(t) varies little. Fig 6(B) shows b 0 ðε; tÞ, measuring the number of connected components on the three dimensional torus defined by position and heading. Recall that, per our previous discussion, we have displayed only contours for levels five and below. In the large region below the purple (bottom) contour, b 0 ðε; tÞ ! 5. This region represents connected components existing only over small ranges of ε; we interpret these as noise. In the region above the yellow (top) contour, b 0 ðε; tÞ ¼ 1, and all the data forms a single connected component for large ε. The spaces between the other contours reveal clusters that persist over scale and simulation time. For example, there is a triangular region between the green and yellow contours around t % 2000, indicating a strong signal of two clusters.
Further analyzing this region, consider the times marked by the two dashed gray bars, namely t = 2080 and t = 2150. While the order parameter φ(t) (by design) does not detect clusters, at t = 2080, the spaces between the contours reveal successive ranges of ε over which there are four, three, two, and one connected components in the data, with b 0 ðε; 2080Þ ¼ 2 persistent over the largest range of ε, as previously discussed. The corresponding simulation snapshot appears in Fig 8(A). The transition from four to three connected components corresponds to the merging of two small, slightly misaligned groups in the lower left of Fig 8(A) as ε increases. The transition from three to two connected components corresponds to the merging of the two larger groups and so on. The situation for t = 2150 is similar. However, the range of ε over which one counts three connected components has shrunk drastically. The simulation snapshot in Fig 8(B) suggests why. In Fig 8(A), the distance between the two leftmost groups approximately equals the distance between the two rightmost groups. Hence, there does not exist a large range of ε over which there are three clusters. It is important to remember that the data has three coordinates (x, y, θ), but in the discussion above, the key dynamics are in (x, y), consistent with the fact that φ(t) % 1. Fig 6(C) shows b 1 ðε; tÞ, measuring the number of topological circles formed by the data on the three-torus. First, we note that the three topological circles present in the initial condition (described above) are lost on a very short simulation time scale (nearly immediately). In the bottom, oblong region of the graph enclosed by the purple contour, b 1 ðε; tÞ ! 5 which, as before, we interpret as noise. There are two large regions enclosed by red in which there exist no topological circles. However, for approximately t > 1800 and 1.25 ε 2.1 we have b 1 ðε; tÞ ¼ 1, consistent with coverage of the data across one of the circles of the underlying three-torus. This circle is visible in Fig 8 as the vertical swath of data along the right side of each panel. The fact that there is a high degree of alignment removes one of the potential topological circles of the underlying space, and the lack of coverage horizontally across the spatial domain removes the other.
Taken together, panels (B) and (C) of Fig 6 show the following topology. First, b 0 ðε; tÞ reveals that a small number of clusters form persistently over time and scale. The number of clusters is variable as merging and fragmenting occur. Second, b 1 ðε; tÞ reveals that topological circles present in the initial condition are destroyed, but eventually one circle persists, consistent with coverage across one dimension of the simulation domain.
Vicsek Simulation #2   Fig 9 shows results from the same simulation as the snapshot in Fig 7(B). Here, N = 300, ℓ = 7, and η = 2. In panel (A), the order parameter φ(t) appears more noisy than in the previous simulation, and levels off to a value less than one.   Topological Analysis of Biological Aggregations spatial domain and have low alignment, with headings spread across [0, 2π). This state persists until approximately t % 50, at which time we see noisy formation and break up of additional topological circles in the range 1.5 ε 2. At time t % 210, a transition occurs, and we see a clear region where b 1 ðε; tÞ ¼ 2, indicating two persistent topological circles. The high (albeit not complete) alignment of particles visible in φ(t) suggests the lack of a topological circle in the heading coordinate θ. We have calculated for one time step that b 2 ðε; 600Þ ¼ 1. This trapped volume tells us that our data has the topology of a two-torus, that is, b ¼ ð1; 2; 1Þ for the first three Betti numbers. This topology eliminates the possibility that the two topological circles arise from holes in the data, which would have the first three Betti numbers b ¼ ð1; 2; 0Þ. We conclude that the two topological circles correspond to the periodic spatial domain. Neither dimension of this domain is favored, as suggested by the fact that the two topological circles close at the same value of ε, demonstrated by the yellow and red contours at the top of the plot being nearly coincident.
Vicsek Simulation #3 Fig 10 shows results from the same simulation as the snapshot in Fig 7(C). Here, N = 300, ℓ = 5, and η = 0.1. In panel (A), the order parameter φ(t) rises steeply to one with a small intermediate step. The order parameter here and in Fig 6(A) share some features, and yet we will see that the topological calculations capture the differences in the dynamics already apparent in the snapshots of Fig 7.  Fig 10(B) shows b 0 ðε; tÞ, measuring the number of connected components. In the large region below the purple (bottom) contour, b 0 ðε; tÞ ! 5, which we interpret as noise. In the large region at the top, above the yellow contour, b 0 ðε; tÞ ¼ 1, indicating one connected component at the largest scales. The most noticeable intermediate region is in between the yellow and green contours where b 0 ðε; tÞ ¼ 2, indicating a strong signal of two connected components at the scale 0.6 ε 1. We return to a discussion of these two connected components after a discussion of b 1 ðε; tÞ.
Fig 9(C) shows b 1 ðε; tÞ, measuring the number of topological circles. At t % 20, there is a marked transition to b 1 ðε; tÞ ¼ 2 in the upper portion of the contour diagram. Taking into account that the order parameter φ(t) indicates alignment-and thus the lack of a topological circle in the heading dimension-the data is reduced to living on the two-torus of the periodic spatial domain. For the snapshot t = 300, we have computed that b 2 ðε; 300Þ ¼ 0 in the range of ε where b 1 ðε; tÞ ¼ 2. This indicates there are no trapped volumes. The first three Betti numbers b ¼ ð1; 2; 0Þ are those of a punctured torus, and this interpretation is consistent with the hole in the data seen in the upper right quadrant of Fig 7(C).
We now return attention to the region of Fig 10(B) in which b 0 ðε; tÞ ¼ 2 over a range of ε, indicating two connected components. Once the group has achieved strong alignment, agents do not change their relative configuration, and the population travels as a rigid body. Thus, there is potential for any outliers to remain outliers terminally. This is indeed the case in our simulation. There is a single agent in Fig 7(C) (second from the top along the right hand border) who is isolated. It is only at the scale of the distance to its nearest neighbor that the transition to b 0 ðε; tÞ ¼ 1 occurs.

Summary of Vicsek Model Analysis
For both the first and third simulations, the order parameter φ(t)-shown in Figs 6(A) and 10 (A)-increases rapidly to φ(t) % 1 with little variation. However, the topological structures are distinct. The first simulation shows cluster formation and eventual coverage across one dimension of the spatial domain. The third simulation shows one cluster (with an outlier) and coverage across both dimensions of the spatial domain. Curiously, this is the same topological signature as the second simulation, even though the order parameter in 9(A) is noisy and achieves a lower degree of alignment. Thus, the persistent homology computations capture the fact that in some sense, the second and third simulations are topologically equivalent. Also, as previously discussed, these computations reveal dynamical events not discernible in the order parameter, and we find them to be a helpful complement.

Analysis of the D'Orsogna Model
Another model is that of D'Orsogna and collaborators [12,25], cited over 300 times as of the writing of this manuscript, and itself a refinement of [46]. In contrast to the alignment-driven Vicsek model, the D'Orsogna model hinges on attractive-repulsive interactions between particles and produces many patterns including rotating rings, traveling swarms, and vortex states -sometimes called mills-reminiscent of fish schools.
The D'Orsogna model is a continuous-time dynamical system that describes the motion of interacting point particles in an unbounded plane. The model takes the form of Newtonian force equations and thus is second order in time. The equations are Here, x i (t) 2 R 2 is the position of particle i = 1, . . ., N at time t, and v i (t) 2 R 2 is velocity. The first equation simply defines velocity as the derivative of position. The second equation is Newton's law, stating that mass times acceleration is equal to a sum of forces. These forces include self-propulsion of strength α, friction of strength β, and interaction forces described by the potential Q. The first term in Q describes repulsion of strength C r and characteristic length scale L r . The second term is similar, but describes attraction of strength C a and characteristic length scale L a . Put together, these two terms are similar to potentials used in molecular physics. In biological scenarios, typically L r < L a and C r > C a , meaning that repulsion occurs over shorter distances and is stronger. For an isolated pair of particles interacting solely according to this attractive-repulsive rule, and for appropriately chosen parameters, the potential has a unique minimum, and there exists an equilibrium distance at which attraction and repulsion balance. When one deals with an ensemble of N particles each experiencing pairwise interactions, the behavior is highly nontrivial. Restated in brief, (11) prescribes that each particle obeys Newton's law, with the relevant forces being self-propulsion, friction, and pairwise attraction-repulsion with all other particles. Arguably, one of the most intriguing behaviors of the model is the formation of mills, occurring in certain parameter regimes. These structures are annular in shape, with particles rotating around a hollow core. In a single mill, all particles travel with the same orientation (clockwise or counterclockwise). In a double mill, some particles travel clockwise and some travel counterclockwise. It is helpful to think about the topology of these states. A mill and a double mill have distinct topologies in four dimensional position-velocity space. The single mill is one connected component and one topological circle, that is, b ¼ ð1; 1; 0; . . .Þ. The double mill is two connected components each of which is a topological circle, that is, b ¼ ð2; 2; 0; . . .Þ. The two circles are concentric, nonplanar, and nonintersecting in four dimensional space.
We conduct a simulation of (11) with N = 500 particles, and model parameters α = 1.5, β = 0.5, C a = 0.5, C r = 1, L a = 2, L r = 0.5. Fig 11 shows selected simulation snapshots. For convenience, we color blue all particles traveling clockwise with respect to the center of mass of the group; particles traveling counterclockwise are red. In panel (A), at t = 5, the particles occupy a disk-like region with somewhat disorganized velocities. At t = 23, we see that a hollow core has begun to form. At t = 45, the mill structure has a well defined core. The group favors the clockwise direction, though there is a minority group of particles traveling counterclockwise. Fig 12(A) shows time series of three order parameter metrics used in [12] to characterize the global behavior of the system. Polarization P, defined in (Eq 1), is similar to φ(t) for the Vicsek model; it measures the degree to which agents are aligned. Because the group is traveling around a circle, P (red curve) remains low for the duration of the simulation. Angular momentum M, also defined in (Eq 1), helps quantify the rotation of the group. For a perfect mill structure, M = 1. For our simulation, we see an evolution from M % 0 early in time to M % 0.93 for later times (green curve). However, as mentioned previously, the metric M cannot distinguish between single and double mills, and so [12] introduces the absolute angular momentum M abs , defined in (Eq 2). The fact that M abs approaches unity signals that the asymptotic behavior of the group is rotational. The fact that M approaches a number close to but less than unity signals that a small minority of the group members are rotating counter to the majority. Fig 12(B) shows b 0 ðε; tÞ, measuring the number of connected components in the four dimensional space defined by position and velocity. In the large region below the purple (bottom) contour, b 0 ðε; tÞ ! 5. The connected components in this region persist only for very short ranges of ε; this is a noisy topological signal which we disregard. In the large region at the top, above the yellow contour, b 0 ðε; tÞ ¼ 1, indicating one connected component at the largest scales, that is, a component consisting of all the data points. The bottom and top regions in the graph are separated by a noisy boundary of intermediate curves. These curves indicate that at scales approximately in the range 0.5 < ε < 1.0 there is, over time, sporadic coagulation and fragmentation of connected components. For example, at times near t % 30 and t % 46, the region between the yellow and green contours is thicker, indicating a more persistent signal of two connected components.
Panel (C) shows b 1 ðε; tÞ, measuring the number of topological circles in four dimensional space. In the bottom region of the graph enclosed by the purple contour, b 1 ðε; tÞ ! 5 which, as before, we interpret as noise. At early times, the only non-noisy signal is the large b 1 ðε; tÞ ¼ 0 region above the red contour, indicating an absence of topological circles. Starting at t % 20, a marked transition occurs, and becomes persistent by t % 30. This transition detects the formation of a hole in the simulation, as shown in Fig 11(B). For t ! 30, there is a discernible region of the contour diagram in which b 1 ðε; tÞ ¼ 2, indicating two topological circles in four dimensional position-velocity space. For these same times, over larger values of ε, b 1 ðε; tÞ ¼ 1, indicating the loss of a circle. A topological circle could disappear by closing across its diameter or by merging with another circle. For our data, antipodal points in the mill with opposite orientations of travel are approximately 0.5 units apart (as seen in Fig 11) and this is (approximately) the proximity scale at which b 1 transitions from two to one, indicating that the two mills merge into one at this scale. The remaining topological circle will be lost when it closes on itself across its diameter for sufficiently large values of ε (not shown).
Pulling together the information from panels (B) and (C), we conclude the following. At times below t % 20, there is little topological structure. Then, a clear topological transition occurs. For later times, we have one or-intermittently-two connected components of data points. There are two discernible topological circles for smaller ε and one circle for larger ε. These circles survive for long periods of simulation time. The topological signature of the first

Conclusions
Inspired by physics, order parameters such as polarization and angular momentum have been useful for characterizing the global behavior of biological aggregations. We propose topological data analysis as an additional, valuable technology for understanding their group behavior.
We have performed numerical simulations of two well-known mathematical models of biological aggregations, resulting in point clouds of data that evolve in time. To understand the global behavior of each model, we study the topological structure of the point clouds by calculating their persistent homology. More specifically, we compute Betti numbers, which count connected components, topological circles, trapped volumes, and so forth.
To interpret the topological computations, we introduce a new visualization tool, namely a Contour Realization Of Computed k-dimensional hole Evolution in the Rips complex (CROCKER), which track Betti numbers across both proximity scale and simulation time. In topological data analysis, persistent features in a static point cloud correspond to long bars in a topological barcode. In our analysis, features persisting over scale and simulation time appear as large regions in the contour plot.
In Vicsek's model of aligning particles, the homological measures distinguish simulations that the usual alignment order parameter cannot. They also find topological similarity between simulations with different order parameter time series. In D'Orsogna's model of self-propelled, attracting-repelling particles, the topological calculations recognize the presence of a double mill state. In our study we have, for tutorial purposes, sought to explain our CROCKER plots by a subsequent manual examination of the data. That said, though phenomena such as group alignment, clustering, and double mills could be seen upon detailed examination of our raw simulation data, we would not have found them by eye if the topological methods had not first detected them.
One limitation of our work is that we have only calculated the first two Betti numbers, b 0 and b 1 , except for a small number of isolated cases in which we have also calculated b 2 . Calculating higher Betti numbers of our point clouds would yield additional information, but is computationally costly. Another limitation is that topological persistence over scale is different from persistence over time. For a fixed simulation time, the topological barcode is guaranteed to measure the same topological features through multiple proximity scales because nested sequences of simplicial complexes form a filtration over which homology persists. However, this guarantee does not hold over simulation time. For example, if b 0 ¼ 4 indicating four connected components in two successive frames of a simulation, there is no mathematical guarantee that these are the same four connected components. Nonetheless, because the aggregation models we study evolve smoothly in time, we expect persistent topological features to do so as well.
One attempt to address time evolution of topological features uses vineyards, which have been applied to protein folding in the context of level set persistence [47]. Another attempt might involve multidimensional persistence, which allows a persistence computation simultaneously over multiple parameters [48]. It could be useful to apply these two tools to biological aggregations. It could also be useful to consider another topological approach, namely braids, which have yielded insight into other dynamical systems applications such as fluids and crowd dynamics [49,50]. Finally, our main goal has been to demonstrate the utility of topological data analysis for biological aggregations and similar applications. We have used the Vicsek and D'Orsogna models as convenient examples, and focused on a small number of simulations. That said, it could be revealing to conduct large numbers of randomly-seeded simulations for fixed parameters, compute the persistent homology of each one, and average this topological data. Doing so would allow more precise quantification of the timescales and persistence scales of the topological transitions.
Topological data analysis is an active and growing area of current research. We hope that our work above contributes to the toolkit that applied mathematicians might bring to bear on models they study.

Author Contributions
Analyzed the data: CMT LZ TH. Wrote the paper: CMT LZ TH. Performed simulations: CMT.