A topological approach to selecting models of biological experiments

We use topological data analysis as a tool to analyze the fit of mathematical models to experimental data. This study is built on data obtained from motion tracking groups of aphids in [Nilsen et al., PLOS One, 2013] and two random walk models that were proposed to describe the data. One model incorporates social interactions between the insects via a functional dependence on an aphid’s distance to its nearest neighbor. The second model is a control model that ignores this dependence. We compare data from each model to data from experiment by performing statistical tests based on three different sets of measures. First, we use time series of order parameters commonly used in collective motion studies. These order parameters measure the overall polarization and angular momentum of the group, and do not rely on a priori knowledge of the models that produced the data. Second, we use order parameter time series that do rely on a priori knowledge, namely average distance to nearest neighbor and percentage of aphids moving. Third, we use computational persistent homology to calculate topological signatures of the data. Analysis of the a priori order parameters indicates that the interactive model better describes the experimental data than the control model does. The topological approach performs as well as these a priori order parameters and better than the other order parameters, suggesting the utility of the topological approach in the absence of specific knowledge of mechanisms underlying the data.


Introduction
Given data potentially described by various mathematical models, how might one choose between those models? In the context of experiments on social insects, we use topological data analysis (TDA) to inform this choice, and compare the topological approach to more traditional methods. Our investigation complements the rich literature on biological model selection [1,2].
In order to frame our investigation, we recapitulate key points of the biological experiments that form the basis of our work, namely those of [3]. In these experiments, pea aphids, Acyrthosiphon pisum, are filmed from above while they move in a circular experimental arena. For Aphids move intermittently, and so from this raw data, one may also estimate the motion state-moving or stopped-of each aphid in each movie frame. Finally, if an aphid is moving, one may estimate its direction of motion, also known as its heading. The authors of [3] propose two mathematical models. Both models describe transitions between an aphid's motion state with a random process (a biased coin flip). For moving aphids, both models describe the movement process as an unbiased, correlated random walk (described in more detail later). In the first model, referred to as the interactive model, for each aphid and at every moment in time, the motion state transition probabilities and the parameters of the random walk all depend on the distance to an aphid's nearest neighbor. This dependence is a simple way to account for the (potential) effect of nearby aphids due to social interactions. The parameters in the model are fit to the data using a simple optimization procedure. The second model, referred to as the control model, is similar to the first, except that the transition probabilities and random walk parameters are all taken to be constant, and hence, there are no social effects incorporated.
To assess the models, the authors of [3] choose several measures extracted from all of the raw data pooled over the experimental trials: the cumulative distribution of distance to an aphid's nearest neighbor; the cumulative distribution of angle between an aphid's direction of movement and the location of its nearest neighbor; and finally, the cumulative distribution of the proportion of aphids moving in each frame. They then make comparisons between these measures for the experimental data vs. the interactive model and the experimental data vs. the control model. For each of these comparisons, they use three measures of the distance between distributions: the difference between the median, the Kolmogorov-Smirnov distance (the maximum vertical distance between two cumulative distributions), and the Kullback-Leibler divergence (the expectation of the logarithmic difference between two probability distributions). For the distributions of distance to nearest neighbor and proportion of aphids moving, the distributions from the interactive model are closer to the experimental distributions than are those of the control model. For the distributions of angle to nearest neighbor, no significant difference is seen.
With this prior work now reviewed, we ask the question: what if the authors had chosen other quantities to compare? Distance to nearest neighbor, angle to nearest neighbor, and proportion of aphids moving are properties thought by the authors to be potentially important. However, there are many other aggregate measures that one might choose to select between models. Additionally, we note that if the authors had chosen only to examine angle to nearest neighbor, they might have concluded that the social interaction model and the control model are indistinguishable.
We will use tools from topological data analysis-specifically, persistent homology-to choose between models. Loosely speaking, topological data analysis looks at the shape of data, which is arguably a more neutral property than biological or physical measures that are derived from the data and are selected by an experimentalist or modeler.
As merely one example of the power of topological data analysis, we mention [4], which performs topological data analysis on numerical simulations of two influential models of biological aggregation [5,6]. In this study, biological group properties that are traditionally used in the literature-polarization of group motion, angular momentum, and so forth-are calculated for simulations conducted in different parameter regimes. The authors then compare these traditional measures with topological signatures of the data calculated using persistent homology. The topological signatures reveal important dynamical differences between simulations for which the traditional measures are similar, and conversely, detect topological similarity between simulations for which the traditional measures appear somewhat different. Also of note, the studies of [7] and [8] use so-called zigzag persistent homology to construct spatialtemporal topological features of time varying data, revealing interesting dynamic information and theoretical stability results, respectively. More broadly, topological data analysis is gaining ground as a methodology for studying biology, and has recently been applied to brain architecture [9], cancer genetics [10], epidemic networks [11] and much more.
In our investigation, for each of nine experimental trials on aphid motion, we compare data sets from experiment with those generated by the interactive and control models. We carry out these comparisons using three sets of measures. First, we use time series of order parameters frequently used in physics to assess collective motion, namely polarization, angular momentum, and absolute angular momentum. Then, we use a second set of order parameters, namely average nearest neighbor distance and percent of aphids moving. Since the control model does not incorporate social interactions and the interactive model does, one might naturally expect that this second group of order parameters should strongly reflect the model inputs. Third, we compare data using topological signatures called crockers, defined in [4].
Using these three sets of measures, we perform statistical tests to assess whether the interactive model or the control model better describes the experimental data. Analysis of the second set of order parameters, which use a priori knowledge of the model, indicates that the interactive model better describes the experimental data. The topological approach performs as well as these order parameters and better than the first set of order parameters, suggesting the utility of the topological approach in the absence of specific knowledge of mechanisms underlying the data. Additionally, our work is consistent with the results of [3], lending evidence to the idea that social interactions are important for pea aphid movement.
The rest of this paper is organized as follows. First, we provide additional detail on the experiments and models of [3], as well as on our computational implementation of the models. Then, we review key ideas from topological data analysis, with a focus on a tool known as computational persistent homology. Finally, to carry out the core of our study, we calculate order parameter time series and topological signatures, perform statistical tests on them, and compare the various approaches.

Models of aphid motion
We reprise some details of the experiments and modeling in [3]. During nine experimental trials, groups of 7 to 33 aphids were filmed for 45 minutes walking in a circular arena 40 cm in diameter. Fig 1 shows the initial state of the aphids in each trial. Fig 2(A) shows trajectories of the aphids in Experiment 9, obtained via motion-tracking software.
The model of [3] classifies each aphid according to two motion states, moving or stationary, during each 0.5 s video frame. Transitions between these states are probabilistic, with the probability of transition for each aphid depending on the distance d to that aphid's nearest neighbor. Moving aphids obey an unbiased correlated random walk. According to this model, over the course of one frame, each moving aphid takes a step of a certain length that depends on d. Then, it turns at a random angle drawn from a distribution that is centered around zero. The spread of the distribution also depends on d.
The two transition probabilities, the step length, and the turning angle distribution are modeled with various functional forms. These functional forms include parameters that are fit from experimental data. In reprising the model below, we omit explanations of the biological meanings of these fit parameters. These biological interpretations are not central to our investigation; detailed explanations appear in [3].
The probability P MS (d) of a moving aphid stopping in the next frame is with P 1 MS � 0:1280, P 0 MS � 0:5508, and d MS � 0.0134 m. The probability P SM (d) of a stationary aphid moving in the next frame is with P 0 SM � 0:1587, P 1 SM � 0:3552, d SM � 0.0079 m, and Δ SM � 0.0739 m. For moving aphids, the step length ℓ(d) is with ℓ 1 � 0.0013 m, ℓ 0 � 0.0003 m, and d ℓ � 0.0074 m. The turning angle θ is drawn from a wrapped Cauchy distribution centered at zero, The parameter 0 < ρ < 1 controls the spread of the distribution, with small values producing wider distributions and values closer to one producing strongly peaked ones. The spread parameter is with ρ 1 � 0.9013, ρ 0 � 0.1387, and d ρ � 0.0044 m. Descriptions in [3] provide biological interpretations of the functional forms above, the roles of the parameters therein, and the procedure for fitting the models to data in order to determine parameter values. For our purposes, the key point is that all four quantities P MS , P SM , ℓ, and ρ depend on distance to nearest neighbor, d. Through this dependence, social interactions enter into the model, and hence we refer to this model as the interactive model.
On the other hand, a control model from [3] ignores the effects of social interactions. The model is identical to the model above, but with d ! 1; that is to say, the model assumes aphids do not interact, so that P MS , P SM , ℓ, and ρ are constant.
Later, we will present results on numerical simulations of the interactive model and the control model described above. Simulations are seeded with the initial conditions extracted from the motion-tracking experiments. In simulating the models, it is necessary to implement of the interactive model and control model, respectively. We observe that in the experimental data some aphids were largely stationary, while others moved more around the arena. This is in contrast to both of the sample simulations, where aphid motion is much more pronounced, particularly in the control model.

Topological data analysis
Later, we will analyze our data with tools from topological data analysis (TDA). For now, we explain these methods. We provide a brief and nontechnical review of the main ideas followed by a slightly more technical explanation for the mathematically-inclined reader.
Topology is the branch of mathematics concerned with studying properties of objects that are preserved when that object is stretched, compressed, or bent (but not torn or glued). For example, a sphere and a cube are topologically equivalent because one can be bent into the other; similarly for a hollow circle and a hollow triangle. A hollow circle and a filled-in circle are not topologically equivalent because the former has a hole in the middle while the latter does not. To take a topological approach to data, we imagine that data points gathered from simulation or experiment represent some object. For instance, suppose we wish to analyze the position data of aphids from a single frame of experiment. Mathematics provides a way to create an object by taking our data points and connecting them with line segments, triangles, tetrahedrons, and other simple pieces. Then, we quantify the shape of this object by calculating the number of holes of different dimensions using topology. These quantities are called Betti numbers, and we say that they specify the object's homology. However, the particular object built from the data depends on a scale (distance) parameter called the proximity parameter or filtration value. We repeat the aforementioned process for many different values of this parameter and examine how homology changes with respect to this parameter. This is called calculating the persistent homology, and the output of this computation reveals a topological signature of our data. This entire process has been focused on data from a single time step. We can repeat the entire process for every time step, thereby producing a topological signature of our data over different values of proximity parameter and time.
More technically, our data can be viewed as a (potentially noisy) sampling from an underlying manifold or topological space. If we augment our data with a notion of distance between points then we can use persistent homology to understand the topological structure of our data at a variety of scales; see, e.g., [12][13][14][15].
Homology is a tool from algebraic topology that measures the features of a topological space [16][17][18]. Loosely speaking, homology measures the quantity of holes with boundaries of different dimensions k, and this quantity is encoded in the Betti number b k . The number of connected components (i.e., clusters) corresponds to b 0 , the number of topological circles (i.e., loops) corresponds to b 1 , the number of trapped volumes (i.e., voids) corresponds to b 2 , and so on.
One method of computing homology is via a simplicial complex, a collection of simple pieces such as vertices, edges, filled triangles, solid tetrahedra, and so on, called k-simplices for k = 0, 1, 2, 3, . . ., respectively. While there are a variety of approaches to create a simplicial complex, the Vietoris-Rips complex is a common choice as it is computationally tractable [13]. This simplicial complex is created by identifying data points as vertices and forming a k-simplex whenever k + 1 points are pairwise within some distance �, called the proximity parameter.
Therefore, the structure of the simplicial complex is highly dependent on this choice of parameter. Homology determines the k-cycles (in the case of a 1-cycle, this is analogous to a cycle within a graph) that are not boundaries of k + 1-simplices by splitting all k-cycles into equivalence classes, one for each distinct hole in the simplicial complex. Algebraically, the kth homology H k can be viewed as a quotient of vector spaces with dimension equal to the Betti number b k . The Betti number b 0 (�) corresponds to the number of connected components at scale � and b 1 (�) corresponds to the number of cycles at scale �.
To eliminate the need to choose a single proximity parameter �, persistent homology exploits the fact that as � grows, so do the Vietoris-Rips complexes, giving an inclusion of simplicial complexes for smaller � into those for larger values. Fig 3 shows the formation of a Vietoris-Rips complex over increasing � for a fixed point cloud of data. Persistent homology tracks homology classes as the proximity parameter is varied, recording the interval over which a kth order class (hole) first appears and when it no longer remains. These intervals represent the topology of a single-parameter family of spaces, a filtration, and are complete discrete invariants, meaning they capture all of the topological information in this family. For details regarding persistent homology and its computation, see [14,15,19].
In a variety of applications, including the analysis of biological aggregations, one may wish to analyze the topological structure of data as another parameter is varied in addition to proximity, for example, density or time. Unfortunately, there is no complete discrete invariant for multiple parameters similar to the interval representation for single-parameter persistent homology [20]. However, Betti numbers can still be computed as a function of multiple inputs. In [4], the authors propose the Contour Realization Of Computed k-dimensional hole Evolution in the Rips complex (crocker) as a means for visualizing the change in Betti number over two parameters. A function over two parameters can be conveniently visualized as a contour plot. Fig 4 is a toy example of a crocker plot which displays the Betti number, b k (t, �) as a function of the two inputs of simulation time t � 0 and proximity parameter � � 0, built by taking a discrete sampling of each parameter and displaying b k (t, �) first in matrix form in Fig 4(A) and as a contour plot in Fig 4(B). Large regions in the contour diagram represent topological features that persist over both the scale � and simulation time t. Alternative approaches to crocker plots are discussed as the dimension function of a 2-D persistence module in [21] or pseudo-multidimensional persistence in [22]. It should be noted that for a fixed time, the corresponding vertical cross section represents a filtration through the proximity parameter �, sometimes referred to as a Betti curve [23]. In  one time step to the next. However, as the aggregation models we study evolve smoothly in time, we expect the topological features to do so as well.
Crocker plots represent a matrix of data, and hence can be vectorized, resulting in a compact topological summary with flexible uses. First, since crockers can be viewed as vectors, they can be used in the application of myriad machine learning tools. For instance, one could use machine learning to predict model outcomes based on parameters.
Second, with this representation of the data, distinct homological dimensions may be concatenated together. For instance, the b 0(pos) and b 1(pos) crockers could be concatenated to simultaneously compare connected components and topological loops of a simulation. In some applications, e.g., [24], concatenating homological dimensions yields superior results than analyzing dimensions separately.
Third, crockers may be computed for any multi-parameter family of persistence modules and not just restricted to proximity and time, as in our work. While only two parameter families may conveniently be viewed as a contour plot, the Betti number can be computed for any given set of parameter values. For the topological features to be meaningful for one set of parameter values to the next, there should be a filtration or continuity in the data with changed parameter values. We will demonstrate that even though our data is not a bi-filtration-as there is not inclusion in the simplicial complex from one time step to the next-the simple crocker representation effectively captures topological changes of a time-varying system.

Results
We simulate aphid movement using both the interactive and control models, initialized with the same configuration as each of the nine experiments. We run 100 trials for each model-initial condition pair, for a total of 1800 trials. Each simulation has the same number of frames as the experiment from which its initial conditions were taken, ranging in number between 4605 and 5883. We compare the fitness of the two models of aphid motion with the experimental data using two different approaches: first, via several order parameters, and then, via topological data in the form of crockers.

Comparing models with order parameters
Biological group dynamics are commonly measured with a variety of order parameters designed to give some sense of the collective behavior. Examples include polarization, angular momentum, absolute angular momentum, and average distance to nearest neighbor; see, e.g., [5,6,25].
The polarization P 2 [0, 1] measures the group's global agreement on a heading, that is, to what degree individuals are moving in the same direction. Polarization is given by where v i is the velocity of the i th insect. When all individuals have the same heading, P = 1. Angular momentum M ang 2 [0, 1] measures the degree to which the aggregate behavior of the group is rotational. Angular momentum is where r i is the position of the i th individual taken relative to the center of mass of the group. When all individuals rotate perfectly and in the same orientation with respect to the center of mass, M ang = 1. Absolute angular momentum M abs 2 [0, 1] is This quantity is similar to M ang , except that it ignores the orientation of rotation. For example, a group in which some individuals move in clockwise circles around the center of mass and others move in counterclockwise circles (superposed counterrotating vortices) would have low M ang , because the two groups cancel out each other's momentum but would have high M abs . The average distance to nearest neighbor d a � 0 is This quantity gives some sense of the spatial distribution of the group. Finally, because the aphid models under consideration incorporate stop-start motion, we consider the percentage of aphids moving in each frame. In our simulation data, the algorithm outputs each aphid's motion state. In the experimental data, aphids are determined to be stationary if they move a sufficiently small distance from one frame to the next; we take this distance to be 10 −4 m. For brevity, in all of the order parameter definitions above, we have suppressed notation indicating the dependence on time, even though we will examine order parameter time series below.
We divide these five order parameters into two different categories. The first three, as mentioned previously, are simply order parameters that are frequently used in studies of collective motion. In advance of examining our data, we hypothesize that some of them may not be relevant to pea aphid motion. For example, because aphids do not appear to adjust their headings to align with their neighbors, we expect polarization may not be significant. Similarly, aphids are not known to produce strongly rotational motion, so we expect that the angular momentum order parameters may not be relevant.
The last two order parameters relate to inputs of the model, or, to restate, use a priori knowledge of the model. Because distance to nearest neighbor influences aphid step length and aphid motion state in the interactive model, it is natural to wonder whether aphid spatial distribution and overall level of movement will differ between the two models. Because stopstart motion is a key model ingredient, it is also natural to consider the overall level of movement within the group.
We proceed with order parameter analysis as follows. For a generic order parameter ϕ, we calculate ϕ in every frame of an experimental data set, giving rise to a time series ϕ exp (t s ), where s indexes the time step. For each model, we calculate the order parameter time series for 100 simulation runs, giving rise to time series � j int ðt s Þ for the interactive model and � j con ðt s Þ for the control model, where j = 1, . . ., 100 indexes simulation. We then compute the average distance between the experimental and model time series via a Euclidean norm, where the subscripts int and con refer to which model is compared to experiment.
With these distances defined, we perform a statistical analysis. Because we do not expect that either model should perfectly describe the complex biological system, we do not perform statistical tests on the null hypothesis that D int and/or D con are zero. Instead, we are interested in assessing whether one model produces data closer to the experiment than the other model. To this end, we calculate the difference in means D = D con − D int and construct 95% confidence intervals D ± R 95% , where R 95% is the radius of the interval. In constructing these intervals, we use a Bonferroni correction with n = 81 to account for the multiple comparison performed in our results, namely nine measures (five order parameters and four topological quantities, introduced below) on nine experiments. If a confidence interval does not include zero, we conclude that one of the models produces data more faithful to the experiment than the other model. Table 1 summarizes results of our statistical tests on order parameters. Data to the left of the thick vertical line summarize statistical tests performed on the three order parameters that do not use a priori knowledge of the models: polarization P, angular momentum M ang , and absolute angular momentum M abs . Data to the right of the thick vertical line correspond to the two order parameters that do use a priori knowledge of the models: average distance to nearest neighbor d a and overall percentage of insects moving Mov % . Even with Bonferroni corrections, all statistical comparisons show up as significant. We color results green or red depending on whether the interactive model or control model, respectively, is more faithful to experiment.
The order parameters that do not use a priori knowledge of the models give a mixed message. For polarization P, the interactive model (green) agrees better with experimental data for four of the experimental trials and the control model (red) agrees better for the other five. Angular momentum M ang is split as well; we see the control model as a closer fit in seven out of nine trials. Finally, for the absolute angular momentum M abs , the control model is closer for all trials. We conjecture that this somewhat counterintuitive result is related to the reflective Data to the left of the thick vertical line summarize statistical tests performed on three order parameters that do not use a priori knowledge of the models: polarization P, angular momentum M ang , and absolute angular momentum M abs . Data to the right of the thick vertical line correspond to two order parameters that do use a priori knowledge of the models: average distance to nearest neighbor d a and overall percentage of insects moving Mov % . For each order parameter and each experiment, we calculate the average Euclidean distance D int between the experimental data and data produced by 100 runs of the interactive model seeded with the same initial condition as the experiment. We also calculate D con , which is similar, but for the control model. See Eqs (10) and (11) and main text for more detail. To assess whether one model produces data closer to the experiment than the other model, we calculate D = D con − D int , shown in the boundary condition implemented in both models, which may impact the model results differentially and may be a poor reflection of biological reality. For the future, there are several potential ways to remedy this shortcoming. One would be to conduct experimental aphid trials using a large enough arena such that aphids do not encounter the boundary during the experiment, and then, correspondingly, to build a model that poses aphids on an unbounded domain. Unfortunately, this may pose a challenge because this arena would require a large camera field of view, but video camera resolution is limited. Another approach could be to conduct additional analysis on the original experimental data in order to better tease out the role played by the boundary. In contrast, the order parameters that do use a priori knowledge of the models give a consistent message. For both average distance to nearest neighbor d a and percentage of aphids moving Mov % , the interactive model data is significantly closer to the experimental data than the control model is. This is not surprising as these order parameters capture properties closely related to model inputs hypothesized to be important during the modeling process, namely the location of the nearest neighbor and motion state transitions.

Comparing models with topology
Momentarily, we will present results comparing the two aphid motion models with experiment using topology. First, to aid this presentation, we further discuss how to interpret crockers to track aphid movement. Fig 5(A) shows the b 0 crocker plot corresponding to the position data from Experiment 7. Fig 5(B) and 5(C) show analogous information for realizations of the interactive and control models, seeded with the same initial condition as in Fig 5(A). Because these plots show b 0 and are based on position data, we refer to them as b 0(pos) crocker plots. Each crocker plot starts with many connected components for small values of �, and the number of components shrinks as � increases and components are joined. We display only contours of 11 or less, inferring higher contours as topological noise. Since each of the trials begins with aphids in the same place, we would expect the contours in each crocker plot to begin at the same levels. This is the case for most contours. However, the blue contour, which represents the boundary between the regions with two and three connected components, appears to begin higher in the experimental crocker plot than it does in the two simulated crocker plots. This is due to a single aphid being lost by the motion-tracking software for one frame, right after the start of the experiment. Motion-tracking software is imperfect, and in the original experimental data, it is common for aphids to be dropped and then picked back up in later frames. This phenomenon is also responsible for the abrupt vertical changes present in some of the contours in the experimental crocker plots.
In all three panels of Fig 5, there is an entity which is isolated in early frames and only connects to the rest of the aphids for larger values of �. Looking at the initial conditions for Experiment 7, shown in Fig 1, it is clear that this entity is the single aphid at the bottom-right of the arena. In the crocker plot representing the experimental data, this aphid does not move closer to the others until nearly halfway through the trial, though the vertical drop suggests that at this point it was lost by the motion-tracking software. However, in both of the simulation trials, the aphid joins the others earlier, a result of the models' randomness and tendency toward aggregation.
Additionally, both of the sample simulation trials show an upward trend in the crocker contours over time. This trend is even more pronounced in the average b 0(pos) crocker plots for each of the models, computed by taking the element-wise average of the topological data over 100 simulations. This averaging reduces noise as illustrated in Fig 6. The upward trend in the contours signifies that aphids are moving farther apart, dispersing in the arena. That is, when the contours are low, small values of � serve to connect all the aphids into connected components; when the contours are higher, connecting the aphids requires larger values of �, meaning the aphids themselves are farther apart.   This upward trend is present in the crocker plots of simulation trials initialized with the configurations from Experiments 1, 3, 6, 7, 8, and 9, and to a lesser extent in those with initial conditions from Experiment 4. In simulation trials with initial conditions from Experiments 2 and 5, however, the contours of the crocker plots are level or even have a slight initial dip. Average b 0(pos) crocker plots for Experiments 2, 3, and 7 are shown in Fig 6 as examples. In contrast, these same patterns are not apparent in the crocker plots of the experimental data. Looking at the initial conditions in Fig 1, we see that crocker plots that exhibit upward-trending contours correspond to initial conditions where aphids are clustered tightly; the plots without upward-trending contours are from initial conditions in which the aphids are initially more dispersed. This suggests that in both models, aphids are dispersing (or, in the case of simulation trials with initial conditions from Experiments 2 and 5, aggregating) toward a preferred social distance. Additionally, the average plots from the control model rise more quickly than plots from the interactive model. Since these trends are not present in the experimental crocker plots, this preferred distance does not appear to be present in real aphid interactions. This suggests that both models could be improved to better reflect patterns of movement present in the experimental data. It is notable that our observation of this trending phenomenon emerged from the crocker plot representation, was not apparent in the previous analysis of these models [3], and was not a known characteristic a priori.
We now use topology to compare the fitness of the two models with experiment. We consider two sets of data: aphid position, which is two-dimensional, and (normalized) aphid position and velocity concatenated, which is 4-dimensional. In order to fairly compare position and velocity data simultaneously in the second set of data, we normalize both position and velocity to be on the same scale. According to [3], the longest aphid step length is 0.0013 m, and the arena has radius 0.2 m. Therefore, we normalize the position data by 0.2 and the velocity by 0.0013. We compute persistent homology of the 2-D and 4-D point clouds at each time step, with time downsampled by a factor of four to speed up computations. The output of this computation is interval data in � describing the births and deaths of topological features. We use the maximum filtration value of � = 0.2 for the position data and � = 1.5 for the concatenated position and velocity data. We then convert the interval data to crocker plots corresponding to dimensions k = 0, 1, uniformly sampling 50 proximity values from 0 up to the maximum filtration. We refer to the calculated Betti numbers as b 0(pos) , b 1(pos) , b 0(posvel) , b 1(posvel) .
We proceed with the topological analysis in the same manner as for the order parameters. Given the matrix corresponding to a crocker plot C, we compare the experimental crocker plot C exp to each of the 100 simulations runs of both the interactive and control models, denoted C j int and C j con , respectively, where j = 1, . . ., 100 indexes simulation. We compute the average distance between the crocker plots of the experiment and the two models by taking the Frobenius norm of the matrices corresponding to the crocker plots D con � hjjC exp À C j con jj Fro i j : As before, we compute the difference in means D = D con − D int and the 95% confidence radii on D. Table 2 summarizes results of our statistical tests on the topological data. Again, even with Bonferroni corrections, all statistical comparisons show up as significant. We color results green or red depending on whether the interactive model or control model, respectively, is more faithful to experiment. All four of the topological representations give a consistent message that the interactive model crockers are significantly closer to the experimental crockers than the control model. The consistency revealed by these statistical tests shows that the topological crocker representation captures characteristics of the global motion of aphids that agrees with intuition behind the model development, namely that the model accounting for social interactions is more consistent with the experiment. The key point is that the topological approach uses no a priori knowledge about the data sets, nor the models that generated them.
In addition to these results, we perform analogous statistical tests on concatenated b 0(pos) and b 1(pos) crockers as well as concatenated b 0(posvel) and b 1(posvel) crockers. The results are similar to the dimension k = 0 representations in Table 2 as the k = 1 features are sparse.

Discussion and conclusion
We have assessed the goodness-of-fit of two models of aphid motion: an interactive model that describes social behavior of the insects and a control model that omits this effect. To compare the model results to experimental data, we performed statistical tests on three different sets of measures: order parameters that do not use a priori knowledge of the models, order parameters that do use this information, and topological measures in the form of crocker plots. Order parameters that rely on a priori knowledge give a consistent message, namely that the interactive model better describes the experimental data. The topological measures match the performance of the a priori order parameters and outperform the other order parameters. Of course, the topological measures do not rely on knowledge of the models. Thus, we find that adopting a topological lens may be a useful approach for characterizing and comparing motion of biological groups when one has little information about what model mechanisms might be important.
Crocker representations have the potential to encode richer information than traditional order parameters. Consider aphid proximity. The configuration of any aphids at a fixed moment in time will be joined into a single connected component when � reaches the maximum distance between any pair of aphids. On the other hand, the nearest neighbor order parameter condenses information into an average, losing nuance that the contours in the crocker plot retain. As a concrete example, consider the following two scenarios. In the first, These data summarize statistical tests based on four choices of topological measures: the zeroth and first Betti numbers of aphid position data as well as (normalized) aphid position and velocity concatenated, b 0(pos) , b 1(pos) , b 0(posvel) , and b 1(posvel) . Results are analogous to Table 1 but rather than using a Euclidean norm of the differences of vectors representing order parameter time series, we use the Frobenius norm of the differences of matrices representing crocker plots. See Eqs (12) and (13) and main text for more detail. As in Table 1, cases where the Bonferroni-corrected 95% confidence radii on D = D con − D int confidence interval, R 95% , exclude zero are colored green or red depending on whether the interactive model or control model, respectively, is more faithful to experiment. https://doi.org/10.1371/journal.pone.0213679.t002 imagine a single cluster of 10 objects where the average distance to an object's nearest neighbor is 2 cm. In the second, imagine ten objects scattered in five pairs, where the two objects within each pair are 2 cm apart, but two aphids from different pairs are at least 5 cm from each other. The nearest neighbor order parameter would be equal to 2 in both cases but the b 0 crocker plots would be able to distinguish between differing configurations. In the crocker for the first case, there would be only one component for � � 2. However, in the second crocker, there would be five connected components at � = 2, and the number would only decrease for � � 5. Thus, while the average distance to nearest neighbor order parameter and the crocker plot are related, the latter contains more information.
One limit of our approach involves computational complexity. Computation of order parameters involves determining a single number for each time step. Crockers, on the other hand, compute persistent homology of a point cloud at each time step, which is a multi-scale approach and is thus a more costly computation; see [4] for a discussion of the computational complexity. Developing fast software to compute persistence is an active area of research in the topological data analysis community, and recent advances have sped up computations by several orders of magnitude; see [26] for benchmarking on a set of algorithms and implementations. Nevertheless, computing persistent homology will be more computationally expensive than computing a traditional order parameter. Still, we have found it to reveal useful information about the structure of the data.
Another limit of our approach involves our choice of norm. We have selected the Euclidean norm to compute the distance between order parameters. Therefore, for consistency, we selected the Frobenius norm to compute the distance between crocker matrices, since it is analogous to the Euclidean norm on a vectorized matrix. Another natural choice of metric on crockers might be the Wasserstein distance, commonly called the earth mover's distance. Selecting other distance measures could change our results, and we leave this as future work.