Expectation-Maximization Binary Clustering for Behavioural Annotation

The growing capacity to process and store animal tracks has spurred the development of new methods to segment animal trajectories into elementary units of movement. Key challenges for movement trajectory segmentation are to (i) minimize the need of supervision, (ii) reduce computational costs, (iii) minimize the need of prior assumptions (e.g. simple parametrizations), and (iv) capture biologically meaningful semantics, useful across a broad range of species. We introduce the Expectation-Maximization binary Clustering (EMbC), a general purpose, unsupervised approach to multivariate data clustering. The EMbC is a variant of the Expectation-Maximization Clustering (EMC), a clustering algorithm based on the maximum likelihood estimation of a Gaussian mixture model. This is an iterative algorithm with a closed form step solution and hence a reasonable computational cost. The method looks for a good compromise between statistical soundness and ease and generality of use (by minimizing prior assumptions and favouring the semantic interpretation of the final clustering). Here we focus on the suitability of the EMbC algorithm for behavioural annotation of movement data. We show and discuss the EMbC outputs in both simulated trajectories and empirical movement trajectories including different species and different tracking methodologies. We use synthetic trajectories to assess the performance of EMbC compared to classic EMC and Hidden Markov Models. Empirical trajectories allow us to explore the robustness of the EMbC to data loss and data inaccuracies, and assess the relationship between EMbC output and expert label assignments. Additionally, we suggest a smoothing procedure to account for temporal correlations among labels, and a proper visualization of the output for movement trajectories. Our algorithm is available as an R-package with a set of complementary functions to ease the analysis.


Introduction
Current movement research is undergoing a revolution.The growing capacity to collect highresolution spatiotemporal movement data and radical improvements in data management and processing are unprecedented in this field and reminiscent of the bioinformatics revolution of genomics and proteomics two decades ago [1].A key challenge now is the analysis of massive movement datasets with largely different sampling rates and accuracies (e.g.high resolution GPS, standard GPS, Argos satellite, geolocation).In particular, it is critical to identify, in an unsupervised manner, movement trajectories' functional units, known as behavioural modes [1][2][3].The behavioural mode is to the movement trajectory what gene is to the DNA sequence [4].
Animal movement can be analyzed as a set of measurable behavioural responses to a combination of internal states, environmental factors, and evolutionary/biological constraints.Such behavioural responses or modes are manifestations of the organism's decision mechanism, providing information about the cognitive process driving the movement [5].Identifying behaviourally significant movement modes is crucial to bringing research beyond mere statistical descriptions of movement patterns and unraveling the underlying biological processes that determine animals movement and behaviour.Establishing robust connections between patterns and processes is important for the biological interpretation of the data but also for nurturing models of movement with larger predictive capacity.In this sense, only a mechanistic approach to movement modelling can really contribute to solve relevant environmental problems such as vector-born disease spread, migratory route alteration due to climate change, or the preservation of endangered species.
Classic approaches to movement trajectory segmentation focus on the trajectory's structure by using local measures based on tortuosity [6], first-passage time [7], residence time [8], and positional entropy [9,10].More sophisticated procedures involve Bayesian estimation methods to model spatio-temporal probabilities of being in a given behavioural mode or state [11,12].These can include location errors as well as environmental information [13][14][15][16].Recent examples clearly show the suitability of state-space models, in particular hidden Markov models (HMMs) [17,18] and multi-state models for continuous time processes (MSM) [19,20], for estimation and inference of behavioural modes.State-space approaches provide mechanistic and statistically sound insights about movement patterns but rely on strong a priori assumptions about movement organization into states and the distributions that regulate state transitions, usually requiring species-specific and time-consuming parameter estimations.There are several frameworks for state-space modelling and the criteria to identify the best framework for a given problem still remain unclear [21].Other methods like Behavioural Change Point Analysis [22] or t-Stochastic Neighbouring Embedding (tSNE) [23,24] are more direct than state-space modelling but are limited by their computational intensity and because they describe behaviours in a continuous parameter space not easy to interpret or discretize into modes.
Ultimately, the challenge is to devise unsupervised methods based on sufficiently sound and robust statistical methodologies, but keeping results understandable.Cumulative sums based methods [5,25] approach this ideal.Conversely, behavioural annotation procedures that require intense computational resources or heavy data-specific supervision [17,18,20,24] are limited when dealing with massive amounts of data, or when comparing patterns across different populations, ecological contexts (e.g.breeding vs. migration), or tracking methodologies.
Here we develop a general, robust and computationally efficient method to identify behavioural modes in movement trajectories based on a non-supervised multi-variate clustering algorithm, that takes into account the correlations and uncertainty of the variables, is sensible to multiple time-scale events, and yields a meaningful labelling that can be easily linked to a semantic annotation.The method stands out in accomplishing a good compromise between the statistical significance and the biological interpretability of the output.It is based on the assumption that behavioural modes can be described by a mixture of Gaussian distributions over a binary partition of the input space but with no further prior assumptions, thus minimizing biases and sensitivity to initial conditions.

Materials and Methods
Gaussian mixtures are not new in animal movement modelling.The Gaussian mixture assumption compares to the current use of composite Brownian motion in mechanistic models of search [26,27].Also, modelling approaches that assume Gaussian mixtures for movement variables such as speed have already proved useful for classifying animal tracking data into discrete movement modes [28,29].The novelty here is to use Gaussian mixtures into an EM-Clustering framework to behaviourally annotate movement trajectories in a meaningful way.

Expectation-Maximization Clustering
The Expectation-maximization EM algorithm [30,31] is a well sounded, general, iterative procedure for the maximum likelihood estimate of a parametric distribution underlying some given data eventually incomplete or with missing values.A particular case of this algorithm is the parameter estimation of a Gaussian Mixture Model (GMM) when the generating Gaussian of each observation is unknown, commonly known as Expectation-Maximization Clustering (EMC), a well known methodology for the identification of clusters (i.e.different classes or patterns) in a multivariate data set.
The EMC formal statement is the following: • given a set of data X = {x 1 , . . ., x n }, where each data point , (1 ≤ i ≤ n), is a vector of values corresponding to m variables, the EMC fits a k multivariate-Gaussian Mixture Model defined by the parametric set Θ = {µ 1 , Σ 1 , π 1 , . . ., µ k , Σ k , π k }, where µ j , Σ j , π j are respectively the vector of means, the covariance matrix and the mixing coefficient of multivariate Gaussian j, (1 ≤ j ≤ k).
EMC is a two steps iterative optimization method to estimate Θ * , alternating between estimates of the probability of a particular observation belonging to each cluster, and estimates about the parameters Θ that maximize the likelihood of these probabilities.For a GMM, the maximization equations have a forward analytical solution [32,33] that greatly simplifies the optimization procedure.In a few words, the algorithm proceeds as follows: 1. Initialization: take a guess Θ g over the set of parameters; 2. Iteration loop: given the current guess Θ g , (a) Expectation step: for each data point i and each cluster j, compute the likelihood weight w ij , (a posterior probability), of x i being generated by Gaussian j, given by, where N x i | µ j , Σ j is the multivariate Gaussian density function: (b) Maximization step: compute a new set of parameters Θ new that maximizes the likelihood of these weights, given by the expressions, where µ (c) Use Θ new as the parametric guess for the next iteration, that is, take Θ g ≡ Θ new .

3.
Output classification: at the end of the process, each data point is assigned to the most probable cluster.
This iterative procedure is theoretically guaranteed to increase the likelihood at each step and, although the algorithm does not promise to reach a global maximum of the likelihood function, it is indeed guaranteed to converge to a local maximum, dependent on the initial conditions.In practice, it is common to start it from multiple random initial guesses and select the one with the largest likelihood.Usually, the process is stopped after a prefixed number of iterations or when the increments of likelihood are less than a prefixed δ.
In a typical EMC implementation, the number of desired output clusters must be specified, and the algorithm will return that number of clusters.A value σ (r,r) min > 0 (1 ≤ r ≤ m) must be specified to limit the minimum variance of each variable.This parameter avoids errors derived from indefinite covariance matrices along the optimization process and, in practical terms, can also be used to prevent the algorithm from yielding clusters of undesired levels of precision.

EMbC Algorithm
There exists a family of variants of the EM algorithm, known as the generalized EM algorithm, where the main focus is to overcome particular problems (e.g.difficult E-step/M-step computations, slow convergence, [34]).The Expectation-Maximization binary Clustering (EMbC) algorithm is a variant of the EMC algorithm [30,31] based on a relaxation of the maximization step.The novelty resides in the fact that the clustering is driven towards a meaningful classification that should be easier to interpret by experts.Also, as introduced in [35,36] the algorithm allows taking into account uncertainties associated to the data points.The aim is to address (i) clustering interpretability and, (ii) data reliability, two key issues in behavioural annotation of movement data.

Clustering Interpretability
In any unsupervised clustering procedure, one should distinguish the elementary labelling identifying the different clusters, from the semantic labelling, the intuitive interpretation of the different classes identified, directly evoking specific behaviours, and these two labellings are not always easy to link when the output comes from a classical implementation of the EM-Clustering.Therefore, we want to avoid clustering configurations that are difficult to interpret in behavioural terms, that is, that do not show clear semantics.
Meaningful clustering is achieved by introducing a set of parameters, denoted as delimiters.A delimiter is a value that splits the range of a variable into a binary discretization.The whole set of delimiters defines a partition of the variable space into regions where each variable takes either low (L) or high (H) values.The binary nature of this partition is what favours the link between elementary and semantic labelling.Classical behavioural annotation is commonly based on velocity and a turning behaviour estimate (e.g.turning angle, angular correlation, tortuousity).In this case, a binary labelling has a direct intuitive interpretation: low velocities and low turns (LL) can be interpreted as resting, low velocities and high turns (LH) as intensive search, high velocities and low turns (HL) as travelling or relocation, and high velocities and high turns (HH) as extensive search.The semantic annotation is however variable-dependent and species-specific.
In a general multivarite case, each delimiter is associated to two adjacent clusters having the same combination of low and high values for all variables, except for the splitting variable, which takes low values in one and high values in the other.In other words, we have one delimiter for each variable and each combination of highs and lows of the rest of the variables.For m variables this makes a total of m 2 m−1 delimiters.We use a multivariate notation denoting delimiters by r Z where Z is a length m subindex, based on an ordered sequence of the variables.Each element of the subindex is either L or H except for the splitting variable for which we use a dot, according to the combination of values at both adjacent clusters.As an example, in a 3-variate case, r L.H denotes the delimiter for the second variable, separating the two clusters LLH and LHH, in which the first and the third variables take low and high values, respectively.
Conceptually, the delimiters are related to the frontier of equiprobability between two adjacent clusters, and are used to bound the computation of the Gaussian means within the regions that they delimit.In such a way, the mean of each cluster can not drift away from its associated binary region.To illustrate this issue, we show a comparison of two bivariate (velocity and turning angle) clusterings of the same trajectory (Figure 1): one resulting from a classical EMC implementation (left panel), and the other resulting from the EMbC variant, the dashed lines depicting the final value of the delimiters computed by the EMbC algorithm (right panel).Starting with exact initial conditions, these two algorithms yield output clusterings corresponding to different local optima.In the left panel it is difficult to grasp a clear cluster semantics, based on velocity/turn.In the right panel the clustering shows a meaningful partition of the variable space into LL/LH/HL/HH regions, accounting for a clear cluster semantics.
Regardless of the values of the delimiters, the EMbC assigns data-points to the most probable cluster.This is the reason for the few mismatches that can be observed between the clusters and the binary regions in the right panel.Only in case of equal probabilities, the delimiters constitute a further criterion to assign labels to data-points.
There is no guarantee that either of the algorithms (EMC and EMbC) will always be better in terms of likelihood.Both EMC and EMbC will just reach the best optimum attainable from any given starting point.In our example, the local optimum based on EMbC is better (Figure 1 right panel), but this must not be generalized.The concern here is not to reach higher likelihood partitions but rather to reach meaningful partitions even at some cost in likelihood.

Data reliability
Movement trajectory data is associated to different sources of uncertainty: (i) spatial errors due to location inaccuracies, depending on the system used (e.g.GPS, ARGOS), and also varying due to possible interferences in signal transmissions of the geopositioning system; and (ii) temporal errors due to difficulties in inferring a location at a given time, which generates sampling heterogeneity.Therefore, estimated variables such as velocity or turns, which depend on the sampling rate and on the locations themselves (Section 8.1), present different degrees of reliability or accuracy.
As introduced in [35,36], reliability of the data is implemented as an additional weighting coefficient in Equations 3 and 4, giving less weight to the less accurate values in the estimation of the Gaussian parameters, and favouring, instead, the more accurate ones.These coefficients should be given by a reliability function that can not be generalized, as it will be variable-specific and dependent on the source of error considered.As a general rule, however, it should yield normalized reliability values.For instance, an adequate reliability function to account for the uncertainty in the velocity values due to heterogeneous sampling rates, can be defined in terms of the time interval τ i = t i+1 − t i associated to location i with timestamp t i , where the superscript l accounts for the component index of variable velocity, and τ is the most frequent time interval (the mode of the τ frequency distribution), for which we assume the baseline of reliability of the trajectory, so that for τ i ≤ τ we assign a maximum reliability u decreasing as τ i increases.Equation 5 can be generalized to q uncertainty components to include other sources of error leveraging on the measured variable (e.g. the spatial error of the sampling device).In this case we consider a vectorial reliability function u ) and take its normalized length as given by,

Implementation
Implementing the modifications described above imply relevant changes in the M -aximization step of the algorithm: 1. at each iteration, and given the current parameters, we must first compute the delimiters.In a few words, the values of the delimiters are computed as the point of minimum difference in likelihood weight along the straight line connecting the means of the adjacent clusters (see Figure 2).
2. for each cluster j, we must determine the set R j of points lying in the region determined by the corresponding delimiters, (we note that in the general case, the delimiters will not constitute a perfectly definite partition of the variables space, and some points may belong to different regions at the same time, as shown in Figure 3, contributing to the computation of the Gaussian means of all related clusters); 3. finally, we recompute the Gaussian parameters, bounding equation 3 to the sets R j and including the reliability function in equations 3 and 4, that is, where u (r,s) i weights the combined effect of uncertainty on variables r and s, and is computed as, The delimiters become an essential part of the parametric set Θ, and therefore, the model is no longer a standard GMM but a restricted variant.The optimization through the likelihood space is driven by the new conditions imposed, which force each Gaussian to have its mean inside meaningful regions, restricting the potential positions of the cluster centroids and the type of clusterings allowed.
A major consequence is that equations 7 and 8 do not correspond to maximization expressions.This change however, does not jeopardize the convergence of the EMbC algorithm.We have no formal proof but our empirical tests have shown that convergence is always achieved.The effect of our modifications in the EMC algorithm is a constantly increasing likelihood optimization process, interspersed with likelihood drops at sporadic iterations.Every drop in likelihood can be considered a restart in the likelihood landscape from a new (but not so blind) guess, with the likelihood being lower compared to the previous step but higher compared to the likelihood value from which we started.
Unlike the EMC algorithm, the number of output clusters is given here by the number of variables used, k = 2 m .However, during the process of likelihood optimization, some clusters can be merged.Thus, k = 2 m is only an upper bound to the final number of clusters.This limitation in the number of clusters is not a drawback but rather a consistency with our main motivation of favouring the semantic interpretation of the final clustering.Although there is no restriction on the number of variables (we are not considering computational limitations) the EMbC is intended to be used with not more than 5 or 6 variables, yielding a maximum of 32 or 64 clusters, what is usually far beyond the number of potential behaviours of interest in any biological application, and far beyond the number of behaviours that an expert might easily handle.The key point here is to determine a few variables conveying the right information to decode the set of behaviours of interest.
The parameter σ (r,r) min can be set by default to orders of magnitude lower than the expected variances (e.g σ (r,r) min = 10 −32 ).However, it can also be used to limit the minimum range of variability (i.e.minimum standard deviation) within the clusters obtained.Rather than a subjective question, this is usually related to the physical concept expressed or measured by the variable under consideration.For instance, regarding to movement variables like velocity and turn, σ (r,r) min can be directly related to physical/biological constraints as well as to measurement errors (or maximum resolution) of geolocation devices.Thus, values of σ (r,r) min in the order of 0.01 m/s for velocity and 0.087 rad (5 degrees) for turns, would work for a wide range of species.For this reason, σ (r,r) min should be regarded as a variable specific factor rather than a user free parameter, herein, our claim that the EMbC is indeed a parameter free algorithm.
Also important is to minimize prior assumptions, biases and sensitivity to initial conditions.With this aim, the EMbC starting point is automatically set as the most uninformative condition, that is: (i) each data point is assigned a uniform probability of belonging to each cluster, (ii) the prior marginal distribution of the clusters is also uniform (each cluster starts with the same number of data points), (iii) the starting partition is selected based on a global maximum variance criterium.

Smoothing
The EMbC algorithm generates local behavioural annotations without considering their temporal context.Based on EMbC, labels are given for each observed location and reveal any small change in behaviour irrespective of how this change is framed within a predominant behavioural mode at a larger scale.This fine level of detail is important for posing further meta-analysis procedures (e.g.sequence/pattern mining), but it might not constitute the expected output if the aim is to track coarser-grained behavioural patterns.
If coarser patterns are desired, the EMbC provides two means for smoothing the output: 1. a pre-processing of the trajectory using running windows to compute averaged local measures, with the length of the running window representing a behavioural scale of interest; 2. a post-processing of the output based on the temporal behavioural correlations, a feature explicitly implemented in state-space segmentation algorithms [11,12,17,18].
In the latter case, the smoothing function makes use of the likelihood weights w ij of location i belonging to cluster j, information provided by the EMbC algorithm.In its most basic implementation, the function looks for singles, that is, locations with labels that differ from equally labelled neighbouring locations, and checks the condition (w ic − w in ) ≤ δ w , where i is the single location index, w ic and w in are the likelihood weights with respect to its current and its neighbouring assignments (clusters c and n, respectively), and δ w is a threshold parameter expressing the user's will to accept the change.The subjectiveness of this parameter is our reason for keeping this smoothing function apart from the overall clustering procedure.

Visualization
Another aspect of behavioural annotation of movement trajectories is finding adequate ways to visualize the results.Although the EMbC labelling is local, behavioural labels tend to correlate in time.We propose a visualization based on three dimensions: space, time, and behaviour, using coloured lines and circles (colour for behaviours, line lengths for space, and circle radius for time).We use this visualization to show the segmentation of our empirical trajectories in Figures 6,7,8,9.This involves the conversion into unique lines of all consecutive locations sharing the same label (bursted visualization).At the mid-point of each generated burst we associate a circle with radius proportional to the duration of the behaviour.This enables the viewer to discern the distance travelled and the duration of each behaviour simultaneously.If distance travelled is small but duration is large, the circle dominates the line.If the opposite is true, the line dominates over the circle.

Experimental Setup
We used simulated and empirical trajectories to asses the EMbC algorithm and illustrate its main features.Without loss of generalization, behaviour annotation of the trajectories was based on local measures of velocity and turn (Section 8.1).
Synthetically labelled movement paths allowed us to measure the performance of the algorithm.Beyond this quantitative assesment, our main goal was to explicitely depict the gap that the EMbC algorithm fills in comparison to closely related methods such as EMC and HMM.We used the implementations of these algorithms included in the R-packages EMCluster [37], [38] and DepmixS4 [39].
We also applied the EMbC algorithm to a variety of empirical tracks (see Table 1) including high-resolution GPS (shearwater, stork), standard GPS (bat), and ARGOS (osprey).We visualized the algorithm outputs and we also checked its robustness with respect to two potential sources of error: data loss and data inaccuracy.
We studied the robustness to data loss by removing data points from the precomputed set of velocity/turn pairs (Figure 11a).Note that we cannot study data loss by eliminating locations from the trajectory because this would simply change the values of velocity and turn (which depend on actual sampling intervals), leading to a totally new dataset.In the subsampling process it is important to preserve the underlying behavioural distribution.Thus, subsampled datasets were generated by assigning a uniform random value 0 ≤ p i ≤ 1 to each data point and removing all those points with p i < k dl , with 0 ≤ k dl ≤ 1 being the data loss factor.For each empirical trajectory, we generated datasets with different values of k dl and we compared each of them with the full dataset.
We also explored to what extent the implementation of reliability decreased the sensitivity of the final clustering to data inaccuracies due to sampling heterogeneities.We did so by jittering actual values of velocity/turn in the scatter plots (Figure 11b) using a bounded uniform distribution as noise function, where x i and xi are the original and jittered velocity/turn values, min (x) and max (x) are the minimum and maximum x values respectively, and, which positively correlates the jittering range with the length of the sampling interval, with correlation factor k di max (x) ; 0 ≤ k di 1.The larger the time gap between two successive locations, the larger the potential inaccuracy in our velocity and turn estimates.For each empirical trajectory, and for different k di values, we compared the labellings obtained in jittered datasets with the corresponding non-jittered labellings, with and without implementing reliability in the EMbC algorithm.This is a particular example focused on the uncertainty derived from sampling heterogeneity but the same could apply to other sources of uncertainty, such as location spatial errors.The effects would be similar, since the higher the uncertainty of the values, the less its influence in determining the final clustering.

Results
We show plots of the trajectories with colour labelled locations (point wise and/or burst wise visualization), with insets on some particular chunks of interest.We also show the clustering with a coloured scatterplot of the input data set, a temporal profile of the labelling to depict the degree of temporal correlation of behaviours, and also histograms of both velocity/turn variables to highlight the strength/weakness of the correlations among them that will ultimately determine the shape of the clustering (Figures 5, 6, 7, 8, 9).

Simulated trajectories
On average, the EMbC algorithm almost perfectly captured the modelled clusters, with some expected sensitivity to both the definition of the clusters γ and the size of the data set n (Figure 4).Performance was above 90% for n ≥ 200 decreasing arround 80% for the shortest trajectories.
With Markov-chain-based trajectories all algorithms presented a similar behaviour: for γ = {0.01,0.05} (blurred clusters) the performance of the EMbC was between the HMM (the best) and the EMC, but for γ = 0.1 (well defined clusters) and beyond n = 200 the EMbC outperformed the HMM.Conversly, for mixture-prior-based trajectories, the EMbC and the EMC presented similar results, the EMbC performing slightly better for low values of γ and n, while the performance of the HMM presented a clear degradation in relation to the previous set.
These results highlight how each algorithm makes use of the sources of information implicit in the data set: the binary partition of the variable space, more or less defined depending on the γ factor, and the temporal correlation of the states, dependant on the transition matrix, and consequently only present in the case of the Markov-chain-based trajectories, both sources obviously enhanced by the size of the trajectory.
The EMbC and the EMC totally disregard the temporal correlation but take clear advantage of the definition of the clusters, present in both sets of tajectories, thus achieving similar degrees of performance in both cases, (slightly better for the mixture-prior-based set).In general, the larger the size of the data set, the more evidence about the space partition, and the better the results for both.But the special fitness of the EMbC for these data sets shows up when the information is compromised, either because the clusters are not so well defined (low valuse of γ) or because the amount of information is less (low values of n), (Section 8.4, Tables 2 and 3).
Conversly, the HMM takes great advantage of the temporal information and therefore its performance gets stack far below the EMbC and the EMC when such information is not present.Furthermore, it takes profit of the size of the data set when the space partition is clear (γ = 0.10), but when it is not, larger amounts of information do not guarantee better results because temporal evidence is unfaithful, (Section 8.4, Table 3).
Performance results aside, the EMC and the HMM are extremely sensible to the starting conditions and require multiple runs while the EMbC runs straightforwardly and keeps a high stability in the results with the lowest values of root mean square error (RMSE), (note the RMSE values in Section 8.4, Tables 2 and 3).
Figure 5 shows the EMbC output for an example of a synthetic trajectory (n = 800, γ = 0.0, no induced cluster separation).The clusters were defined with 96% accuracy with an F-measure value of 89% (Figure 5 bottom table).

Empirical trajectories
Four annotated empirical trajectories are shown (Figures 6,7,8,9), together with the velocity/turn scatter plot, the behavioural temporal profile, and the velocity and turning angle distributions.The turn distribution is depicted separately for low/high values of velocity in light/dark grey colours respectively, to show the existance (or not) of correlation among both variables.In Figure 6, the annotated trajectory is shown at the location level and bursted.For the rest of Figures we show bursted visualization and insets showing some trajectory detail.For all Figures , foraging (LH, red) and stoppover (LL, orange) areas are clearly identified and distinguished from relocation behaviour (HL, cyan).A similar partition layout and a similar semantic labelling can be applied regardless of the ecological context (migration/foraging), or the sampling resolution (Argos/GPS/HR-GPS).
The Cory's shearwater (Calonectris diomedea, Figure 6) dataset, a foraging trajectory, and the Osprey (Pandion haliateus, Figure 7) dataset, a migration trajectory, both result in standard velocity/turn partitions (Figures 6a, 7a), which may show some variation depending on the specific velocity and turning angle distributions, the marginal prior distribution of clusters, and the total amount of data.Within these standard layouts, the HH (blue) labelling is usually subject to more subtle semantic interpretation.In Figure 6a, the distribution of HH in the scatter plot suggests the existence of two posssible submodes, one more closely related to foraging (low velocity and wide turn range) and the other more closely related to relocation (high velocity and low turns).This could indicate that three clusters (LL,LH, and HL) would represent a better explanation of the behaviour of the animal but the likelihood payoff of this solution prevents the algorithm to reach it.Conversly, Figure 7a shows a more homogeneous HH cluster.A visual check of these set of points on a landscape map reveals that they correspond to long relocations within stopover areas, thus making them worth to be conformed in a different behaviour.
The Straw-coloured fruit bat (Eidolon helvum, Figure 8) dataset shows a different solution to the trade off between the final number of clusters and the overall likelihood of the partition (Figure 8a).The reason is two folded: on one hand the resting times of the daily cycle have been systematically removed from the sampling with a specific fixes schedulling; on the other hand, theres a strong correlation among large values of velocity and low values of turns.Therefore, the HH cluster is not present, and the LL cluster looks more as an artifact of the algorithm's attempt to identify a cluster that indeed does not exist.Curiosly, a few data points with large values of velocity, but also with high values of turn relative to the HL cluster are included in the LL cluster.By looking at the trajectory we observe that these data points seem to correspond to specific landmarks in the daily relocations of the animal (only at the start/end of the relocations, plus one systematicaly revisited intermediate point highlighted in Figure 8 inset (still pending)), suggesting that these points might have some biological relevance.The omission of this potential HL cluster is probably due to a spurious positive correlation that appears in the LL cluster conditioning its shape.
We used this dataset as an example to show the effect of the previously described post smoothing procedure.In Figure 8b we show the behavioural temporal profile (top line) compared to the smoothed temporal profile (center line), and signaling the differences (bottom line).As a result of the smoothing procedure, most LL labels turned into LH and HL and we obtained a final profile showing a regular LH/HL (foraging/relocation) pattern.However, some LL labels still remain, suggesting a quite systematic pattern on transient locations.Indeed, it is characteristic of the EMbC algorithm to capture behaviours showing strong correlations among movement variables despite being short in time and happening only intermittently.The decision on whether to consider or else to smooth out these type of behaviours relies on the expert decision.
The stork (Ciconia ciconia, Figure 9) dataset, a HR-GPS migration trajectory, shows how the potential of the velocity/turn clustering for behavioural annotation can be extended by including a simple covariate like the solar height [40] as a third variable.As a first advantage, a trivariate clustering has an increased capacity of identifying up to 8 different behaviours.Besides of this, including a solar position covariate (either height or azimuth) is the most natural means to account for time in the identification of behaviours.The counterpart is that the semantic interpretation must be done with care as not always low/high values of the covariate correspond to nocturnal/diurnal behaviours.
For high values of the covariate (mainly daytime) we got a standard layout partition (Figure 9b).The high density of this scatter plot serves to illustrate that maximum likelihood cluster partitions should not be considered sharp but fuzzy frontiers.Values conforming the frontier of a given cluster tend to be far from cluster centroids, which means that the probability of belonging to the cluster at either side of the border may be quite similar.In this case, the proposed smoothing procedure considering the temporal correlations of behavioural labels can be used to reassign labels which belong, with similar probability, to more than one cluster.Such procedure can yield more realistic behavioural segments and more refined cluster frontiers.
Conversly, nightime yielded more differentiated behaviours.(This is not yet finished)

Robustness to Data Loss
We also assessed the robustness to data loss of the EMbC algorithm.In general, the larger the dataset, the more robust is the EMbC labelling to data loss (Figure 10 a).However, the absence or presence of strong heterogeneities in the marginal distributions of clusters also plays a role.For example, although the Osprey and the Straw-colored fruit bat datasets are both small (n = 594 and n = 434, respectively) the former is more robust to data loss.Interestingly, Osprey data shows more uniform posterior marginal distributions of clusters (LL=17.51%,LH=35.02%,HL=29.97%,HH=17.34%)than the bat data (LL=10.60%,LH=43.09%,HL=46.08%,HH=0.00%).As it is a nocturnal bat, the daily resting in the roost was systematically (and intentionally) skipped by a pre-fixed sampling scheme, and therefore, the LL cluster commonly associated to resting behaviour is misrepresented (LL=10.68%)and indeed does not show such type of semantics.In general, pre-assigned GPS fixes schedules will bias the sampling distribution of behaviours, thus conditioning both the labelling outcome and the robustness of the results to data loss.

Robustness to Data Inaccuracy
Finally, we assessed the effect of accounting for data reliability in the robustness of the EMbC algorithm to data inaccuracy (i.e.sampling heterogeneity) (Figure 10 b).In high resolution GPS datasets (i.e.Cory's shearwater), the algorithm shows a strong robustness because innacuracies are expected to be low, and the effect of accounting for data reliability is almost non-significant.However, in ARGOS datasets with large sampling heterogeneities (i.e.Osprey), or trajectories with compelling structured heterogeneities (i.e. a prefixed sampling schedule with intentional data gaps as in the Straw-coloured fruit bat), the effect of accounting for data reliability is significant, and results in a larger robustness to data inaccuracy.

Discussion
Splitting a trajectory into its most basic components is essential for studying and understanding mechanisms of movement [1,41,42].Segmentation algorithms constitute an essential component of ecological spatio-temporal analyses that seek to mechanistically understand organisms' interactions with the environment.Current segmentation methods show gradients of complexity, supervision requirements, and sensitivity to initial conditions, as well as to sampling rate and data accuracy [5,17,18].We have developed an Expectation-Maximization algorithm seeking to minimize complexity, supervision, and sensitivity.A key concept of the EMbC algorithm, moreover, is to reach a compromise between interpretable behavioural trajectory segmentation and adequate statistical performance.The EMbC algorithm relies on a likelihood criterion to segment movement-related behavioural patterns, and assumes multivariate Gaussian clusters and a binary discretization of the variables.In terms of implementation, this implies iteratively computing the centroids of the clusters within regions determined by explicit delimiters (Equation 7) in order to provide easy and interpretable semantics (i.e.LL, HL, LH, HH).Nonetheless, at each iteration, we keep the computation of cluster covariances unbounded to incorporate information about the correlation landscape provided by the whole variable space (Equation 8).The choice of a binary partition of variables into H/L clusters, sets the maximal number of clusters restricted to 2 m , where m is the number of movement/behavioural variables used.This control over the number of clusters has more benefits than drawbacks when trying to establish semantics from the resulting clustering.
The use of a mixture model is well supported by results from density estimation theory stating that any continuous distribution can be effectively approximated by a mixture of Gaussians, given enough components and properly chosen parameters [43][44][45].
Although the EMbC algorithm might be improved, the underlying philosophy and its main architecture should remain valid.Improvements on the current EMbC implementation could come from exploring other reliability functions and their relative contribution to the final clustering.
The initial assumptions implemented in the EMbC algorithm aim at minimizing biases and sensitivity to initial conditions: (i) each data point is assigned a uniform probability of belonging to each cluster, (ii) the prior mixture distribution is uniform (each cluster starts with the same number of data points), (iii) the starting delimiters are chosen so as to convey the minimum information.A single variable specific parameter σ min limits the minimal resolution at which clusters will be accepted.In practical terms, σ min can be directly related to measurement errors (or maximum resolution) of the variables and will limit the minimum range of variability (i.e.minimum standard deviation) within the clusters obtained.Based on intuitive physical and biological considerations, we set σ min = 0.01 m/s and σ min = 0.087 rad (5 degrees), respectively for velocity and turn.
In the present work we show a bivariate version of the EMbC based on sampling-dependent velocity and turn measurements.However, the algorithm is multivariate and can deal with any type of movement/behavioural variables.For example, one could also use instantaneous speed [46] or accelerometry data [47,48].
Different layouts in the scatter plot emerge depending on the variable frequency distribution shapes and on the underlying mixture prior distribution.Determining the final behavioural semantics is species specific and depends on the variables used and on the resulting scatter layouts.
The algorithm deals very intuitively with data reliability: the larger the uncertainty associated with the values, the smaller the leverage of those values in the clustering.We consider two elementary sources of uncertainty: sampling heterogeneity (for those variables whose reliability depends on the sampling interval), and the measurement error of the devices.Depending on how the movement variables are computed, the associated uncertainty should be considered of one type or the other.In the present work we have computed velocity and turn based on the sampling intervals, and we have considered that the major source of error comes from sampling interval lengths.
Although the EMbC labelling is purely local, a temporal behavioural coherence emerges in the final output.The drawback of not explicitly considering the temporal correlations in the analysis is that, at coarse-scales, behavioural labels may not appear as continuous as desired, and some local labels may be considered neglectable or meaningless by the expert.We suggest considering the EMbC algorithm as a fine-scale behavioural segmentation method, with optional pre/post smoothing alternatives, the latter explicitly taking into account the weights of the labels relative to each cluster and their temporal dependencies in order to generate an aggregated coarse-grained behavioural annotation of the trajectory.
We also suggest the possibility of running the algorithm at the population level (by applying the algorithm on a pooled set of trajectories from a given population) to define average modes, that can be visualized in single trajectories.This approach smooths potential individual variability, which might be of non-interest for certain analyses.
In addition to the examples included in this work, we successfully applied the algorithm to many other trajectories involving different species, from albatrosses to nematodes.We can therefore ensure that the algorithm behaves reasonably well for a wide range of animals and behaviours.However, the degree of matching among the EMbC output and a particular set of behaviours will be highly dependent on the amount of information conveyed by the input variables regarding to these behaviours.Certainly it is possible to use personalized versions of the EMbC algorithm by choosing specific behavioural or physiological empirical data that correlates adequately with the behavioural qualification the data owner is looking for.
Finally, based on a set of synthetic trajectories, we have presented a comparison of the EMbC with similar state of the art algorithms like the EMC and the HMM, showing that the EMbC yields a similar degree of performance.Beyond this quantitative assessment, the aim was to clearly depict the use case of the EMbC.In summary, the method stands out with respect to EMC and HMM as long as we can assume that: (i) a binary discretization of the input feature space can be meaningful (i.e.we can expect some sort of bimodality in the conditional distribution of the variables) and, (ii) there is no certainty that a first order temporal dependence assumption can be undertaken (e.g.heterogeneous and/or large gaps in the time series, prefixed sampling schedulings).Additionally, the EMbC runs straightforwardly, it does not need multiple restarts nor initial parameterization.
With respect to other existing clustering methods (e.g.k-means clustering), it would not make much sense to perform standard comparisons in terms of accuracy, final likelihood or AIC/BIC values, as these are not the ultimate criterions driving the EMbC algorithm.
While a thorough comparison of methods is non-trivial (e.g.definition of performance measures, accounting for the real variability of datasets) and should involve a whole new manuscript, our strategy here is to release a ready-to-use tool (the EMbC R-package) and let the users decide whether and when the EMbC algorithm can be useful for them.
6 Tables Table 1.Empirical datasets.At each iteration step, the most common situation is that the delimiters do not determine a perfect partition of the variable space.We show two typical cases for the bivariate case with delimiters overlapping (left panel) and non-overlapping (right panel).Delimiters are shown as dashed lines (r .L , r L. ) and as dot-dashed lines (r .H , r H. ).Left: the data points in the middle red area may belong either to R LL or R HH , hence they are considered in the computation of both µ LL and µ HH .Right: with non-overlapping delimiters, we can still figure out an overlapping area between regions R LH and R HL by extending the delimiters (middle red area), such that data points in this area are considered in the computation of both µ LH and µ HL .Despite animals are sometimes equipped with special devices to register instantaneous velocity, we refer here to the more general case of estimating velocities by computing differences in successive locations.This computation can lead to unreliable information when time gaps in between locations are too long, but it is at this point when the implementation of uncertainty in the EMbC comes clearly useful.Generally, given two succesive points velocity is computed as, If projected coordinates are provided, then, given (t i , x i , y i ) , (t i+1 , x i+1 , y i+1 ), we can alternatively calculate velocity as, Note that the velocity assigned to any location is calculated with respect to the next location, not the previous one.This is important in order to correctly assign velocities to the limiting points of a period of relocation.

Turn
The variable turn, denoted by α i , is defined as the absolute value of the difference between the absolute direction at location i with respect to the previous direction, that is, Note that turns are never considered to be greater than π.While θ i (absolute direction at location i) is forwardly calculated, α i (turn at location i) is calculated with respect to the previous direction.
A special case is when two succesive points lie on the same coordinates and it is therefore impossible to determine an absolute direction.As explained in Section 8.1.2,we mark up this situation by θ i = 2π, and we assume that the animal was resting in the meantime.Consequently, whenever we have either θ i = 2π or θ i−1 = 2π, we force α i = 0.The latter expresses the idea that, after a period of resting, there is an absolute direction of departure but it makes no sense to compute a turn value.This convention is useful for the distinction of resting and foraging behaviors when velocity values are low.

Synthetic trajectories
Our aim in using synthetic trajectories was not only to numerically assess the performance of the EMbC algorithm, but also to compare it with related algorithms like the EMC and the HMM to clearly depict the gap the EMbC fills among these algorithms.Therefore, we generated a set of synthetic trajectories representing the use case of the EMbC, i.e. under the hypothesis of an underlying bivariate GMM with four components representing the four binary regions of a bivariate binary clustering (LL, LH, HL and HH).
In order to cover the broadest variability in the shape of underlying GMMs, while somehow preserving a match of the Gaussian components with a binary split of the bivariate (speed/turn) space, we generated GMM's by randomly sampling the set of parameters Θ = µ j , Σ j , π j , 1 ≤ j ≤ 4 as follows: The set Θ = µ j , Σ j , π j , 1 ≤ j ≤ 4 of GMM parameters is randomly sampled.In order to get more realistic partitions and consequently more realistic paths, the delimiters are initially set to a fourth of the distance between the means (coloured squares), as measured from the low mean, (r .L , r L. in grey dashed lines, and r .H , r H. in grey dot-dashed lines).The γ parameter defines the width of the boundary arround the delimiters (black dashed lines), that further delimit the binary regions.Data points are sampled from truncated Gaussians based on Θ and limited to the region boundaries.
• the set of priors π j was randomly drawn from a Uniform distribution bounded to the range 0.25 ≤ π j ≤ 0.75 to assure a minimum balancing in the representation of the four clusters.
• the covariance matrices Σ j were build by drawing values from a Beta(2,2) distribution compounded with a bounded Uniform(1,2) distribution for variances and a bounded Uniform(-4,0) for covariances, spanning the values of the variances as above.
We also wanted to have some control over the definition of the resulting clusters.To this aim the delimiters were set at a midpoint between the means of the four Gaussians.Afterwards, we forced a boundary of forbidden values arround the delimiters by means of a parameter γ = {0.01,0.05, 0.1} given as a fraction of the distance between the means (see Figure 13).Note that γ does not determine equal widths for all boundaries nor prevents potential overlapping of the binary regions thus contributing further randomness in the generation of our trajectories set.
The next step was to sample the sequence of states of the trajectory (the expected labels of the resulting clusters), for which we used two different sampling schemmes: 1. a Markov-chain based sampling schemme, that is, we randomly sample an extra set of parameters from a uniform distribution to define a multinomial transition matrix, sampling state labels based on that transition matrix; 2. a mixture prior based sampling schemme, that is, state labels were sampled based only on the prior parameters π j , 1 ≤ j ≤ 4.
Following the sequence of states, velocity/turn pairs were sampled from truncated Gaussians corresponding with the limits of each binary region using a Gibbs sampling algorithm.Also, a sequence of time intervals was randomly sampled from a N (1, 1) * 1800 seconds, i.e. setting trajectory locations at a mean time interval of half an hour.
Finally, given a starting location, the trajectories were generated computing the geolocations derived from the velocity/turn values and the time intervals at each time step.

Performance Measurement
We measured the performance of the algorithms using common measures of multiclass classification based on the confusion matrix.A confusion matrix is a specific table layout to visualize the performance of an algorithm.Columns represent the instances in a predicted class and rows represent the instances in an actual class.A straight measure of performance is accuracy, defined as the number of correctly classified instances (the sum of the values in the diagonal cells) with respect to their total number.But we are also interested in the analysis of performance at the class/cluster level.Measures defined at class level are recall, defined as the fraction of correctly classified instances (true positives) with respect to the total number of instances of that class (true positives plus false negatives), i.e., recall c = trueP ositives trueP ositives + f alseN egatives and precision, defined as the fraction of true positives with respect to the total number of instances predicted for that class (true positives plus false positives), i.e., precision c = trueP ositives trueP ositives + f alseP ositives As none of these is a reliable measure of performance by itself, it is common to use the F-measure, defined as the harmonic mean of recall and precision, i.e., Finally, a specific measure of performance for multiclass classification is the macro average F-measure, defined as the mean of the F-measures of each class, i.e., In contrast to accuracy, which is a measure that assigns equal weight to each instance, the macro average F-measure assigns equal weight to each class.We included both measures in the confusion matrix.The marginal distribution of clusters was also given in the last column of the confusion matrix, labeled as π c , (see Figure 3 in the main text as an example).

Performance Results
We show the performance values of the EMbC, EMC and HMM algorithms on two sets of synthetic trajectories, the first generated using a Markov-chain-based sampling of the states (Table 2), and the second using a prior-based sampling (Table 3).Values shown correspond to the average macro F-measure and RMSE (root mean squared error) for 100 randomly generated trajectories.The EMC and HMM implementations are highly dependent on a seed value that sets the initial conditions, thus we performed ten different runs for each trajectory.Results denoted as EMC.all and HMM.all correspond to the average of all ten runs over all 100 trajectories and those denoted as EMC.best and HMM.best correspond to the average of the single best run for all trajectories.Where specified, the value in parenthesis denote the percentage of different seed trials worked out without exceptions by the HMM algorithm.
are the components of the mean vector µ new j , and σ (r,s),new j (1 ≤ r ≤ m, 1 ≤ s ≤ m) are the variances and covariances of the covariance matrix Σ new j .

Figure 1 .Figure 2 .Figure 3 .
Figure 1.Cluster semantics.Comparison of the EMC (left) and EMbC (right) algorithms.Bivariate (velocity/turn) scatter-plots showing the clustering reached by each algorithm, corresponding to the same trajectory and exact initial conditions.Clusters are shown in different colours.In the right panel, the EMbC delimiters determining the final L/H binary regions are depicted as dashed lines (r .L , r L. ) and dot-dashed lines (r .H , r H. ).The centroids of each cluster are shown as black dots.Left: the EMC yields an output clustering that is difficult to link to a clear semantics.Right: the EMbC is driven by the delimiters, forcing the centroids to lay within the associated binary regions, yielding a final clustering that can be clearly interpreted in terms of L/H values of the variables (orange:LL, red:LH, cyan:HL and blue:HH).The matching among binary regions and clusters is not perfect because data-points are assigned to clusters depending on their weights, not on the delimiter values.In this case, the EMbC performs better (the clustering likelihoods are -3.3368 for the EMC and -3.2180 for the EMbC), but this result can not be generalized.

Figure 5 .Figure 6 .Figure 7 . 6 .Figure 8 .Figure 9 .Figure 10 .
Figure 5. Synthetic trajectory.Simulated trajectory with N = 400 and γ = 0.05.Panels: (a) trajectory grid plot, (b) temporal behaviour profile with output labelling (top), reference labelling (center) and differences between them (bottom), (c) reference velocity/turn scatter plot showing the limits of the binary regions (grey lines), (d) output velocity/turn scatterplot showing the resulting delimiters r .L , r L. (dashed lines) and r .H , r H. (dot-dashed lines), (e) velocity, and (f) turning angle frequency distributions.The turn distribution is depicted separately for low/high values of velocity in light/dark grey colours respectively, to show the existance (or not) of correlation among both variables.Bottom: Confusion matrix and performance measures.The last location of the trajectory is not classified.

Figure 11 .
Figure 11.Procedures for robustness tests.Procedures used in the robustness tests.a) Data-loss: subsampled datasets are generated by assigning a uniform random value 0 ≤ p i ≤ 1 to each data-point and removing all those points with p i < k dl , with 0 ≤ k dl ≤ 1 being the data loss factor, (in this example k dl = 0.2, removed points are marked as black dots).b) Data-inaccuracy: jittered datasets are generated by jittering the data-points using a noise function based on the associated uncertainties (Equations 9 and 10), (we depict some example points with different values of uncertainty u i and with k di = 0.05.)

Figure 12 .
Figure 12.Correction of ∆lon for the computation of the loxodromic distance and direction.

Figure 13 .
Figure13.Generation of synthetic GMMs and the corresponding binary regions.The set Θ = µ j , Σ j , π j , 1 ≤ j ≤ 4 of GMM parameters is randomly sampled.In order to get more realistic partitions and consequently more realistic paths, the delimiters are initially set to a fourth of the distance between the means (coloured squares), as measured from the low mean, (r .L , r L. in grey dashed lines, and r .H , r H. in grey dot-dashed lines).The γ parameter defines the width of the boundary arround the delimiters (black dashed lines), that further delimit the binary regions.Data points are sampled from truncated Gaussians based on Θ and limited to the region boundaries.

F c = 2
recall c precision c recall c + precision c

Table 2 .
EMbC vs. EMC and HMM.Performance measured with GMM-Markov-chain-based trajectories of different sample sizes n and different definition of the binary regions γ.

Table 3 .
EMbC vs. EMC and HMM.Performance measured with GMM-prior-based trajectories of different sample sizes n and different definition of the binary regions γ.