Intrinsic group behaviour: Dependence of pedestrian dyad dynamics on principal social and personal features

Being determined by human social behaviour, pedestrian group dynamics may depend on “intrinsic properties” such as the purpose of the pedestrians, their personal relation, gender, age, and body size. In this work we investigate the dynamical properties of pedestrian dyads (distance, spatial formation and velocity) by analysing a large data set of automatically tracked pedestrian trajectories in an unconstrained “ecological” setting (a shopping mall), whose apparent physical and social group properties have been analysed by three different human coders. We observed that females walk slower and closer than males, that workers walk faster, at a larger distance and more abreast than leisure-oriented people, and that inter-group relation has a strong effect on group structure, with couples walking very close and abreast, colleagues walking at a larger distance, and friends walking more abreast than family members. Pedestrian height (obtained automatically through our tracking system) influences velocity and abreast distance, both growing functions of the average group height. Results regarding pedestrian age show that elderly people walk slowly, while active-age adults walk at the maximum velocity. Groups with children have a strong tendency to walk in a non-abreast formation, with a large distance (despite a low abreast distance). A cross-analysis of the interplay between these intrinsic features, also taking into account the effect of an “extrinsic property” such as crowd density, confirms these major results but also reveals a richer structure. An interesting and unexpected result, for example, is that the velocity of groups with children increases with density, at least in the low-medium density range found under normal conditions in shopping malls. Children also appear to behave differently according to the gender of the parent.


Statistical analysis of observables
In this work we are interested in describing how pedestrian group behaviour is influenced by some intrinsic features, such as purpose, relation, gender, age or height. Each feature (or factor) may be divided into $k$ categories (e.g., in the case of relation $k = 4$ and the categories are colleagues, couples, family and friends). Each group is coded as belonging to a specific category, so that each category has $N_g^k$ groups. As described in Materials and methods, for each group $i \in N_g^k$ we can measure the value of observable $o$ every 500 ms. We may call these measurements $o^k_{i,j}$ with $j = 1, \ldots, n^k_i$ (i.e. we have $n^k_i$ measurements, or events, corresponding to group $i$ in category $k$).
We believe that the largest amount of quantitative information regarding the dependence of group behaviour on intrinsic features is included in the overall probability distribution functions concerning all $N^k = \sum_{i \in N_g^k} n^k_i$ measurements of a given observable, as shown for example in Fig 2 in the main text, since from the analysis of these figures we can read off the probability of having a given value for each observable in each category.
It is nevertheless useful to extract some quantitative information, such as average values and standard deviations, from these distributions. Furthermore, although the purpose of this paper is not to provide a "p value statistical independence label" to each feature, to compare such average values it is customary and useful to compute, along with other statistical indicators such as effect size and determination coefficient, the standard error of each distribution and to perform the related analysis of variance (ANOVA). The computation of these latter statistical quantities is nevertheless based on an assumption of statistical independence of the data, an assumption that clearly does not hold for all our $N^k$ observations¹.

Average values, standard deviations and standard errors
We thus proceed in the following way, justified by having a similar number of observations for each group². For each observable $o$ we compute the average over group $i$

$$O^k_i = \frac{1}{n^k_i} \sum_{j=1}^{n^k_i} o^k_{i,j}, \tag{1}$$

and then provide its average value in the category $k$ as

$$\langle O \rangle_k \pm \varepsilon_k, \tag{2}$$

where $\langle O \rangle_k$ and the standard error $\varepsilon_k$ are given by

$$\langle O \rangle_k = \frac{1}{N_g^k} \sum_{i \in N_g^k} O^k_i, \tag{3}$$

$$\varepsilon_k = \frac{\sigma_k}{\sqrt{N_g^k}}, \tag{4}$$

and the standard deviation is

$$\sigma_k = \sqrt{\frac{1}{N_g^k - 1} \sum_{i \in N_g^k} \left( O^k_i - \langle O \rangle_k \right)^2}. \tag{5}$$

As a rule of thumb, we may say that $o$ assumes a different value between categories $k$ and $j$ if

$$\left| \langle O \rangle_k - \langle O \rangle_j \right| \gg \max(\varepsilon_k, \varepsilon_j). \tag{6}$$

¹ As an extreme case, we can imagine that for a given $k$ we were following a single group ($N_g^k = 1$) for one hour ($n^k_1 = 7200$). We will then have, if we ignore measurement noise, perfect information regarding the behaviour of that group in that hour and, under the strong assumption of time independence in the group behaviour, good statistics about the behaviour of that particular group. We still do not have any information about how group behaviour changes between groups in the category, since that information depends on the number of groups analysed, $N_g^k$. Furthermore, since in general we track a given group only for the few seconds it needs to cross the corridor, the observations $o_{i,j}$ at fixed $i$ are also strongly time correlated.
² An average of 49 observations with a standard deviation of 22 over 1168 groups. We nevertheless exclude from the following analysis groups that provided fewer than 10 observation points.
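The procedure above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `category_stats`, the constant `MIN_OBS` and the sample data are hypothetical, and the use of the sample standard deviation (denominator $N_g^k - 1$) is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

MIN_OBS = 10  # groups with fewer observation points are excluded, as stated in the text


def category_stats(groups):
    """Average, standard deviation and standard error for one category.

    `groups` is a list of lists: one list of measurements o_{i,j} per group i.
    Statistics are computed over group averages, not over single measurements,
    so that each group counts as one (approximately independent) data point.
    """
    group_means = [mean(obs) for obs in groups if len(obs) >= MIN_OBS]
    n_groups = len(group_means)
    average = mean(group_means)
    sigma = stdev(group_means)       # assumption: sample standard deviation
    epsilon = sigma / sqrt(n_groups)  # standard error of the category average
    return average, sigma, epsilon


# Made-up data: three groups with enough observations, one that is too short.
example = [[1.0] * 10, [2.0] * 12, [3.0] * 10, [100.0] * 5]
print(category_stats(example))
```

Averaging inside each group first is what removes the strong time correlation between measurements of the same group from the error estimate.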

Analysis of variance
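This rule of thumb is obviously related to the ANOVA reported in the text. The ANOVA proceeds as follows. We define $n_c$ as the number of categories for a given feature,

$$N = \sum_{k=1}^{n_c} N_g^k \tag{7}$$

as the total number of groups, and the overall average of the observable as

$$\langle O \rangle = \frac{1}{N} \sum_{k=1}^{n_c} N_g^k \langle O \rangle_k. \tag{8}$$

We then define the distance between $\langle O \rangle$ and $\langle O \rangle_k$ as

$$d_k = \langle O \rangle_k - \langle O \rangle \tag{9}$$

and the degrees of freedom

$$\gamma_1 = n_c - 1, \qquad \gamma_2 = N - n_c. \tag{10}$$

The $F$ factor is then defined as

$$F = \frac{\gamma_2 \sum_k N_g^k d_k^2}{\gamma_1 \sum_k (N_g^k - 1) \sigma_k^2}. \tag{11}$$

This result is reported in our tables as $F_{\gamma_1,\gamma_2}$, along with the celebrated p value, which provides the probability, under the hypothesis of independence of data, that the difference between the distributions is due to chance. The $F$ distribution has to be computed numerically, but a value $F \gg 1$ assures a small p value. Let us see how this relates to the rule of thumb for standard errors. Let us assume we have two categories with the same number of groups per category,

$$N_g^1 = N_g^2 = \frac{N}{2}.$$

We clearly have

$$\langle O \rangle = \frac{\langle O \rangle_1 + \langle O \rangle_2}{2}$$

and

$$d_1^2 = d_2^2 = \frac{\left( \langle O \rangle_1 - \langle O \rangle_2 \right)^2}{4}.$$

Using

$$\sigma_k^2 = \frac{N}{2} \varepsilon_k^2$$

we get the expression

$$F = \frac{\left( \langle O \rangle_1 - \langle O \rangle_2 \right)^2}{\varepsilon_1^2 + \varepsilon_2^2},$$

so that the rule of thumb eq. 6 corresponds to having a high $F$ value and thus a low p value.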
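The relation between the $F$ factor and the standard errors can be checked numerically. The sketch below assumes the usual one-way ANOVA definition of $F$ (between-category sum of squares over $\gamma_1$, divided by within-category sum of squares over $\gamma_2$) applied to group averages; the names `f_factor` and `two_category_shortcut` and the sample values are hypothetical. The p value is not computed here; it could be obtained from the survival function of the $F$ distribution, for example `scipy.stats.f.sf(F, gamma1, gamma2)`.

```python
from statistics import mean, variance


def f_factor(categories):
    """One-way ANOVA F factor.

    `categories` is a list of lists of group averages, one inner list per category.
    Returns (F, gamma1, gamma2).
    """
    n_c = len(categories)
    n_total = sum(len(c) for c in categories)
    overall = sum(sum(c) for c in categories) / n_total
    # Between-category sum of squares: sum_k N_g^k d_k^2
    between = sum(len(c) * (mean(c) - overall) ** 2 for c in categories)
    # Within-category sum of squares: sum_k (N_g^k - 1) sigma_k^2
    within = sum((len(c) - 1) * variance(c) for c in categories)
    gamma1, gamma2 = n_c - 1, n_total - n_c
    return (between / gamma1) / (within / gamma2), gamma1, gamma2


def two_category_shortcut(a, b):
    """Squared difference of averages over the sum of squared standard errors.

    Equal to F only for two categories with the same number of groups.
    """
    eps_a_sq = variance(a) / len(a)
    eps_b_sq = variance(b) / len(b)
    return (mean(a) - mean(b)) ** 2 / (eps_a_sq + eps_b_sq)


# Made-up group averages for two categories of equal size.
cat_1 = [1.0, 1.2, 0.8, 1.1]
cat_2 = [1.5, 1.4, 1.7, 1.6]
print(f_factor([cat_1, cat_2])[0], two_category_shortcut(cat_1, cat_2))
```

For categories of unequal size the two numbers differ, which is why the rule of thumb is only a guide and the full $F$ value is reported.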

Coefficient of determination
Eq. 11 says that the $F$ factor is high if the $\sigma_k$ are smaller than the $d_k$, i.e. if the variations within the categories are smaller than those between categories, and if the total number of observations is high. Due to the large number of data points, the $F$ values in the "Statistical analysis of overall probability distributions" sections (where we use all the observable measurements instead of group averages) are always very high, and the corresponding p values very low, but the hypothesis of statistical independence of data underlying the usual interpretation of p is obviously not valid. There are nevertheless some statistical estimators that do not depend dramatically on the number of observations, and that will thus have a similar value whether they are computed using all the data points or using only group averages.
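One such estimator is the coefficient of determination

$$R^2 = \frac{\sum_k N_g^k d_k^2}{\sum_k N_g^k d_k^2 + \sum_k (N_g^k - 1) \sigma_k^2},$$

which can also be computed from the $F$ factor as

$$R^2 = \frac{\gamma_1 F}{\gamma_1 F + \gamma_2},$$

and provides an estimate of how much of the variance in the data is "explained" by the category averages.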
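As a quick consistency check, the coefficient of determination can be computed both directly and from the $F$ factor. The sketch assumes the standard one-way ANOVA definitions, with $R^2$ taken as the between-category sum of squares divided by the total sum of squares; the function names and the sample values are hypothetical.

```python
from statistics import mean, variance


def sums_of_squares(categories):
    """Return (between, within) sums of squares for a list of categories."""
    n_total = sum(len(c) for c in categories)
    overall = sum(sum(c) for c in categories) / n_total
    between = sum(len(c) * (mean(c) - overall) ** 2 for c in categories)
    within = sum((len(c) - 1) * variance(c) for c in categories)
    return between, within


def r2_direct(categories):
    """R^2 as the fraction of the total sum of squares due to category averages."""
    between, within = sums_of_squares(categories)
    return between / (between + within)


def r2_from_f(f, gamma1, gamma2):
    """R^2 recovered from the F factor and its degrees of freedom."""
    return gamma1 * f / (gamma1 * f + gamma2)


# Made-up group averages for three categories.
cats = [[1.0, 1.2, 0.8, 1.1], [1.5, 1.4, 1.7, 1.6], [1.0, 1.3, 1.2, 0.9]]
between, within = sums_of_squares(cats)
gamma1 = len(cats) - 1
gamma2 = sum(len(c) for c in cats) - len(cats)
f = (between / gamma1) / (within / gamma2)
print(r2_direct(cats), r2_from_f(f, gamma1, gamma2))
```

Unlike $F$, the value of $R^2$ does not grow simply because more data points are added, since the number of observations enters both the numerator and the denominator.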

Effect size
The $R^2$ coefficient may attain low values if two or more category distribution functions are very similar, as is usually the case in our work. To point out the presence of at least one distribution that is clearly different from the others we may use the following definition of the effect size $\delta$. We first define

$$\delta_{kl} = \frac{\left| \langle O \rangle_k - \langle O \rangle_l \right|}{\sqrt{\dfrac{(\tilde{n}_k - 1) \sigma_k^2 + (\tilde{n}_l - 1) \sigma_l^2}{\tilde{n}_k + \tilde{n}_l - 2}}},$$

where $\tilde{n}_k$, $\tilde{n}_l$ are the numbers of points used for computing the averages and standard deviations, and then we consider the maximum pairwise effect size

$$\delta = \max_{k,l} \delta_{kl}.$$

While a p value tells us about the significance of the statistical difference between two distributions, the difference may often be so small that it can be verified only if a large amount of data is collected. But if we also have $\delta \approx 1$, then the two distributions are different enough to be distinguished even with a relatively small amount of data.
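A minimal sketch of the maximum pairwise effect size follows. It assumes that the pairwise effect size is the absolute difference of the averages divided by a pooled standard deviation (the common Cohen's d form); the exact pooling used by the authors is not recoverable from the text, so this choice, like the function names and sample data, is an assumption.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def pairwise_effect_size(a, b):
    """Absolute difference of averages in units of the pooled standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (
        len(a) + len(b) - 2
    )
    return abs(mean(a) - mean(b)) / sqrt(pooled_var)


def max_effect_size(categories):
    """Largest pairwise effect size over all pairs of categories."""
    return max(pairwise_effect_size(a, b) for a, b in combinations(categories, 2))


# Made-up data: the third category is clearly shifted with respect to the first.
cats = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [4.0, 5.0, 6.0]]
print(max_effect_size(cats))
```

Taking the maximum over pairs highlights a single category that stands apart, even when the remaining categories overlap and keep $R^2$ low.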

Multi-factor cross analysis
We refrain from applying the machinery of two-way or n-way ANOVA to our data, since our ecological data set is extremely unbalanced, and it is unbalanced for the very reason that our "factors" are not independent variables. It is nevertheless useful to analyse the interplay between the different features, and we do that in the "Accounting for other effects" appendix by performing, for a given feature A, a statistical analysis similar to the one described above while keeping the value of another feature B fixed to a category $k$. Sometimes this analysis is performed on a reduced number of groups, and thus the corresponding p value may be high. This does not imply that the analysis is valueless, at least in our opinion, since it provides new information. The $F$ and p values are, in this situation, useful to compare different observables under the given condition. As an example, table 20 in appendix S3 tells us that $x$ has a stronger variation between relation categories for fixed gender than $r$, and so on. Furthermore, in these situations, an analysis of statistical indicators that do not depend critically on the number of observations, such as the effect size, is particularly valuable.
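The conditional analysis described above amounts to filtering the groups by the category of feature B and then repeating the one-way analysis over the categories of feature A. The sketch below illustrates this under stated assumptions: groups are represented as dictionaries, the keys `"relation"`, `"gender"` and `"x"`, the category labels and the numerical values are all hypothetical, and categories left with fewer than two groups are dropped because their standard deviation is undefined.

```python
from collections import defaultdict
from statistics import mean, variance


def f_factor(categories):
    """One-way ANOVA F factor over lists of group averages; returns (F, gamma1, gamma2)."""
    n_c = len(categories)
    n_total = sum(len(c) for c in categories)
    overall = sum(sum(c) for c in categories) / n_total
    between = sum(len(c) * (mean(c) - overall) ** 2 for c in categories)
    within = sum((len(c) - 1) * variance(c) for c in categories)
    gamma1, gamma2 = n_c - 1, n_total - n_c
    return (between / gamma1) / (within / gamma2), gamma1, gamma2


def conditional_f(groups, feature_a, feature_b, value_b, observable):
    """F factor of `observable` across categories of feature A at fixed feature B.

    Returns None when fewer than two categories of A have at least two groups.
    """
    by_category = defaultdict(list)
    for group in groups:
        if group[feature_b] == value_b:
            by_category[group[feature_a]].append(group[observable])
    samples = [values for values in by_category.values() if len(values) >= 2]
    if len(samples) < 2:
        return None
    return f_factor(samples)


# Made-up group averages of an observable "x", for illustration only.
groups = [
    {"relation": "friends", "gender": "two females", "x": 0.62},
    {"relation": "friends", "gender": "two females", "x": 0.66},
    {"relation": "family", "gender": "two females", "x": 0.55},
    {"relation": "family", "gender": "two females", "x": 0.59},
    {"relation": "couples", "gender": "mixed", "x": 0.50},
]
print(conditional_f(groups, "relation", "gender", "two females", "x"))
```

Running the same function for several observables under the same condition gives $F$ values that can be compared with each other, which is the use suggested in the text.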