F-Formation Detection: Individuating Free-Standing Conversational Groups in Images

Detection of groups of interacting people is a very interesting and useful task in many modern technologies, with application fields spanning from video-surveillance to social robotics. In this paper we first provide a rigorous definition of group grounded in the social sciences: this allows us to specify many kinds of group so far neglected in the Computer Vision literature. On top of this taxonomy we present a detailed state of the art on group detection algorithms. Then, as a main contribution, we present a novel method for the automatic detection of groups in still images, which is based on a graph-cuts framework for clustering individuals; in particular, we are able to codify in a computational sense the sociological definition of F-formation, which is very useful for encoding a group when only proxemic information is available: position and orientation of people. We call the proposed method Graph-Cuts for F-formation (GCFF). We show that GCFF outperforms all the state-of-the-art methods in terms of different accuracy measures (some of them novel), while also demonstrating strong robustness to noise and versatility in recognizing groups of various cardinality.


Introduction
After years of research on the automated analysis of individuals, the computer vision community has shifted its attention to the new issue of modeling gatherings of people, commonly referred to as groups [23,17,16,24]. [...] called social signals [63], among which positional and orientational forms play a crucial role (cf. also [26], p. 11). In fact, the spatial position and orientation of people define one of the most important proxemic notions describing a free-standing conversational group (FCG), namely Adam Kendon's Facing Formation, mostly known as F-formation.
In Kendon's terms [39,14,38], an F-formation is a socio-spatial formation in which people have established and maintain a convex space (called o-space) to which everybody in the gathering has direct, easy and equal access. Typically, people arrange themselves in the form of a circle, ellipse, horseshoe, side-by-side or L-shape (cf. Fig. 4), so that they can have easy and preferential access to one another while excluding distractions of the outside world with their backs. Examples of F-formations are reported in Fig. 1. In computer vision, spatial position and orientational information can be automatically extracted, and this paves the way to the computational modeling of F-formations and, as a consequence, of FCGs.
Detecting free-standing conversational groups is useful in many contexts. In video-surveillance, automatically understanding the network of social relationships observed in an ecological scenario may prove beneficial for advanced suspect profiling, improving and automating SPOT (Screening Passengers by Observation Technique) protocols [18], which nowadays are performed solely by human operators.
A robust FCG detector may also impact the social robotics field, where the approaches implemented so far work with only a few people, usually focusing on a single F-formation [32,65,51].
Efficient identification of FCGs could be of use in multimedia applications, and especially in semantic tagging [20,46], where groups of people are currently inferred by the proximity of their faces in the image plane. Adopting systems for 3D pose estimation from 2D images [3] plus an FCG detector could in principle lead to more robust estimations. In this scenario, the extraction of social relationships could help in inferring personality traits [54,27] and triggering friendship invitation mechanisms [40].
In computer-supported cooperative work (CSCW), being capable of automatically detecting FCGs could be a step ahead in understanding how computer systems can support socialization and collaborative activities, e.g., [60,50,47,2]; in this case, FCGs are usually found by hand or by employing wearable sensors.
Manual detection of FCGs also occurs in human-computer interaction, for the design of devices reacting to a situational change [29,55]: here the automation of the detection process may lead to a genuinely systematic study of how proxemic factors shape the usability of the device.
The last three years have seen works that automatically detect F-formations: Bazzani et al. [6] first proposed the use of positional and orientational information to capture Steady Conversational Groups (SCG); Cristani et al. [16] designed a sampling technique to seek F-formation centres by performing a greedy maximization in a Hough voting space; Hung and Kröse [31] detected F-formations by finding distinct maximal cliques in weighted graphs via graph-theoretic clustering; the latter two techniques were compared by Setti et al. [57]. A multi-scale extension of the Hough-based approach [16] was proposed by Setti et al. [58]. This improved on previous works by explicitly modeling F-formations of different cardinalities. Tran et al. [62] followed the graph-based approach of [31], extending it to deal with video sequences and recognizing five kinds of activities.
Our proposed approach detects an arbitrary number of F-formations in single images from a monocular camera, taking as input the position of people on the ground floor and their orientation, captured as the head and/or body pose. The approach is iterative, and starts by assuming an arbitrarily high number of F-formations: after that, a hill-climbing optimisation alternates between assigning individuals to F-formations using the efficient graph-cut based optimisation [41], and updating the centres of the F-formations, pruning unsupported groups in accordance with a Minimum Description Length prior. The iterations continue until convergence, which is guaranteed.
As a second contribution, we present a novel set of metrics for group detection. These are not restricted to FCGs, but apply to any set of people considered as a whole, thus embracing generic group or crowd tracking scenarios [62].
The fundamental idea is the concept of tolerance threshold, which regulates how strictly a group must be identified, allowing some of its members to be missed or some external people to be added to it. Thanks to the tolerance threshold, the concepts of tolerant match, tolerant accuracy, and precision and recall can be easily derived. Such measures take inspiration from the group match definition, first published in a previous work [16] and adopted in many recent group detection [62,58] and group tracking methods [7]: in practice, it corresponds to fixing the tolerance threshold to a predefined value.
In this article, we show that, by letting the tolerance threshold change in a continuous way from maximum to minimum tolerance, it is possible to get an informative and compact measure (in the form of area under the curve) that summarises the behaviour of a given detection methodology. In addition, the tolerant match can be applied specifically to groups of a given cardinality, yielding specific values of accuracy, precision and recall; this highlights the performance of a given approach in a specific scenario, that is, its ability to capture small or large groups of people.
In the experiments, we apply GCFF to all publicly available datasets (see Fig. 2), consisting of more than 2000 different F-formations over 1024 frames. Comparing against the five most accurate methods in the literature, we obtain the best score on every dataset. In addition, using our novel metrics, we show that GCFF has the best behaviour in terms of robustness to noise, and that it is able to capture groups of different cardinalities without changing any threshold.
Summarising, the main contributions of this article are the following:
• A novel methodology to detect F-formations from single images acquired by a monocular camera, which operates on positional and orientational information of the individuals in the scene. Unlike previous approaches, our novel methodology is a direct formulation of the sociological principles (proximity, orientation and ease of access) concerning o-spaces. The strong conceptual simplicity and clarity of our approach is an asset in two important ways. First, we do not require bespoke optimisation techniques, and we make use of established methods known to work reliably and efficiently. Second, and by far more important, the high accuracy and clarity of our approach, along with its basis in sociological principles, makes it well suited for use in the social sciences as a means of automatically annotating data.
• A rigorous taxonomy of the group entity, which draws on the social sciences and illustrates all the different group manifestations, delineating their main characteristics, in order to go beyond the generic term of group, often misused in the computer vision community.
• A novel set of metrics for group detection, which for the first time model the fact that a group could be partially captured, with some people missing or erroneously included, through the concept of tolerant match. The metrics can be applied to any approach involving groups (group tracking included).
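To make the idea of a tolerant match concrete, the following is a minimal sketch rather than the formal definition of the metric. It assumes that a detected group matches an annotated group of size |G| when at least ⌈T·|G|⌉ of its members are recovered and no more than |G| − ⌈T·|G|⌉ outsiders are included. The symmetric bound on outsiders, the greedy one-to-one matching, and the names `tolerant_match` and `precision_recall` are all assumptions of this illustration.

```python
from fractions import Fraction
from math import ceil

def tolerant_match(detected, truth, t):
    """True if `detected` matches `truth` under tolerance threshold t in (0, 1]."""
    needed = ceil(t * len(truth))      # members that must be recovered
    allowed = len(truth) - needed      # outsiders tolerated (assumed symmetric)
    hits = len(detected & truth)
    extras = len(detected - truth)
    return hits >= needed and extras <= allowed

def precision_recall(detections, ground_truth, t):
    """Greedy one-to-one matching of detected groups to annotated groups."""
    unmatched = list(ground_truth)
    tp = 0
    for det in detections:
        for gt in unmatched:
            if tolerant_match(det, gt, t):
                unmatched.remove(gt)   # each annotated group is matched once
                tp += 1
                break
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Groups are sets of person identifiers; exact fractions avoid rounding issues.
truth = [{1, 2, 3}, {4, 5}]
found = [{1, 2}, {4, 5, 6}]
strict = precision_recall(found, truth, Fraction(1))     # no tolerance
loose = precision_recall(found, truth, Fraction(1, 2))   # high tolerance
```

Sweeping the threshold between these two extremes and integrating the resulting scores would give an area-under-the-curve summary of the kind described above.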
The remainder of the paper is organised as follows: the next section presents a literature review of group modeling, with particular emphasis on the terminology adopted, which is imported from the social and cognitive sciences; the proposed GCFF approach, together with its sociological grounding, is presented afterwards, followed by an extensive experimental evaluation. Finally, we draw conclusions and outline future perspectives.

Literature Review
Research on group modeling in computer science is highly multidisciplinary, necessarily encompassing the social and the cognitive sciences when it comes to analysing human interaction. In this multifaceted scenario, characterising the works most related to our approach requires us to distinguish between related sociological concepts, starting with the Goffmanian [25] notions of (a) "group" vs. "gathering", (b) "social occasion" vs. "social situation", (c) "unfocused" vs. "focused" interaction, and (d) Kendon's [37] specification concerning "common focused" vs. "jointly focused" encounters.
As mentioned in the introduction, groups entail some durable membership and organisation, whereas gatherings consist of any set of two or more individuals in mutual immediate presence at a given moment. When people are co-present, they tend to behave as participants in a social occasion, and the latter provides the structural social context, the general "scheme" or "frame" of behaviour (like a party, a conference dinner, a picnic, an evening at the theatre, a night in the club, an afternoon at the stadium, a walk together, a day at the office, etc.) within which gatherings (may) develop, dissolve and redevelop in diverse and always different situational social contexts (or social situations, that is, e.g., that specific party, dinner, picnic, etc.) [26].
Unfocused interaction occurs whenever individuals find themselves by circumstance in the immediate presence of others, for instance when forming a queue or crossing the street at a traffic light junction. On such occasions, simply by virtue of the reciprocal presence, some form of interpersonal communication must take place regardless of individual intent. Conversely, focused interaction occurs whenever two or more individuals willingly agree (although such an agreement is rarely verbalised) to sustain for a time a single focus of cognitive and visual attention [25]. Focused gatherings can be further distinguished into common focused and jointly focused ones [37]. The latter entail the sense of a mutual, instead of merely common, activity; a preferential openness to interpersonal communication, an openness one does not necessarily find among strangers at the theatre, for instance; in other words, a special communication license, as in a conversation, a board game, or a joint task carried out by a group of face-to-face interacting collaborators. Participation, in other words, is not at all peripheral but engaged; people are, and display themselves to be, mutually involved [26]. All this can exclude from the gathering others who are present in the situation, as in any FCG at a coffee break with respect to the other ones.
Finally, we should consider the static/dynamic axis concerning the degree of freedom and flexibility of the spatial, positional, and orientational organisation of gatherings. Sometimes, indeed, people maintain approximately their positions for an extended period of time within fixed physical boundaries (e.g., during a meeting); sometimes they move within a delimited area (e.g., at a party); and sometimes they do so within a more or less unconstrained space (for instance, people conversing while walking in the street). This is a continuum, in which we can analytically identify thresholds. Tab. 1 lists some categorised examples of gatherings, considering the taxonomy axis "static/dynamic organisation" and the "unfocused/common-focused/jointly-focused interaction" one; see also Fig. 3.
Table 1: Gatherings categorisation on the basis of focus of attention and spatio-proxemic freedom, exemplified by typical social settings/situations.
Figure 3: Examples of gatherings categorised by focus of attention and spatio-proxemic freedom. Jointly focused, dynamic: our case, FCGs at a cocktail party; common focused, dynamic: a parading platoon; unfocused, dynamic: a queue at the airport; jointly focused, static: a meeting; common focused, static: people in a theatre stand; unfocused, static: persons in a waiting room.
Within this taxonomy, our interest is in gatherings formed by people jointly focused on interacting in a quasi-static fashion within a dynamic scenario. Kendon dubbed this scenario as characterising free-standing conversational groups, highlighting their spontaneous aggregation/disaggregation nature, implying that their members are jointly focused, and specifying their mainly-static proxemic layout within a dynamic proxemic context.
The following review centres on the case of FCGs and their formation. For the other cases we refer the reader, with respect to computer vision, to [1] for generic human activity analysis, including single individuals, groups and crowds, and to [33] for a specific survey on crowds; with respect to the sociological literature, to [26] for unfocused gatherings, to [37,56] for common focused ones, and to [36,49] for crowds in particular.
The analysis of focused gatherings in computer science first appeared in the fields of human-computer interaction and robotics, especially for what concerns context-aware computing, computer-supported cooperative work and social robotics [29,55,5,35]. This happened because the detection of focused gatherings requires finer feature analysis, and in particular body posture inference in addition to the extraction of positional cues: these are difficult tasks for traditional computer vision scenarios, where people are captured at low resolution, under diverse illumination conditions, and are often partially or completely occluded.
In human-computer interaction, F-formation analysis encompasses context-aware computing, by considering spatial relationships among people where space factors become crucial in the design of applications for devices reacting to a situational change [29,55]. In particular, Ballendat et al. [5] studied how proxemic interaction is expressive when considering cues like position, identity, movement, and orientation. They found that these cues can mediate the simultaneous interaction of multiple people as an F-formation, interpreting and exploiting people's directed attention to other people. So far, the challenge with these applications for researchers has been the hardware design, while the social dynamics are typically not explored. As a notable exception, Jungman et al. [35] studied how different kinds of F-formations (L-shaped vs. face-to-face) identify different kinds of interaction: in particular, they examined whether Kendon's observation, according to which face-to-face configurations are preferred for competitive interactions whereas L-shaped configurations are associated with cooperative interactions, holds in gaming situations. The results partially supported the thesis.
In computer-supported cooperative work, Suzuki and Kato [60] described how different phases of collaborative working were locally and tacitly initiated, accomplished and closed by children by moving back and forth between standing face-to-face formations and sitting screen-facing formations. Morrison et al. [50] studied the impact of the adoption of electronic patient records on the structure of F-formations during hospital ward rounds. Marshall et al. [47] analysed through F-formations the social interactions between visitors and staff in a tourist information centre, describing how the physical structures in the space encouraged and discouraged particular kinds of interactions, and discussing how F-formations might be used to think about augmenting physical spaces. Finally, Akpan et al. [2], for the first time, explored the influence of both physical space and social context (or place) on the way people engage through F-formations with a public interactive display. The main finding is that social properties are more important than merely spatial ones: a conducive social context could overcome a poor physical space and encourage grouping for interaction; conversely, an inappropriate social context could inhibit interaction in spaces that might normally facilitate mutual involvement. In these works, no automatic F-formation detection was applied: positional and orientational information was analysed by hand, while our method is fully automated.
In social robotics, Nieuwenhuisen and Behnke presented Robotinho [51], a robotic tour guide which mimics the behaviour of human tour guides and leads people to exhibits in a museum, asking people to come closer and arrange themselves in an F-formation, so that it can attend to the visitors adequately while explaining an exhibit. Robotinho detects people by first detecting their faces, and by using laser-range measurements to detect legs and trunks. Given this, it is not clear how proper F-formations are recognised. Robotinho essentially improves on what has been done by Yousuf et al. [65], who developed a robot that simply detects when an F-formation is satisfied before explaining an exhibit. In this case, F-formations were detected automatically, using advanced sensors (range cameras, etc.), with the possibility of checking just one formation. In our case, a single monocular camera is adopted and the number of F-formations is not bounded.
In computer vision, Groh et al. [28] proposed to use the relative shoulder orientations and distances (using wearable sensors) between each pair of people as a feature vector for training a binary classifier, learning which pairwise configurations of people belong to an FCG and which do not. Strangely, the authors discouraged large FCGs during the data acquisition, introducing a bias on their cardinality. In our proposal, no markers or positional devices are required, and entire FCGs of arbitrary cardinality are found (not pairwise associations only). In his previous work [6], one of the authors started to analyse F-formations by checking the intersection of the view frustums of neighbouring people, where the view frustum was automatically detected by inferring the head orientation of each single individual in the scene. From a sociological perspective, the head orientation cue can be exploited as an approximation of a person's focus of visual and cognitive attention, which in turn acts as an indication of the body orientation and the foot position, the last one considered the most proper way to detect F-formations. Hung and Kröse [31] proposed to consider an F-formation as a dominant-set cluster [52] of an edge-weighted graph, where each node in the graph is a person, and the edges between them measure the affinity between pairs. Such maximal cliques have been defined by Pavan and Pelillo as dominant sets [52], for which a game-theoretic approach has been designed to solve the clustering problem under these constraints. More recently, Tran et al. [62] applied a similar graph-based approach for finding groups, which were subsequently analysed by a specific descriptor that encodes people's mutual poses and their movements within the group gathering for activity recognition.
In all these three approaches, the common underlying idea is to find sets of pairs of individuals with similar mutual pose and orientation, thus considering pairwise proxemic relations as basic elements. This is a weakness, since in practice it tends to find circular formations (that is, cliques with compact structures), while FCGs have other common layouts (side-by-side, L-shape, etc.). In our case, all kinds of F-formations can be found. In addition, the definition of F-formation requires that no obstacle invade the o-space (the convex space surrounded by the group members, see Fig. 1a): whereas in the above-mentioned approaches such a condition is not explicitly taken into account, it is a key element in GCFF.
In this sense, GCFF shares more similarities with the work of Cristani et al. [16], where F-formations were found by considering as atomic entity the state of a single person: each individual projects a set of samples in the floor space, which vote for different o-space centres, depending on his or her position and orientation. Votes are then accumulated in a proper Hough space, where a greedy minimization finds the subset of people voting for the same o-space centre, which in turn is free of obstacles. Setti et al. [57] compared the Hough-based approach with the graph-based strategy of Hung and Kröse [31], finding that the former performs better, especially in the presence of high noise. The study was also aimed at analysing how important positional and orientational information are: it turned out that, when only positional information is available, the performance of the Hough-based approach decreases strongly, while graph-based approaches are more robust. Another voting-based approach resembling the Hough-based strategy has been designed by Gan et al. [21], who identified a global interaction space as the overlap area of different individual interaction spaces, that is, conic areas aligned coherently with the body orientations of the interactants (detected using a Kinect device). Subsequently, the Hough-based approach has been extended to deal with groups of diverse cardinalities by Setti et al. [58], who adopted a multi-scale Hough space and set the best performance so far.

Method
Our approach is strongly based on the formal definition of F-formation given by Kendon [38] (page 209): "An F-formation arises whenever two or more people sustain a spatial and orientational relationship in which the space between them is one to which they have equal, direct, and exclusive access."
In particular, an F-formation is the proper organisation of three social spaces: o-space, p-space and r-space (see Fig. 4a). The o-space is a convex empty space surrounded by the people involved in a social interaction, towards which every participant is oriented, and in which no external person is allowed to stand. In more detail, the o-space is determined by the overlap of those regions dubbed transactional segments, where a transactional segment is the area in front of the body that can be reached easily, and where hearing and sight are most effective [13]. In practice, in an F-formation, the transactional segment of a person coincides with the o-space, and this fact has been exploited in our algorithm. The p-space is the belt of space enveloping the o-space, where only the bodies of the F-formation participants (as well as some of their belongings) are placed. People in the p-space participate in an F-formation using the o-space to transmit their messages. The r-space is the space enveloping the o- and p-spaces, and is also monitored by the F-formation participants. People joining or leaving a given F-formation mark their arrival as well as their departure by engaging in special behaviours displayed in a special order in special portions of the r-space, depending on several factors (context, culture and personality, among others); therefore, we prefer to avoid the analysis of such complex dynamics here, leaving their computational analysis as future work.
F-formations can be organised in different arrangements, that is, spatial and orientational layouts (see Fig. 4a-d) [15,14,38]. In F-formations of two individuals, we usually have a vis-a-vis arrangement, in which the two participants stand and face one another directly; another situation is the L-arrangement, in which two people stand at a right angle to each other. As studied by Kendon [38], vis-a-vis configurations are preferred for competitive interactions, whereas L-shaped configurations are associated with cooperative interactions. In a side-by-side arrangement, people stand close together, both facing the same way; this situation occurs frequently when people stand at the edges of a setting against walls. Circular arrangements, finally, hold when F-formations are composed of more than two people; other than being circular, they can assume an approximately linear, semicircular, or rectangular shape.
GCFF finds the o-space of an F-formation, assigning to it those individuals whose transactional segments overlap, without focusing on a particular arrangement. Given the position of an individual, to identify the transactional segment we exploit orientational information, which may come from the head orientation, the shoulder orientation or the feet layout, in increasing order of reliability [38]. The idea is that the feet layout of a subject indicates the mean direction along which his or her messages should be delivered, while the subject is still free to rotate the head, and to some extent the shoulders, through a considerable arc before having to turn the lower body as well. The problem is that feet are almost impossible to detect automatically, due to frequent (self-)occlusions; estimating shoulder orientation is also complicated, since most body pose estimation approaches work on 2D data and do not handle self-occlusions. However, since any sustained head orientation in a given direction is usually associated with a reorientation of the lower body (so that the direction of the transactional segment again coincides with the direction in which the face is oriented [38]), head orientation can be considered appropriate for detecting transactional segments and, as a consequence, the o-space of an F-formation. In this work, we assume that both positional information and head orientation are available as input; this assumption is reasonable given the wide availability of robust tracking technologies [8] and head orientation algorithms [59,4,12].
In addition to this, we consider soft exclusion constraints: in an o-space, F-formation participants should have equal, direct and exclusive access. In other words, if person $i$ stands between another person $j$ and an o-space centre $O_g$ of the F-formation $g$, this should prevent $j$ from focusing on the o-space and, as a consequence, from being part of the related F-formation.
In what follows, we formally define the objective function accounting for positional, orientational and exclusion constraints, and show how it can be optimised. Fig. 5 gives a graphical idea of the problem formulation.

Objective Function
We use $P_i = [x_i, y_i, \theta_i]$ to represent the position $(x_i, y_i)$ and head orientation $\theta_i$ of the individual $i \in \{1, \ldots, n\}$ in the scene. Let $TS_i$ be the a priori distribution which models the transactional segment of individual $i$. As we explained in the previous section, this segment is coherent with the position and orientation of the head, so we can assume $TS_i \sim \mathcal{N}(\mu_i, \Sigma_i)$, where $\mu_i = [x_{\mu_i}, y_{\mu_i}] = [x_i + D \cos\theta_i, y_i + D \sin\theta_i]$, $\Sigma_i = \sigma \cdot I$ with $I$ the 2D identity matrix, and $D$ is the distance between the individual $i$ and the centre of his or her transactional segment (hereafter called stride). The stride parameter $D$ can be learned by cross-validation, or fixed a priori accounting for social facts. In practice, we assume that the transactional segment of a person has a circular shape, which can be thought of as superimposed on the o-space of the F-formation he or she may be part of.
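As a small illustration of the definition of $\mu_i$, the sketch below computes the centre of a transactional segment from a position, a head orientation and a stride. It assumes that $\theta_i$ is given in radians and measured counter-clockwise from the x axis, as the cosine/sine form suggests; the function name `transactional_segment_centre` is chosen only for this example.

```python
import math

def transactional_segment_centre(x, y, theta, stride):
    """Return mu_i = [x + D cos(theta), y + D sin(theta)].

    theta is assumed to be in radians, measured from the x axis;
    stride is the distance D between the person and the segment centre.
    """
    return (x + stride * math.cos(theta), y + stride * math.sin(theta))

# A person at the origin facing along the positive y axis, with D = 0.5.
mu = transactional_segment_centre(0.0, 0.0, math.pi / 2, 0.5)
```

Two people facing each other at distance $2D$ obtain the same $\mu$, which is what allows them to share a single o-space centre.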
$O_g = [u_g, v_g]$ indicates the position of a candidate o-space centre for F-formation $g \in \{1, \ldots, M\}$, while we use $G_i$ to refer to the F-formation containing individual $i$, considering the F-formation assignment $G_i = g$ for some $g$. The assignment assumes that each individual $i$ may belong to a single F-formation $g$ only at any given time, which is reasonable when we focus on a single instant, that is, an image. The definition of $O_{G_i} = [u_{G_i}, v_{G_i}]$ follows naturally: it represents the position of a candidate o-space centre for an unknown F-formation $G_i = g$ containing $i$.
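At this point, we define the likelihood of a candidate o-space centre location $[u, v]$ given the transactional segment centre $\mu_i$ of an individual $i$; under the Gaussian model above, it decays exponentially with the squared Euclidean distance between $[u, v]$ and $\mu_i$. Hence, the probability that an individual $i$ shares an o-space centre $O_{G_i}$ decays in the same way with the squared distance between $O_{G_i}$ and $\mu_i$, and the posterior probability of any overall assignment is proportional to the product of these terms over all individuals $i \in \{1, \ldots, n\}$. Here $C$ is the random variable which models a possible joint location of all the o-space centres, $O_G$ is one instance of this joint location, and $TS$ is the position of all the transactional segments of the individuals in the scene.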
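Under the isotropic Gaussian model of the transactional segment, these three quantities can be written out as follows. This is a sketch derived from the stated definitions, not a verbatim restatement: the scale $s > 0$ in the exponent is a placeholder introduced here (for a Gaussian with covariance $\sigma \cdot I$ one has $s = 2\sigma$), and normalising constants are dropped.

```latex
\begin{align*}
\Pr(C = [u, v] \mid TS_i) &\propto \exp\left(-\frac{\lVert [u, v] - \mu_i \rVert_2^2}{s}\right)
  = \exp\left(-\frac{(u - x_{\mu_i})^2 + (v - y_{\mu_i})^2}{s}\right), \\
\Pr(O_{G_i} \mid TS_i) &\propto \exp\left(-\frac{(u_{G_i} - x_{\mu_i})^2 + (v_{G_i} - y_{\mu_i})^2}{s}\right), \\
\Pr(C = O_G \mid TS) &\propto \prod_{i=1}^{n} \exp\left(-\frac{(u_{G_i} - x_{\mu_i})^2 + (v_{G_i} - y_{\mu_i})^2}{s}\right).
\end{align*}
```

The product form in the last line assumes that individuals contribute independently given their assignments.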
Clearly, if the number of o-space centres is unconstrained, the maximum a posteriori (MAP) solution occurs when each individual has his or her own separate o-space centre, generating a spurious F-formation formed by a single individual, that is, $O_{G_i} = TS_i$. To prevent this from happening, we associate a minimum description length (MDL) prior with the number of o-space centres used. This prior takes the same form as dictated by the Akaike Information Criterion (AIC) [10], linearly penalising the log-likelihood by the number of models used, $|O_G|$, that is, the number of distinct F-formations.
To find the MAP solution, we take the negative log-likelihood and, discarding normalising constants, obtain an objective $J(\cdot)$ in standard form: the sum over all individuals of the squared distance between $\mu_i$ and the assigned o-space centre $O_{G_i}$, plus a penalty proportional to $|O_G|$. As such, this can be seen as optimising a least-squares error combined with an MDL prior. In principle this could be optimised using a standard technique such as k-means clustering combined with a brute-force search over all possible choices of k to optimise the MDL cost. In practice, k-means frequently gets stuck in local optima, and instead we make use of the graph-cut based optimisation described in [41] and widely used in computer vision [9,45,11,64].
In short, we start from an abundance of possible o-space centres, and then we use a hill-climbing optimisation that alternates between assigning individuals to o-space centres using the efficient graph-cut based optimisation [41], which directly minimises the cost $J$ (6), and minimising the least-squares component by updating each o-space centre $O_g$ to the mean of the transactional segment centres $\mu_i$ of all the individuals $i$ currently assigned to F-formation $g$. The whole process is iterated until convergence. This approach is similar to the standard k-means algorithm, sharing both the assignment and the averaging step. However, as the graph-cut algorithm selects the number of clusters, we can avoid local minima by initialising with an excess of model proposals. In practice, we start from the previously mentioned trivial solution in which each individual is associated with his or her own o-space centre, centred on his or her position.
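The step from the MDL-penalised posterior to the objective can be sketched as below. The scale $s > 0$ is a placeholder symbol for the constant determined by $\sigma$, and the unit weight on $|O_G|$ in the prior is an assumption made here to mirror the AIC form; both are illustrative rather than taken from the text.

```latex
\begin{align*}
\Pr(C = O_G \mid TS) &\propto \prod_{i=1}^{n} \exp\left(-\frac{(u_{G_i} - x_{\mu_i})^2 + (v_{G_i} - y_{\mu_i})^2}{s}\right) \cdot \exp\left(-|O_G|\right), \\
-\log \Pr(C = O_G \mid TS) &= \frac{1}{s} \sum_{i=1}^{n} \left[(u_{G_i} - x_{\mu_i})^2 + (v_{G_i} - y_{\mu_i})^2\right] + |O_G| + \text{const}, \\
J(O_G \mid TS) &= \sum_{i=1}^{n} \left[(u_{G_i} - x_{\mu_i})^2 + (v_{G_i} - y_{\mu_i})^2\right] + s\,|O_G|.
\end{align*}
```

The last line follows by multiplying the negative log-posterior by $s$ and dropping the constant, which leaves the minimiser unchanged. Under this reading, merging two individuals into one F-formation pays off only when the increase in squared distance is smaller than the penalty saved by removing one centre.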

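Algorithm 1 Finding shared focal centres
Initialise with $O_{G_i} = TS_i$ for all $i \in \{1, \ldots, n\}$
repeat: assign individuals to o-space centres with graph-cuts [41]; update each $O_g$ to the mean of the $\mu_i$ of its members
until convergence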
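The alternation between assignment and centre update can be illustrated with the following minimal sketch. It is not the graph-cut optimisation of [41]: the assignment step is replaced here by a simple greedy reassignment that accounts for the label cost, chosen only to keep the example short and self-contained. The function name `cluster_o_spaces`, the parameter `mdl` (the penalty paid per active o-space centre) and the initialisation of each centre at $\mu_i$ are assumptions of this illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 2D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def cluster_o_spaces(mu, mdl, max_sweeps=100):
    """Hill-climb on J = sum_i ||mu_i - O_{G_i}||^2 + mdl * (number of centres).

    mu  : list of transactional segment centres, one (x, y) per person
    mdl : hypothetical name for the cost of keeping one o-space centre
    """
    n = len(mu)
    labels = list(range(n))                    # trivial start: one centre each
    centres = {g: mu[g] for g in range(n)}
    for _ in range(max_sweeps):
        changed = False
        # Assignment step: greedy stand-in for graph-cuts with label costs.
        for i in range(n):
            a = labels[i]
            alone = labels.count(a) == 1       # leaving would remove centre a
            best, best_delta = a, 0.0
            for b, c in centres.items():
                if b == a:
                    continue
                delta = sq_dist(mu[i], c) - sq_dist(mu[i], centres[a])
                if alone:
                    delta -= mdl               # one centre fewer to pay for
                if delta < best_delta:
                    best, best_delta = b, delta
            if best != a:
                labels[i] = best
                if alone:
                    del centres[a]             # prune the unsupported centre
                changed = True
        # Update step: move every active centre to the mean of its members.
        for g in centres:
            members = [mu[i] for i in range(n) if labels[i] == g]
            centres[g] = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))
        if not changed:
            break
    return labels, centres

# Two pairs of people whose transactional segment centres nearly coincide.
points = [(0.0, 0.0), (0.2, 0.0), (10.0, 10.0), (10.2, 10.0)]
labels, centres = cluster_o_spaces(points, mdl=1.0)
```

Every accepted move lowers the cost for fixed centres, and the mean update cannot raise it, so the sketch terminates; unlike graph-cuts, however, it changes one label at a time and may stop in poorer local optima.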

Visibility constraints
Finally, we add the natural constraint that people can only join an F-formation if they can see its o-space centre. By allowing other people to occlude the o-space centre, we are able to capture more subtle nuances, such as people being crowded out of F-formations or deliberately ostracised. Broadly speaking, an individual is excluded from an F-formation when another individual stands between them and the group centre. Let $\theta^g_{i,j}$ be the angle between two individuals $i$ and $j$ about a given o-space centre $O_g$, for which $G_i = G_j = g$ is assumed, and let $d^g_i$ and $d^g_j$ be the distances of $i$ and $j$, respectively, from $O_g$. A pairwise cost $R_{i,j}(g)$ defined on these quantities captures this property and is incorporated into a new cost function. $R_{i,j}(g_i)$ acts as a visibility constraint on $i$ regardless of the group person $j$ is assigned to; as such, it can be treated as a unary cost, or data term, and included in the graph-cut based part of the optimisation.
Now we turn to the other half of the optimisation: updating the o-space centres. Although, given an assignment of people to an o-space centre, a local minimum can be found using any off-the-shelf non-convex optimisation, we take a different approach. There are two points to be aware of: first, the difference between the previous cost function and the new one is sharply peaked and close to zero in most locations, and can generally be safely ignored; second, and more importantly, we may often want to move out of a local minimum. If updating an o-space centre results in a very high repulsion cost for one individual, this can often be dealt with by assigning the individual to a new group, which results in a lower overall cost and a more accurate labelling. As such, when optimising the o-space centres, we pass two proposals for each currently active model to graph-cuts: the previously generated proposal, and a new proposal based on the current mean of the F-formation. As the graph-cut based optimisation starts from the previous solution and only moves to lower-cost labellings, the cost decreases at every step and the procedure is guaranteed to converge to a local optimum.
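The occlusion test described above can be illustrated with a minimal Python sketch. The text fixes only the ingredients (the angle between two individuals about the o-space centre and their distances from it), so the functional form of the penalty below, as well as the names `visibility_penalty`, `theta_max` and `k`, are assumptions made for illustration and should not be read as the exact cost used by GCFF.

```python
import math

def visibility_penalty(p_i, p_j, centre, theta_max=math.radians(30), k=1.0):
    """Hypothetical penalty for person i being occluded by person j.

    p_i, p_j, centre are (x, y) tuples. The penalty is zero unless j is
    closer to the centre than i AND the two subtend a small angle about it.
    The exponential/ratio form is an illustrative assumption.
    """
    d_i = math.dist(p_i, centre)
    d_j = math.dist(p_j, centre)
    if d_j == 0.0 or d_i <= d_j:
        return 0.0  # j is not between i and the centre
    ang_i = math.atan2(p_i[1] - centre[1], p_i[0] - centre[0])
    ang_j = math.atan2(p_j[1] - centre[1], p_j[0] - centre[0])
    # Smallest absolute angle between the two directions, in [0, pi].
    theta = abs((ang_i - ang_j + math.pi) % (2 * math.pi) - math.pi)
    if theta >= theta_max:
        return 0.0  # j is off to the side and does not block the view
    return math.exp(k * math.cos(theta)) * (d_i - d_j) / d_j
```

Because the penalty for `i` depends only on the group `i` is tested against, summing it over all `j` yields a per-person term that can be added to the data term of the labelling step.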

Experiments
The experiments section contains, to the best of our knowledge, the most exhaustive analysis of group detection methods in still images carried out so far in the computer vision literature.
In the preliminary part, we describe the five publicly available datasets used as benchmarks, the six methods used for comparison, and the metrics adopted to evaluate detection performance. We then give an explanatory example of how our approach, GCFF, works, using a scenario taken from the Synthetic dataset. The experiments continue with a comparative evaluation of GCFF on all the benchmarks against all the competing methods, taking the best performance of each approach. Here, GCFF outperforms all the competitors, setting new state-of-the-art scores in all cases. An analysis of the ability to detect groups of a given cardinality and a noise robustness analysis conclude the section.

Datasets
Five publicly available datasets are used for the experiments: two from [16] (Synthetic and Coffee Break), one from [31] (IDIAP Poster Data), one from [58] (Cocktail Party), and one from [6] (GDet). A summary of the dataset features is given in Table 2, while a detailed presentation of each dataset follows. The participants in the original experiments gave their permission to share the images and video for scientific purposes. In Fig. 2, some frames of the datasets are shown.

Synthetic
A psychologist generated a set of 10 diverse situations, each one repeated 10 times with minor variations, resulting in 100 frames representing different social situations, with the aim of spanning as many F-formation configurations as possible. An average of 9 individuals and 3 groups are present in each scene, and there are also individuals not belonging to any group. Proxemic information is noiseless, in the sense that there is no noise in the position and orientation state of each individual.

IDIAP Poster Data (IPD)
Over 3 hours of aerial videos (resolution 654×439 px) were recorded during a poster session of a scientific meeting. Over 50 people walk through the scene, forming several groups over time. A total of 82 images were selected with the aim of maximising the crowdedness and variance of the scenes. Images are unrelated to each other in the sense that there are no consecutive frames, and the time lag between them prevents temporal smoothness from being exploited. As for the data annotation, a total of 24 annotators were grouped into 3-person subgroups and asked to identify F-formations and their associates from static images. Each person's position and body orientation was manually labelled and recorded as pixel values in the image plane; one pixel represents approximately 1.5 cm. The difficulty of this dataset lies in the great variety of F-formation typologies present in the scenario (besides circular ones, L-shaped and side-by-side arrangements also occur).

Cocktail Party (CP)
This dataset contains about 30 minutes of video recordings of a cocktail party in a 30 m² lab environment involving 7 subjects. The party was recorded using four synchronised angled-view cameras (15 Hz, 1024 × 768 px, JPEG) installed in the corners of the room. Subjects' positions were logged using a particle filter-based body tracker [42], while head pose estimation is computed as in [43]. Groups in one frame every 5 seconds were manually annotated by an expert, resulting in a total of 320 labelled frames for evaluation. This is the first dataset where proxemic information is estimated automatically, so errors may be present. However, due to the highly supervised scenario, errors are very few.

Coffee Break (CB)
The dataset focuses on a coffee-break scenario of a social event, with a maximum of 14 individuals organised in groups of 2 or 3 people each. Images are taken from a single camera with a resolution of 1440 × 1080 px. People's positions have been estimated by exploiting multi-object tracking on the heads, and head detection has been performed afterwards [61], considering solely 4 possible orientations (front, back, left and right) in the image plane. The tracked positions and head orientations were then projected onto the ground plane. As for the ground truth, a psychologist annotated the videos indicating the groups present in the scenes, for a total of 119 frames split into two sequences. The annotations were generated by analysing each frame in combination with questionnaires that the subjects filled in. This dataset represents one of the most difficult benchmarks, since the rough head orientation information, also affected by noise, is in many cases unreliable. However, it also represents one of the most realistic scenarios, since all the proxemic information comes from automatic, off-the-shelf computer vision tools.

GDet
The dataset is composed of 5 subsequences of images acquired by 2 angled-view low-resolution cameras (352 × 328 px), with the number of frames per subsequence ranging from 17 to 132, for a total of 403 annotated frames. The scenario is a vending machine area where people meet and chat while having coffee. This is similar to the Coffee Break scenario, but here the setting is indoor, which makes occlusions frequent and severe; moreover, the people in this scenario know each other in advance. The videos were acquired with two monocular cameras located in opposite corners of the room. To ensure natural behaviour, the people involved were not aware of the purposes of the experiment. Ground truth generation follows the same protocol as in Coffee Break, but in this case people tracking was performed using the particle filter proposed in [42]. Also in this case, head orientation was quantised to 4 angles. This dataset, together with Coffee Break, is the closest to what computer vision can give as input to an FCG detection technique.

Alternative methods
As alternative methods, we consider all the suitable approaches proposed in the state of the art. Six methods are taken into account: one exploiting the concept of view frustum (IRPM [6]), two approaches based on dominant sets (DS [31] and IGD [62]), and three different versions of the Hough Voting approach, using a linear accumulator [16], an entropic accumulator [57] and a multi-scale procedure [58]. A brief overview of the different methods follows; some of them are also explained in the Introduction and in the Literature Review section. Please refer to the specific papers for more details about the algorithms.

Inter-Relation Pattern Matrix (IRPM)
Proposed by Bazzani et al. [6], it uses the head direction to infer the 3D view frustum as an approximation of the Focus of Attention (FoA) of an individual; given the FoA and proximity information, interactions are estimated: the idea is that close-by people whose view frustums intersect are in some way interacting.

Dominant Sets (DS)
Presented by Hung and Kröse [31], this algorithm considers an F-formation as a dominant-set cluster [52] of an edge-weighted graph, where each node in the graph is a person, and the edges between them measure the affinity between pairs.

Interacting Group Discovery (IGD)
Presented by Tran et al. [62], it is based on dominant sets extraction from an undirected graph where nodes are individuals and the edges have a weight proportional to how much people are interacting. This method is similar to DS, but it differs in the way the weights of the edges in the graph are computed; in particular, it exploits social cues to compute this weight, approximating the attention of an individual as an ellipse centred at a fixed offset in front of them. Interaction is based on the intersection of the attention ellipses of two individuals: the more the ellipses overlap, the more the two individuals are interacting.

Hough Voting for F-formation (HVFF)
Under this caption, we consider a set of methods based on a Hough Voting strategy, which build an accumulation space and find the local maxima of the resulting function.
The general idea is that each individual is associated with a Gaussian probability density function which describes the position of the o-space centre they are pointing at. The pdf is approximated by a set of samples, each of which votes for a given o-space centre location. The voting space is then quantized and the votes are aggregated on square cells, so as to form a discrete accumulation space. Local maxima in this space identify o-space centres and, consequently, F-formations. The first work in this field is [16], where the votes are linearly accumulated by simply summing up the weights of all votes belonging to the same cell. A first improvement of this approach is presented in [57], where the votes are aggregated using the weighted Boltzmann entropy function. In [58] a multi-scale approach is used on top of the entropic version: the idea is that groups with higher cardinality tend to arrange themselves around a larger o-space; the entropic group search runs for different o-space dimensions, filtering groups by cardinality; afterwards, a fusion step based on a majority criterion is applied.
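The linear-accumulator variant of this voting scheme can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation of [16]: every vote has unit weight, the Gaussian is centred at a hypothetical distance `stride` in front of each person, and only the global maximum of the accumulator is returned rather than all local maxima. The function and parameter names are invented for this example.

```python
import math
import random
from collections import Counter

def hough_votes(people, stride, sigma, cell, n_samples=200, seed=0):
    """Accumulate o-space centre votes on a square grid.

    people: list of (x, y, theta) with theta in radians.
    Returns a Counter mapping grid cell (ix, iy) -> number of votes.
    """
    rng = random.Random(seed)
    acc = Counter()
    for x, y, theta in people:
        # Mean of the vote distribution: a point `stride` ahead of the person.
        cx = x + stride * math.cos(theta)
        cy = y + stride * math.sin(theta)
        for _ in range(n_samples):
            vx = rng.gauss(cx, sigma)
            vy = rng.gauss(cy, sigma)
            # Quantize the vote to a square cell (unit weight per vote).
            acc[(math.floor(vx / cell), math.floor(vy / cell))] += 1
    return acc

def best_cell_centre(acc, cell):
    """Return the (x, y) centre of the cell with the most votes."""
    (ix, iy), _ = acc.most_common(1)[0]
    return ((ix + 0.5) * cell, (iy + 0.5) * cell)
```

For two people facing each other, for instance `[(-0.5, 0.5, 0.0), (1.5, 0.5, math.pi)]` with `stride=1.0`, both vote distributions are centred on the same point and the winning cell lies between them. The entropic and multi-scale variants replace the plain vote count with a different aggregation rule.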

Evaluation metrics
As accuracy measures, we adopt the metrics proposed in [16] and extended in [57]: we consider a group as correctly estimated if at least $T \cdot |G|$ of its members are found by the grouping method and correctly detected by the tracker, and if no more than $(1 - T) \cdot |G|$ false subjects (of the detected tracks) are identified, where $|G|$ is the cardinality of the labelled group $G$, and $T \in\, ]0, 1]$ is an arbitrary threshold, called the tolerance threshold. In particular, we focus on two interesting values of $T$: 2/3 and 1.
With this definition of tolerant match, we can determine for each frame the correctly detected groups (true positives, TP), the missed groups (false negatives, FN) and the hallucinated groups (false positives, FP). With these, we compute the standard pattern recognition metrics precision and recall:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN},$$

and the $F_1$ score, defined as the harmonic mean of precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}.$$

In addition to these metrics, we present in this paper a new metric which is independent of the tolerance threshold $T$. We compute this new score as the area under the curve (AUC) in the $F_1$ vs. $T$ graph, with $T$ varying from 1/2 to 1. We call it the Global Tolerant Matching score (GTM). Since in our experiments we only have groups of up to 6 individuals, without loss of generality we consider $T$ varying with 3 equal steps in the range stated above.
Moreover, we will discuss results also in terms of group cardinality, by computing the F 1 score for each cardinality separately and then computing mean and standard deviation.
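As a concrete illustration of the tolerant match and the derived scores, the following Python sketch evaluates a single frame. It is a sketch under assumptions rather than the official evaluation code: the bound on false subjects is read as (1 − T)·|G|, in line with the T = 2/3 case discussed with the results, each ground-truth group is greedily matched to at most one detected group, and the function names are hypothetical. `Fraction` is used for T to avoid floating-point issues at the threshold.

```python
from fractions import Fraction

def tolerant_match(gt_group, det_group, T):
    """True if det_group matches gt_group under tolerance threshold T."""
    gt_group, det_group = set(gt_group), set(det_group)
    found = len(gt_group & det_group)   # correctly grouped members
    false = len(det_group - gt_group)   # members that do not belong
    return found >= T * len(gt_group) and false <= (1 - T) * len(gt_group)

def frame_counts(gt_groups, det_groups, T):
    """Return (TP, FP, FN) for one frame using greedy one-to-one matching."""
    unmatched = list(det_groups)
    tp = 0
    for g in gt_groups:
        for d in unmatched:
            if tolerant_match(g, d, T):
                unmatched.remove(d)  # each detection is used at most once
                tp += 1
                break
    return tp, len(unmatched), len(gt_groups) - tp

def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1, with 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

With ground truth `[{1, 2, 3}, {4, 5}]` and detections `[{1, 2}, {4, 5, 6}]`, the first group is matched at T = 2/3 (two of three members found, no false subjects) while the second is not (one false subject exceeds the allowed 2/3), and neither is matched at T = 1.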

An explicative example
Figure 6 gives a visual insight into our graph-cuts process. Given the position and orientation of each individual $P_i$, the algorithm starts by computing the transitional segments $C_i$. At iteration 0, the candidate o-space centres $O_i$ are initialized so that they coincide with the transitional segments $C_i$; 11 individuals are present in this example, so 11 candidate o-space centres are generated. After iteration 1, the proposed segmentation process provides 1 singleton ($P_{11}$) and 5 FCGs of two individuals each. We can appreciate different configurations such as vis-a-vis ($O_{1,2}$), L-shape ($O_{3,4}$) and side-by-side ($O_{5,6}$). Still, the grouping in the bottom part of the image ($P_7$ to $P_{10}$) is wrong, since it violates the exclusion principle. In iteration 2, the previous candidate o-space centres are used as initialization, and a new graph is built. In this new configuration, the group $O_{7,10}$ is recognized as violating the visibility constraint and thus the related edge is penalized; a new run of graph-cuts minimization correctly clusters the individuals into a singleton ($P_{10}$) and an FCG formed by three individuals ($O_{7,8,9}$), which corresponds to the ground truth (visualized as dashed circles).

Best results analysis
Given the metrics explained above, the first test analyses the best performance of each method on each dataset; in practice, a tuning phase was carried out for each method/dataset combination in order to get the best performance. The best parameters (found on half of one sequence by cross-validation, and kept unchanged for the remaining datasets) are reported in Table 3. Note that the parameters can also be fixed by hand, since the stride D depends on the social context under analysis (formal meetings will have a higher D, and the presence of tables and similar items may also increase the diameter of the FCGs): for a given D, for example, it is assumed that circular F-formations will have a diameter of 2D. The parameter σ indicates how permissive we are in accepting deviations from such a diameter. Moreover, D also depends on the measurement units (pixels/cm) that characterize the proxemic information associated with each individual in the scene.

Table 3: Parameters used in the experiments for each dataset. These parameters are the result of a tuning phase, and the differences are due to different measurement units (pixels/cm) and different social environments (indoor/outdoor, formal/informal, etc.).
Table 4 shows the best results for the threshold T = 2/3, which corresponds to finding at least 2/3 of the members of a group with no more than 1/3 of false subjects, while Table 5 presents results with T = 1, considering a group as correct if all and only its members are detected. The proposed method outperforms all the competitors on all the datasets. With T = 2/3, three observations can be made. The first is that our approach, GCFF, substantially improves the precision (by 13% on average) and, even more so, the recall (by 17% on average) of the state-of-the-art approaches. The second is that our approach produces the same score for both precision and recall; this is a desirable property, since so far all FCG detection approaches have proved weak in recall. The third is that GCFF performs well both when there are no errors in the position or orientation of people (as in the Synthetic dataset) and when strong noise in position and orientation is present (Coffee Break, GDet). When moving to a tolerance threshold of 1 (all the people in a group have to be found, and no false positives are allowed), the performance is understandably lower, but the improvement over the state of the art is even stronger, in general on all the datasets: in particular, on the Cocktail Party dataset, the results are more than twice the scores of the competitors. Finally, even in this case, GCFF produces very similar scores for precision and recall.
A performance analysis is also provided by varying the tolerance threshold T. Fig. 7 shows the average $F_1$ scores for each method, computed over all the frames and datasets. From the curves we can see that the proposed method is consistently the best performing for each value of T. The legend of Fig. 7 also reports the Global Tolerant Matching score. Again, GCFF outperforms the state of the art, independently of the choice of T.
The reason why our approach does better than the competitors was explained in the state-of-the-art section and is briefly summarized here: the Dominant Set-based approaches, DS and IGD, although based on an elegant optimization procedure, tend to find circular groups and are weaker at detecting other kinds of F-formations. The Hough-based approaches HVFF X (X = lin, ent, ms) model the F-formation well, allowing any shape to be found, but rely on a greedy optimization procedure. Finally, the IRPM approach has a rough model of the F-formation. Our approach, by contrast, combines a rich model of the F-formation with a powerful optimization strategy.

Cardinality analysis
As stated in [58], some methods have been shown to work better with some group cardinalities. In this experiment, we systematically check this aspect, evaluating the performance of all the considered methods in detecting groups with a particular number of individuals. Since the Synthetic, Coffee Break and IDIAP Poster Data datasets only have groups of cardinality 2 and 3, we focus on the remaining 2 datasets, which have a more uniform distribution of group cardinalities. Tables 6 and 7 show the $F_1$ scores for each method and each group cardinality for the Cocktail Party and GDet datasets, respectively. In both cases the proposed method outperforms the other state-of-the-art methods in terms of average $F_1$ score, with a very low standard deviation. In particular, only IRPM gives results on the GDet dataset that are more stable than ours, but they are considerably poorer.

Noise analysis
In this experiment, we show how the methods behave under different degrees of noise. To this end, we take the Synthetic dataset as a starting point and add to the proxemic state of each individual in each frame random values drawn from a known noise distribution. We assume that the noise follows a Gaussian distribution with mean 0, and that the noise on each dimension (position, orientation) is uncorrelated. For our experiments we used $\sigma_x = \sigma_y = 20$ cm and $\sigma_\theta = 0.1$ rad. We consider 11 levels of noise, $L_n = 0, \ldots, 10$. In particular, we produce results by adding noise to position only (leaving the orientation at its exact value), to orientation only (leaving the position of each individual at its exact value) and to both position and orientation. Fig. 8 shows the $F_1$ scores for each method as the noise level increases. In this case we can see that, with high orientation noise and with combined noise, IGD performs comparably to or better than GCFF; this confirms that methods based on Dominant Sets perform very well when the orientation information is not reliable, as already stated in [57].
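The perturbation protocol can be sketched as follows. This is a minimal Python sketch, and one point in it is an assumption made purely for illustration: the text does not reproduce how the noise grows with the level, so the standard deviation is taken here to scale linearly with $L_n$. The function name `add_noise` and its parameters are likewise hypothetical.

```python
import math
import random

def add_noise(people, level, sigma_xy=20.0, sigma_theta=0.1,
              position=True, orientation=True, seed=0):
    """Perturb (x, y, theta) states with zero-mean, uncorrelated Gaussian noise.

    ASSUMPTION: the standard deviation at noise level `level` is
    level * sigma (linear scaling); the source does not state this rule.
    Positions are in cm, orientations in radians.
    """
    rng = random.Random(seed)
    noisy = []
    for x, y, theta in people:
        if position:
            x += rng.gauss(0.0, level * sigma_xy)
            y += rng.gauss(0.0, level * sigma_xy)
        if orientation:
            theta = (theta + rng.gauss(0.0, level * sigma_theta)) % (2 * math.pi)
        noisy.append((x, y, theta))
    return noisy
```

Calling it with `position=False` or `orientation=False` reproduces the orientation-only and position-only settings, while level 0 returns the noiseless state unchanged.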

Conclusions
In this paper we presented a statistical framework for the detection of free-standing conversational groups (FCGs) in still images. FCGs represent very common and crucial social events, where social ties (intimate vs. formal relationships) emerge naturally; for this reason, the detection of FCGs is of primary importance in a wide spectrum of applications. The proposed algorithm is based on a graph-cuts minimization scheme, which essentially clusters individuals into groups; in particular, the computational model implements the sociological definition of F-formation, describing how people forming an FCG arrange themselves in space. The take-home message is that basic proxemic information (people's location and orientation) is enough to detect groups with high accuracy. This claim rests on one of the most exhaustive experimental sessions carried out so far on this matter, with 5 diverse datasets taken into account and all the best approaches in the literature considered as competitors; in addition, an in-depth analysis of the robustness to noise and of the capability of detecting groups of a given cardinality has also been carried out. The natural extension of this study consists in analyzing temporal information, that is, video sequences: in this scenario, interesting phenomena such as entering or exiting a group could be considered and modeled, and temporal smoothness could be exploited to generate even more precise FCG detections.

Figure 1: Examples of F-formations. a) In orange, the o-space; b) an aerial image of a circular F-formation; c) a party, similar to a typical surveillance setting with the camera located 2-3 meters from the floor: detecting F-formations here is challenging.

Figure 2: Sample images of the four real-world datasets. For each dataset four frames are reported showing different situations of crowd and arrangement.
Panel labels: a) Circular arrangement; b) Vis-a-vis arrangement; c) L-arrangement; d) Side-by-side arrangement.

Figure 4: Structure of an F-formation and examples of F-formation arrangements. a) Schematization of the three spaces of an F-formation: starting from the centre, o-space, p-space and r-space. b-d) Three examples of F-formation arrangements: for each one of them, one picture highlights the head and shoulder pose, the other shows the lower body posture. For a picture of a circular F-formation, see also Fig. 1.

Figure 5: Schematic representation of the problem formulation. Two individuals facing each other, the gray dot representing the transitional segment centre, the red cross being the o-space centre and the red area the o-space of the F-formation.

Figure 6: An explicative example. Iteration 0: initialization with the candidate o-space centres {O} coincident with the transitional segment of each individual {C}. Iteration 1: first graph-cuts run; easy groups are correctly clustered while the most complex ones still present errors (the FCG formed by $P_7$ and $P_{10}$ violates the visibility constraint). Iteration 2: the second graph-cuts run correctly detects the $O_{7,8,9}$ F-formation (at the bottom). See text for more details.

Table 2: Summary of the features of the datasets used for experiments.

Table 4: Average precision, recall and $F_1$ scores for all the methods and all the datasets (T = 2/3).

Table 5: Average precision, recall and $F_1$ scores for all the methods and all the datasets (T = 1).