Explaining the difference between men’s and women’s football

Women’s football is gaining supporters and practitioners worldwide, raising questions about what the differences are with men’s football. While the two sports are often compared based on the players’ physical attributes, we analyze the spatio-temporal events during matches in the last World Cups to compare male and female teams based on their technical performance. We train an artificial intelligence model to recognize if a team is male or female based on variables that describe a match’s playing intensity, accuracy, and performance quality. Our model accurately distinguishes between men’s and women’s football, revealing crucial technical differences, which we investigate through the extraction of explanations from the classifier’s decisions. The differences between men’s and women’s football are rooted in play accuracy, the recovery time of ball possession, and the players’ performance quality. Our methodology may help journalists and fans understand what makes women’s football a distinct sport and coaches design tactics tailored to female teams.


Introduction
Women's football took its first steps thanks to the independent women of the Kerr Ladies team, who gave the most significant impetus to this sport in the early twentieth century [30]. As time passed, the Kerr Ladies intrigued the English crowds with their ability to stand up to male teams in numerous charity competitions. The success and enthusiasm of these events aroused concerns within the English Football Association, which on December 5, 1921, decreed that "football is quite unsuitable for females and ought not to be encouraged", and requested "the clubs belonging to the Association to refuse the use of their grounds for such matches" [30]. Unfortunately, this measure drastically slowed down the development of women's football, which, after a long period of stagnation, resurfaced in the first half of the 1960s in northern European countries such as Norway, Sweden, and Germany. From that moment on, the development of women's football was unstoppable, spreading to the stadiums of Europe and the world and carving out a notable showcase among the most popular sports in the world. Since 2012 the number of women's academies has doubled [18], and around 40 million girls and women play football worldwide today [26].
In the last decade, the attention around women's football has stimulated the birth of statistical comparisons with men's football [18,28,11]. Bradley et al. [2] compare 52 men and 59 women, drawn during a Champions League season, and observe that women cover more distance than men at lower speeds, especially in the final minutes of the first half. However, at higher speed levels, men perform better throughout the game [2]. Sakamoto et al. [28] examine the shooting performance of 17 men and 17 women belonging to a university league, finding that women have lower average values than men on ball speed, foot speed, and ball-to-foot velocity ratio [28]. Pedersen et al. [26] question the rules and regulations of the game and, taking into account the average height difference between 20-25-year-old men and women, estimate that the "fair" goal height in women's football should be 2.25 m, instead of 2.44 m. Gioldasis et al. [11] recruit 37 male and 27 female players from an amateur youth league and find that, while among male players there is a significant difference between roles for almost all technical skills, among female players only dribbling ability presents a significant difference. Sakellaris [29] finds that, in international football competitions, female teams have a higher average number of goals scored per match than their male counterparts. Finally, Lange et al. [18] follow 157 female and 207 male young Dutch footballers to investigate the tendency to stop the game to permit a teammate's or opponent's care on the ground, finding that women show, on average, a greater willingness to help.
An overview of the state of the art cannot avoid noticing that current studies focus on physical features and analyze small samples of male and female players using data collected on purpose. At the same time, although massive digital data about the technical behavior of players are nowadays available at an unprecedented scale and detail [24,22,13,7,1,27], investigations of the differences between women's and men's football from a technical point of view are still limited. Is the intensity of play in women's matches higher than in men's? Are women more accurate than men in passing? Furthermore, does the statistical distribution of male players' performance quality differ from that of female players?
In this article, we analyze a large dataset describing 173k spatio-temporal events that occur during the last men's and women's World Cups: 64 and 44 matches, respectively, and 32 men's and 24 women's teams with 736 male players and 546 female players. To the best of our knowledge, ours is the largest sample of men's and women's football matches and players. We quantify players' and teams' performance in several ways, from the number of game events generated during a match to the proportion of accurate passes, the velocity of the game, the quality of individual performance, and teams' collective behavior. We then tackle the following question: Can a machine distinguish a male team from a female one based on their technical performance only?
Using a machine learning classifier, we show that men's and women's football do have apparent differences, which we investigate through the extraction of global and local explanations from the classifier's decisions. Opening the classifier's black box allows us to reveal that, while the intensity of the game is similar, the differences between men's and women's football are rooted in play accuracy, time to recover ball possession, and the typical performance quality of the players.
Our methodology is useful to several actors in the sports industry. On the one hand, a deeper understanding of female and male performance differences may help coaches and athletic trainers design training sessions, strategies, and tactics tailored for women players. On the other hand, our results may help sports journalists explain, and football fans understand, what makes women's football a distinct sport.

Football Data
We use data related to the last men's World Cup 2018, describing 101,759 events from 64 matches, 32 national teams and 736 players, and the last women's World Cup 2019, with 71,636 events from 44 matches, 24 national teams and 546 players. Each event records its type (e.g., pass, shot, foul), a time-stamp, the player(s) related to the event, the event's match, the position on the field, the event subtype and a list of tags, which enrich the event with additional information [24] (see an example of event in Table 1). Events are annotated manually from each match's video stream using proprietary software (the tagger) by three operators: one operator per team and one operator acting as the supervisor responsible for the output of the whole match. The dataset regarding the men's World Cup 2018 has been publicly released recently [25], together with a detailed description of the data format, the data collection procedure, and its reliability [24,23]. Match event streams are nowadays a standard data format widely used in sports analytics for performance evaluation [23,7,22,8] and advanced tactical analysis [9,5,15]. Figure 1a shows some events generated by a player in a match. Figure 1b shows the distribution of the total number of events in our dataset: on average, a football match has around 1600 events, whereas a couple of matches have up to 2200 events.

Technical Performance
Do technical characteristics of men's and women's football significantly differ, statistically speaking? To answer this question, we define variables that describe relevant technical aspects of the game and show for which of them there is a statistical difference between men and women. In particular, we investigate three technical aspects: (i) intensity of play (Section 3.1); (ii) shooting distance (Section 3.2); and (iii) performance quality (Section 3.3).

Intensity of play
The intensity of play is associated with a team's chance of success [5,6]. Here, we measure intensity of play in terms of volume and velocity.
Volume. For each team in a match, we compute the total number of events and the number of specific event types (duels, fouls, free kicks, offsides, passes and shots) [24]. Although, on average, men's matches show more events than women's, this difference is not statistically significant (unpaired t-score = 1.40, p-value = 0.16, see Table 2). Women's matches have, on average, more free kicks, duels, others on the ball (i.e., accelerations, clearances and ball touches) and passes but fewer fouls than men's matches (Table 2). Additionally, men's passes are on average more accurate than women's (unpaired t-score = 8.95, p-value < 0.001).

{"eventName": "Pass", "eventSec": 2.41, "playerId": 3344, "matchId": 2576335, "teamId": 3161, "positions": [{"x": 49, "y": 50}], "subEventName": "Simple pass", "tags": [{"id": 1801}]}
Table 1: Example of event corresponding to an accurate pass. eventName indicates the name of the event's type: there are seven types of events (pass, foul, shot, duel, free kick, offside and touch). eventSec is the time when the event occurs (in seconds since the beginning of the current half of the match); playerId is the identifier of the player who generated the event. matchId is the match's identifier. teamId is the team's identifier. subEventName indicates the name of the subevent's type. positions is the event's origin and destination positions. Each position is a pair of coordinates (x, y) in the range [0, 100], indicating the percentage of the field from the perspective of the attacking team. tags is a list of event tags, each describing additional information about the event (e.g., accurate). A thorough description of this data format and its collection procedure can be found in [24].
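As an illustration of the volume features, the following minimal sketch counts a team's events by type and derives its pass accuracy from events that follow the schema of Table 1. It assumes that tag id 1801 marks an accurate pass (as in the Table 1 example); the function name `volume_features` and the toy events are hypothetical and not part of the original pipeline.

```python
from collections import Counter

# Assumption: tag id 1801 (the tag of the accurate pass in Table 1) means "accurate".
ACCURATE_TAG = 1801

def volume_features(events, team_id):
    """Count a team's events by type and compute its share of accurate passes."""
    team_events = [e for e in events if e["teamId"] == team_id]
    by_type = Counter(e["eventName"] for e in team_events)
    passes = [e for e in team_events if e["eventName"] == "Pass"]
    accurate = sum(
        any(tag["id"] == ACCURATE_TAG for tag in e["tags"]) for e in passes
    )
    acc_p = accurate / len(passes) if passes else float("nan")
    return {
        "n_events": len(team_events),
        "by_type": dict(by_type),
        "n_accurate_passes": accurate,
        "AccP": acc_p,
    }

# Toy events with made-up values; only the fields used above are included.
events = [
    {"eventName": "Pass", "teamId": 3161, "tags": [{"id": 1801}]},
    {"eventName": "Pass", "teamId": 3161, "tags": []},
    {"eventName": "Shot", "teamId": 3161, "tags": []},
    {"eventName": "Foul", "teamId": 9999, "tags": []},
]
features = volume_features(events, 3161)
print(features)
```

Applied to every team in every match, such a function would yield the per-match counts and the AccP values that are then compared between men and women.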
Velocity. The average pass velocity PassV(g) measures the average time between two consecutive passes in a match g, and the average ball recovery time RecT(g) measures the average time for a team to recover ball possession in g (see Supplementary Information 1). The interruption time StopT(g) indicates the time spent between two consecutive actions (i.e., the time to take a free kick, a corner kick or a throw-in). We also measure the average time between a team's two consecutive shots in a match and the average pass length PassL(g), i.e., the average distance between a pass's starting and ending points. For all of these features, we perform an unpaired t-test to detect differences between men and women (Table 2). We find that women's PassV(g) (unpaired t-score = 8.69, p-value < 0.001) is lower than men's, denoting a higher velocity of passes in men's football (unpaired t-score = 3.540, p-value < 0.001). At the same time, women's RecT(g) is lower than men's (unpaired t-score = 5.41, p-value < 0.001), i.e., women regain ball possession faster. In contrast, men's passes PassL(g) (unpaired t-score = 3.54, p-value < 0.001) are on average longer than women's.

Table 2: Statistical difference of technical features between male and female teams. The summary data for both women and men are reported as mean ± standard deviation per match. Grey rows indicate features for which the difference between men and women is statistically significant. The highest values are highlighted in bold.
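To make the velocity features and the test concrete, here is a small sketch of how the average time between consecutive passes and an unpaired t statistic could be computed. Since eventSec restarts at each half, gaps are computed within each half separately. The pooled-variance (Student) form of the t statistic is an assumption, as the text does not say which variant is used; both function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def avg_time_between_passes(pass_times_by_half):
    """Average gap (seconds) between consecutive passes, computed per half."""
    gaps = []
    for times in pass_times_by_half:
        ordered = sorted(times)
        gaps.extend(later - earlier for earlier, later in zip(ordered, ordered[1:]))
    return mean(gaps)

def unpaired_t(sample_a, sample_b):
    """Student's t statistic for two independent samples (pooled variance, assumed)."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / (
        n_a + n_b - 2
    )
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))

# Toy pass timestamps for the two halves of one match (made-up values).
pass_v = avg_time_between_passes([[0.0, 2.0, 5.0], [1.0, 4.0]])

# Toy per-match feature values for two groups of matches (made-up values).
t_score = unpaired_t([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(pass_v, t_score)
```

The p-value would then be read from a t distribution with n_a + n_b - 2 degrees of freedom, for instance with a statistics library.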

Shooting distance
We explore the spatial distribution of the positions where male and female players perform free kicks and shots (see Supplementary Figure 10) and quantify the shooting distance ShotD as the Euclidean distance from the position where the shot starts to the center of the opponents' goal. To test for a statistical difference between men and women, we use the non-parametric Mann-Whitney U-test. On average, male players kick the ball from a greater distance than female players (p-value < 0.001, Table 2).
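The shooting distance can be sketched as follows. Event coordinates are percentages of the field from the attacking team's perspective, so the sketch assumes the opponents' goal is centered at (100, 50) and converts to meters using an assumed 105 m x 68 m pitch, which the text does not specify. The Mann-Whitney U statistic is computed by direct pair counting; the helper names are hypothetical.

```python
from math import hypot

# Assumptions: pitch size and goal position are not stated in the text.
PITCH_LENGTH_M = 105.0
PITCH_WIDTH_M = 68.0
GOAL_CENTER = (100.0, 50.0)  # percent coordinates, attacking team's perspective

def shot_distance(x_pct, y_pct):
    """Euclidean distance (meters) from a shot's start to the goal center."""
    dx = (GOAL_CENTER[0] - x_pct) / 100.0 * PITCH_LENGTH_M
    dy = (GOAL_CENTER[1] - y_pct) / 100.0 * PITCH_WIDTH_M
    return hypot(dx, dy)

def mann_whitney_u(sample_a, sample_b):
    """U statistic of sample_a: pairs where a > b, with ties counted as one half."""
    return sum((a > b) + 0.5 * (a == b) for a in sample_a for b in sample_b)

# Toy shots (made-up positions).
d_near = shot_distance(90.0, 50.0)   # central shot, 10% of the pitch length away
d_goal = shot_distance(100.0, 50.0)  # on the goal line, at the goal center
u_stat = mann_whitney_u([3.0, 4.0], [1.0, 2.0])
print(d_near, d_goal, u_stat)
```

This only yields the test statistic; obtaining a p-value requires the distribution of U under the null hypothesis, which a statistics library would normally provide.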
To take into account that men and women may have a different perception of distance to the opponents' goal, we split the attacking midfield into three zones Z1, Z2 and Z3, according to the two distributions of shooting distance, i.e., looking at a shot's minimum and maximum starting positions. Z1 is the area closest to the goal, Z3 the furthest, and Z2 the zone in the middle. The zones of women are 1.1 meters closer to the goal than the zones of men (p-value < 0.001).
We then use a z-test for proportions with two independent samples to verify whether there is a difference in the shooting activity between men and women. Female teams have a higher percentage of shots from their Z1 zone than male teams (p-value = 0.01); the opposite is true in the Z2 shooting area (p-value = 0.004). Finally, female teams have a higher percentage of shots from their Z3 shooting area (p-value = 0.02) than male teams.
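For reference, a two-sample z-test for proportions can be sketched as below. It uses the standard pooled-proportion standard error and a two-sided p-value from the normal distribution; the function name and the toy counts are hypothetical and do not come from the dataset.

```python
from math import erfc, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference of two proportions."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Toy counts: shots taken from one zone out of all shots, for two groups.
z_same, p_same = two_proportion_z_test(50, 100, 50, 100)
z_diff, p_diff = two_proportion_z_test(60, 100, 40, 100)
print(z_same, p_same, z_diff, p_diff)
```

In the setting above, the two samples would be the shots of female and male teams, and a "success" a shot taken from the zone under test.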

Performance quality
We use the PlayeRank algorithm [23] to compute the PR score, which quantifies a player's performance quality in a match (see Supplementary Information 2 for details on the algorithm). PlayeRank agrees robustly with rankings of players given by professional football scouts, thanks to its capability of describing football performance comprehensively [23]. For each match g, and for both teams, we compute the mean and the standard deviation of the individual PR scores, PR_avg(T, g) and PR_std(T, g), respectively. High values of PR_avg(T, g) indicate that the players in team T perform well in match g, on average. High values of PR_std(T, g) indicate a large variability of PR across the teammates in match g. Male players have higher PR_avg than female players (unpaired t-score = 9.01, p-value < 0.001) but similar PR_std (unpaired t-score = -0.40, p-value = 0.69). We find a statistical difference in the PR score between men and women for left fielders only (Figure 2).
We also explore the differences in the collective behavior of male and female teams by computing the passing networks, graphs in which nodes are players and edges represent passes between teammates in a match [5,10,19,4,3]. From the passing network of a team T in a match g we derive the H indicator H(T, g) [5,6] and the team flow centrality FC(T, g) [10], two ways of quantifying the goodness of a team's performance in a match [24] (Supplementary Information 3). H(T, g) summarizes different aspects of a team's passing behaviour, such as the average amount µ_p of passes and the variance σ_p of the number of passes [5]. The higher the σ_p, the higher the heterogeneity in the volume of passes managed by the players. A player's flow centrality in a match is defined as their betweenness centrality in the passing network [10]. The team flow centrality, FC(T, g), is hence defined as the average of the flow centralities of the players of team T in match g [10].

Figure 2: PlayeRank score by role for male and female players. Asterisks indicate a significant statistical difference between male and female players for that role.
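As a rough illustration of the passing network, the sketch below builds a weighted directed graph from (sender, receiver) pairs and computes the mean and variance of the number of passes made per player. It assumes that the receiver of each pass has already been inferred from the event stream (Table 1 stores only the passer) and that µ_p and σ_p refer to passes made; it does not compute the H indicator or the betweenness-based flow centrality, whose exact definitions are given in [5,6,10].

```python
from collections import Counter
from statistics import mean, pvariance

def passing_network(passes):
    """Map each directed edge (sender, receiver) to its number of passes."""
    return Counter(passes)

def pass_volume_stats(network):
    """Mean and (population) variance of passes made per player in the network."""
    made = Counter()
    for (sender, receiver), n_passes in network.items():
        made[sender] += n_passes
        made[receiver] += 0  # keep players who only receive as nodes
    volumes = list(made.values())
    return mean(volumes), pvariance(volumes)

# Toy passes between players 1, 2 and 3 of one team (made-up values).
passes = [(1, 2), (1, 2), (2, 3), (3, 1)]
network = passing_network(passes)
mu_p, sigma_p = pass_volume_stats(network)
print(dict(network), mu_p, sigma_p)
```

A graph library could then be used on the same edge weights to obtain each player's betweenness centrality and, by averaging, the team flow centrality.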
Table 3 shows the top ten male and female teams with the highest average H indicator H_avg, average PR score PR_avg(T), and average FC score FC_avg(T). Spain is the male team with the best overall team performance (H^(M)_avg(Spain) = 1.67), and so is Japan in the women's World Cup (H^(F)_avg(Japan) = 1.56). In general, the H indicator of male teams (H^(M)_avg = 1.32) is higher (unpaired t-score = 2.67, p-value < 0.02) than that of female teams (H

In Summary
Our statistical analysis reveals that male and female teams do differ in many technical characteristics:
• Men perform more passes per match with a higher accuracy, indicating a higher volume of play and a better technical quality of men compared to women;
• Men perform longer passes and shoot from a longer distance than women, presumably due to the physical differences between genders (e.g., men have greater strength in the legs, which allows them to shoot from farther away);
• The typical performance quality of male teams, in terms of pass volume, heterogeneity, centrality and PR score, is higher than that of female teams. This result could be related to the different playing style;
• Women's ball recovery time is shorter than men's, denoting either a better capability of women to recover the ball or a lower capability to retain it, and characterizing a more fragmented game in women's football.

Table 4: Leave-one-team-out cross-validation results (i.e., Accuracy, Precision, Recall and F1-score) computed on the training dataset for each machine learning classifier used to predict a football team in a game as male (class 0) or female (class 1). The baseline classifier always predicts by respecting the training set's class distribution, which is balanced. The percentages in the table refer to the improvement of the machine learning model compared to the baseline results.

Team gender recognition
Having established that women's and men's football differ in many technical characteristics related to intensity of play, shooting distance, and performance quality, we now turn to the question: Can we design a machine learning classifier to distinguish between a male and a female football team? Machine learning can capture the interplay between technical features, and explanations extracted from the constructed classifier can reveal further insights on the differences between men's and women's football [14].
As a first step, we describe the behavior of a team T in match g by a performance vector of 19 variables and associate it with a target variable:
• number of events (# events) and number of events of each type (# shots, # fouls, # passes, # free kicks, # duels, # offside, # others, # accurate passes);
• percentage of accurate passes AccP, average shooting length ShootL and average pass length PassL;
• average time between passes PassV;
• average time to regain ball possession RecT and how long a team takes before re-starting the game through a free kick, a corner kick or a throw-in StopT;
• the H indicator H, the team flow centrality FC, the average PR score PR_avg and its standard deviation PR_std;
• the target variable indicates whether the team is male (class 1) or female (class 0).
We build a supervised classifier and use 20% of the dataset to tune its hyper-parameters through a grid search with 5-fold cross-validation. We use the remaining 80% of the dataset to validate the model using a leave-one-team-out cross-validation: in turn, we leave out all matches of one team and train the model using all matches of the remaining teams. We assess the performance of the model using four metrics [16]: (i) accuracy, the ratio of correct predictions over the total number of predictions; (ii) precision, the ratio of correct predictions over the number of predictions for the positive class (male); (iii) recall, the ratio of correct predictions over the total number of instances of the positive class (male); (iv) F1 score, the harmonic mean of precision and recall.
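The validation scheme and the metrics can be sketched as follows. This is a simplified illustration: the learner is a hypothetical one-feature threshold on AccP, standing in for the actual classifiers, the hyper-parameter tuning step is omitted, and the toy samples are made up. Male is the positive class (label 1), as in the text.

```python
from statistics import mean

def leave_one_team_out(samples):
    """Yield (team, train, test), holding out all matches of one team in turn."""
    for team in sorted({s["team"] for s in samples}):
        train = [s for s in samples if s["team"] != team]
        test = [s for s in samples if s["team"] == team]
        yield team, train, test

def fit_threshold(train):
    """Stand-in learner (assumption): midpoint between the class means of AccP."""
    male = mean(s["AccP"] for s in train if s["label"] == 1)
    female = mean(s["AccP"] for s in train if s["label"] == 0)
    return (male + female) / 2

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 score for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy team-match samples (made-up values): label 1 = male, 0 = female.
samples = [
    {"team": "A", "AccP": 0.85, "label": 1},
    {"team": "A", "AccP": 0.83, "label": 1},
    {"team": "B", "AccP": 0.86, "label": 1},
    {"team": "C", "AccP": 0.76, "label": 0},
    {"team": "C", "AccP": 0.74, "label": 0},
    {"team": "D", "AccP": 0.78, "label": 0},
]

y_true, y_pred = [], []
for _team, train, test in leave_one_team_out(samples):
    threshold = fit_threshold(train)
    for s in test:
        y_true.append(s["label"])
        y_pred.append(1 if s["AccP"] > threshold else 0)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Holding out whole teams, rather than single matches, prevents the model from being tested on a team whose other matches it has already seen during training.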
We try several learners to construct different types of classifiers (Decision Tree, Logistic Regression, Random Forest, and AdaBoost). All classifiers achieve a good performance (see Supplementary Figure 9), with an average relative improvement of 67% in terms of F1-score over a baseline classifier that predicts the team's gender randomly (Table 4). The best model, AdaBoost, has an improvement of 93% over the baseline in terms of F1-score. These results indicate that a classifier can distinguish between male and female teams solely on the basis of the performance variables.
The inspection of the reasoning underlying the model's decisions can provide us with deeper insights into the differences between men's and women's football. We extract global (i.e., inference on the basis of the complete dataset) and local (i.e., inference about an individual prediction) explanations from the best model (AdaBoost) using SHAP, a method to explain each prediction based on the optimal Shapley value [20]. Following game theory, the Shapley value of a performance variable is obtained by considering the possible combinations of the other variables and averaging the change in the prediction when the variable is present versus absent, which determines the importance of that single variable [20]. The interpretation of the Shapley value for the j-th variable is: the value of the j-th variable contributed φ_j to the prediction of a particular instance compared to the average prediction for the dataset [21].
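The idea behind the Shapley value can be illustrated with a brute-force computation on a toy model. This is not the SHAP library's algorithm, which relies on efficient approximations; it is a minimal sketch that enumerates every coalition of variables, and it assumes that an "absent" variable is replaced by a fixed baseline value. The model `predict` and all numbers are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phis = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                phi += weight * (value(coalition | {j}) - value(coalition))
        phis.append(phi)
    return phis

def predict(z):
    """Toy model with two additive terms and one interaction (made up)."""
    return 2 * z[0] - z[1] + z[0] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phis = shapley_values(predict, x, baseline)
print(phis)
```

The contributions sum to the difference between the prediction for x and the prediction for the baseline, which is the additivity property that makes per-variable explanations of a single prediction possible.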
Figure 3 shows the global explanation of AdaBoost, in which variables are ranked based on their overall importance to the model according to their SHAP values. Pass accuracy (AccP) is by far the most important feature to classify a team's gender. Recovery time (RecT), average interruption time (StopT), pass velocity (PassV), pass length (PassL), # duels and # passes, PR_avg, # fouls and PR_std are other important features for the decision-making process.
Figure 4 shows the summary plot that combines feature importance and feature effects, where each point indicates a team. The position of a feature on the y-axis indicates the importance of that feature to the model's decision. A point's color, in a gradient from blue (low) to red (high), indicates its numerical value. The position of a point on the x-axis indicates the associated SHAP value: positive values indicate that a team is more likely to be male; negative values that it is more likely to be female. Higher values of AccP (red points) are associated with higher SHAP values. This indicates that male players are typically more accurate in passing, a property that is used by the classifier to discriminate a male team from a female one. Similarly, high values of RecT are associated with a higher probability of a team being male, highlighting a fortiori that female teams are characterized by a more fragmented play.
[...] 2) than to those of a male one (AccP^(M) = 0.84, PassV^(M) = 2.99, Table 2). Overall, the sum of the SHAP values indicates that France played a match in accordance with the typical characteristics of a male team. Figure 5b shows the prediction of a match in the women's World Cup 2019, USA vs Spain. In this case, AdaBoost correctly predicts that USA is a female team, basing its decision mainly on AccP, PR_std, StopT, RecT, and PassV. USA has RecT(USA, USA vs SPA) = 28.94 and StopT(USA, USA vs SPA) = 30.34, closer to the typical values of men's football (RecT^(M) = 27.32 and StopT^(M) = 23.27, Table 2) than to those of women's football (RecT^(F) = 19.58 and StopT^(F) = 18.92, Table 2). In contrast, the values of AccP(USA, USA vs SPA) = 0.81 and PassV(USA, USA vs SPA) = 2.83 are more similar to those of women's teams (Table 2). Overall, the sum of the SHAP values leads the model to classify USA as a female team.
Figures 6a and 6b visualize the predictions of the AdaBoost classifier on a test set of 31 men's matches and 21 women's matches with respect to the two most important variables, AccP and RecT. In just two cases out of 21, AdaBoost misclassifies a female team as a male one (Figure 6b). For example, in the match Brazil vs France of the women's World Cup, RecT(Brazil, BRA vs FRA) = 35.89 and AccP(Brazil, BRA vs FRA) = 0.75 (Figure 6c); the model misclassifies Brazil as a male team because its recovery time is closer to the typical value of men's football (RecT^(M) = 27.32) than to that of women's football (RecT^(F) = 19.58, Table 2).
In just three cases out of 31, a male team is misclassified as a female one (Figure 6a, red crosses). For example, in the match Sweden vs Mexico of the men's World Cup, Mexico is correctly classified as a male team: its values of AccP(Mexico, SWE vs MEX) = 0.85 and RecT(Mexico, SWE vs MEX) = 30 are indeed close to the typical values of men's football. In contrast, in the match Germany vs South Korea, Germany is misclassified as a female team, mainly because RecT(Germany, GER vs KOR) = 20.31 makes it more similar to a female team (RecT^(F) = 19.58) than to a male one (RecT^(M) = 27.32, see Table 2 and Figure 6d).

Conclusions
The availability of spatio-temporal match events related to the last men's and women's World Cups allowed us to compare the technical characteristics of men's and women's football. While most of the existing works focus on the differences in physical characteristics, we reconstructed a complex mosaic of the differences between male and female players. Our statistical analysis revealed that differences do exist in several technical features: the time between two consecutive events and the time required to recover possession are the lowest in women's football; conversely, male teams are typically more accurate in passing, and they kick the ball from a greater distance than women players. The inspection, through global and local explanations, of a model that classifies team gender from the technical features confirmed that the percentage of accurate passes and the time to recover possession are crucial to distinguish between the two sports. In particular, the use of local explanations provides a novel perspective to reason about the difference between men and women in football, highlighting the reason behind the peculiar cases in which the classifier has been "fooled" by a team's technical performance.
Our results are open to various interpretations. First of all, the statistical non-significance of the difference in the number of events and shots suggests that, overall, men's and women's football have similar play intensity. Conversely, the higher accuracy of passes in men's matches may be due to the higher technical level of male players, which may be rooted in the fact that national teams in the men's World Cup are mainly composed of professional players. In contrast, several female national teams (e.g., Italy) are composed of non-professional players or of players who have been professionals only for a short time. Although the technical level of women's football is increasing rapidly, there is still a technical gap between the two sports. The shorter recovery time observed for women's matches may be due to both the lower pass accuracy (i.e., more balls lost) and a better capacity of women to press the opponents and recover ball possession. Performance indicators reveal that centrality is higher in men's football, denoting the presence of "hub" players that centralize the game (higher flow centrality) and higher variability in the performance quality across teammates (higher H indicator and PR score). This suggests that passes in women's football are more uniformly distributed across the teammates. Women's football also has a preference for short passes over long balls. Since accurate long balls are harder than short ones, this preference may be a solution to compensate for women players' lower technical level.
As future work, we plan to investigate differences in men's and women's football in national tournaments, and to investigate to what extent these differences vary nation by nation and between national and continental competitions. Are the differences we found in this paper more marked in the longer competitions for clubs?

Figure 1: (a) Example of events observed for a player in our dataset. Events are shown at the position where they occurred. This plain "geo-referenced" visualization of events helps in understanding how to reconstruct the player's behavior during the match. (b) Distribution of the number of events per match. On average, a football match in our dataset has 1600 events.

Figure 3: Ranking of feature importance (mean SHAP value) extracted from the team gender classifier.

Figure 4: Distribution of the impact of each feature on the team gender classifier. The color represents the feature value (red high, blue low), and the position of the point indicates the SHAP value.

Figure 5: Local SHAP explanations for two examples in our dataset: France in France vs Croatia and USA in USA vs Spain. Feature values that increase the probability of a team being male are shown in red, those decreasing the probability are in blue.

Figure 6: (a, b) Scatter plots displaying AccP versus RecT for a test set of teams in male matches (a) and female matches (b). Circles indicate a team correctly classified by the team gender classifier, crosses indicate a mistake by the classifier. The dashed lines are at the median values for the two variables over the entire data set. In plots (c) and (d) we report the local SHAP explanations of two misclassified examples.

Figure 7: Heatmaps describing the pitch zones from which free-kick shots and shots in motion were most frequently made by male and female players during their respective World Cups. They show the kernel estimate of the first-order intensity function λ(s), where the event points s_i are the free-kick shots ((a) and (b)) and the shots in motion ((c) and (d)), and the football field is the region of interest R. The darker the green, the higher the number of free-kick shots and shots in motion in a specific field zone. The pitch length (x) and width (y) are in the range [0, 100], which indicates the percentage of the field starting from the left corner of the attacking team.

Table 3: List of the top ten football teams with the highest average H, FC, and PR indicators in the two competitions.