Evaluation of soccer team defense based on prediction models of ball recovery and being attacked: A pilot study

With the development of measurement technology, data on movements in actual games of various sports can be obtained and used for planning and evaluating tactics and strategy. Defense in team sports is generally difficult to evaluate because of the lack of statistical data. Conventional evaluation methods based on score prediction are considered unreliable because they predict events that are rare throughout a game. In addition, it is difficult to evaluate the various plays leading up to a score. In this study, we propose a method to evaluate team defense from a comprehensive perspective related to team performance by predicting ball recovery and being attacked, which occur more frequently than goals, using player actions and positional data of all players and the ball. Using data from 45 soccer matches, we examined the relationship between the proposed index and team performance in actual matches and throughout a season. Results show that the proposed classifiers predicted the true events (mean F1 score > 0.483) better than the existing classifiers, which were based on rare events, i.e., goals (mean F1 score < 0.201). Also, the proposed index had a moderate correlation with the long-term outcomes of the season (r = 0.397). These results suggest that the proposed index might be a more reliable indicator than winning or losing, which includes accidental factors.


Introduction
The development of measurement technology has allowed for the generation of data on movements in various sports games for use in planning and evaluating tactics and strategy. For example, tracking data during a game of soccer, including the positional data of the players and ball, is commonly used for players' conditioning (e.g., running distance or the number of sprints) [1,2]. However, during a soccer match, all 22 players and the ball interact in complex ways as each team tries to score goals or to prevent goals from being scored (sometimes referred to as conceding). Hence, it is necessary to evaluate the performance of not only individuals but also the entire team [3]. In particular, defensive tactics are considered difficult to evaluate because few statistics are available, in contrast to statistics such as goals scored in the case of attacks.
There are three main approaches to quantitatively evaluate teams and players in soccer, mainly from an attacking perspective. Note that the evaluation of defense is almost the same as that of offense, because the defenders try to prevent goals from being scored and the attackers try to score. The first approach is based on scoring prediction, which evaluates plays based on changes in the expected values of goals scored and conceded, predicted using tracking data [4-7] (reviewed in [8]) and action data such as dribbling and passing [9], as well as other rule-based methods (e.g., based on the two distances of the ball player from the nearest defender and the goal [10]). In particular, Valuing Actions by Estimating Probabilities (VAEP) [9] was recently proposed as a framework for valuing player actions in soccer, which is based on models that predict goals scored and conceded from on-the-ball actions such as dribbles and passes. The second approach evaluates plays such as passes and effective attacks that lead to shots. For example, a previous study evaluated the value of passes based on relationships to the expected score and the difficulty in successfully completing a pass [11]. An effective attack can be defined as a play that will likely lead to a score [12]. Previous studies have analyzed pass networks [13], three-player interactions [14,15], pass reception [16], and the related defensive weaknesses [17]. For defense, researchers have evaluated interception [18] and the effectiveness of defensive play by the expected value of a goal-scoring opportunity conceded [19]. In the third approach, the spatial positioning of the players is evaluated by calculating the dominant region with the use of a Voronoi diagram [20] and the Gaussian distribution [21]. Recent research has also been conducted on the evaluation of movements that create space for teammates [22,23].
Other approaches such as self-organizing maps using artificial neural networks have been reviewed in [3]. However, these approaches have several limitations. For evaluation based on the prediction of scoring (i.e., the first approach), the evaluation is not reliable because it predicts events that are rare throughout a game, and the process leading to the goals is sometimes difficult to evaluate. Furthermore, the second approach, which evaluates specific plays that lead to goals, and the third approach, which concerns positioning, sometimes ignore the relationship with overall performance such as wins and losses (note that some studies have investigated the relationship between performance indicators and outcomes such as promotion to the elite leagues [24]). Also, since many studies on the first and second approaches have used only the actions and coordinates of players around the ball, it would be difficult to evaluate players at greater distances from the ball and the team as a whole. To address these issues, we propose a method called Valuing Defense by Estimating Probabilities (VDEP), which utilizes the actions and positional data of all players and the ball. The ultimate goal of defense in soccer is to prevent the opposing team from scoring a goal. However, since goal-scoring scenes are rare events, using them may lead to ineffective training of a classifier and unreliable evaluation of the events when the dataset size is limited (the verification results of the VAEP method [9] will be presented later). Therefore, to reasonably evaluate the defense of a team, we propose the VDEP method to evaluate important factors for preventing goals from being scored. The VDEP method evaluates the potential increase in the number of ball recoveries and the potential decrease in the number of effective attacks.
The number of effective attacks was chosen instead of the number of shots because we also regard as defensive failures the scenarios in which an attacker chooses to pass the ball rather than to shoot in the penalty area. Therefore, in this study, we evaluate the process of defense based on the expected value computed by classifiers that predict ball recovery and being attacked, analogously to the VAEP method [9], which is based on the prediction of scoring and conceding. Because the two approaches are similar, the results of VAEP and our approach can be meaningfully compared.
The main contributions of this work are as follows: (i) the proposed method is based on the prediction of ball recovery and effective attacks, which occur more frequently than the rare goals; (ii) based on a comprehensive perspective related to team outcomes, we evaluated the team's defense. Methodologically, we modified the existing method called VAEP [9], which is based on classifiers that predict scoring and conceding, so that the defensive process can be evaluated by applying the approach to ball recovery and being attacked. We validated the classifiers of the proposed and existing methods and showed that the proposed classifiers predicted the true events better than the existing classifiers. Moreover, we examined the relationship between VDEP and team performance in actual matches and throughout the season, as compared with VAEP. We also presented examples of evaluating a game and a complete season of a specific team.

Dataset
In this study, we used event data (i.e., labels of actions, such as passing and shooting, recorded at 30 Hz and the simultaneous xy coordinates of the ball) and tracking data (i.e., xy coordinates of all players and the ball recorded at 25 Hz) from 45 games from week 30 to week 34 of the Meiji Yasuda Seimei J1 League 2019 season. Note that we use the term "event" in the above sense, following the previous studies [6,9]. These data were provided by the Research Center for Medical and Health Data Science in the Institute of Statistical Mathematics (an academic organization) and Data Stadium Inc. That is, we did not estimate the xy coordinates ourselves but used the provided coordinates directly in the subsequent analysis. Data acquisition was based on the contract between the soccer league (J League) and the company (Data Stadium Inc.), not between the players and us. The company was licensed to acquire these data and sell them to third parties, and it was guaranteed that the use of the data would not infringe on any rights of J League players or teams. We obtained the data by participating in a competition hosted by the academic organizations, and the central idea of this study was independent of the competition (not restricted by the competition). In all 45 games, there were 106 goals scored, 1,174 shots, 3,701 effective attacks, and 9,408 ball recoveries (all based on the provided event data). An effective attack is defined as an event that finally ends in a shot or penetrates the penalty area. Ball recovery is defined as a change in the attacking team before or after the play due to some factor other than an effective attack. In this study, an effective attack is referred to as being attacked from the defender's perspective. It should be noted that, for the labeling of effective attack and ball recovery, we labeled each event (not an attack segment), and there are four combinations of positive/negative effective attack and positive/negative ball recovery.
Since transitions of the attacking team in soccer sometimes occur within a very short time, we did not explicitly define an attack segment, following the previous work [9].

Proposed Method
Based on the motivation mentioned in the Introduction, here we describe the details of our proposed method. Suppose that the state of the game is given by $S = [s_1, \ldots, s_N]$ in chronological order. We consider $s_i = [a_i, o_i]$, whereas the previous studies [9,19] used only $a_i$, which includes the $i$th action involving the ball and its coordinates. The proposed method utilizes classifiers trained with the state $s_i$, which includes the feature $o_i$ far from the ball (off-ball) at the time of the action. Since all defensive and offensive actions in this study are evaluated from the defender's point of view, the time index $i$ below refers to the $i$th event.
Given the game state $S_i$ of a certain interval, we define the probability of future ball recovery $P_{\mathrm{recoveries}}(S_i)$ and the probability of being attacked $P_{\mathrm{attacked}}(S_i)$ in a state $S_i$ at an event $i$ based on the classifiers trained from the data. Defensive players are considered to act so that $P_{\mathrm{recoveries}}(S_i)$ becomes higher or $P_{\mathrm{attacked}}(S_i)$ becomes lower. Therefore, the value of defense in the proposed method, $V_{\mathrm{vdep}}$, is defined with a parameter $C$ that adjusts the relative values of ball recoveries and effective attacks as follows:

$$V_{\mathrm{vdep}}(S_i) = P_{\mathrm{recoveries}}(S_i) - C \, P_{\mathrm{attacked}}(S_i) \tag{1}$$

In this study, we adjusted these values based on the frequency of each event in the training data. As described below, we determined $C \approx 2.5$ because the ratio of ball recoveries to effective attacks is approximately 9,408 : 3,701 ≈ 2.5 : 1 (the value differs for each fold of the five-fold cross-validation). In this study, we assume that the importance of ball recoveries and effective attacks is determined by these frequencies, but this may be controversial. We discuss this point in the Discussion section as future work.
Since the main aim of this study is to evaluate the team, we define the evaluation value per game for team $p$ as follows:

$$R_{\mathrm{vdep}}(p) = \frac{1}{M} \sum_{S_i \in S^{p}_{M}} V_{\mathrm{vdep}}(S_i) \tag{2}$$

where $M$ is the number of events for team $p$ in a match and $S^{p}_{M}$ is the set of states $S$ of team $p$ up to the $M$th event. Similarly, the mean values of only $P_{\mathrm{recoveries}}$ and $P_{\mathrm{attacked}}$ are defined as $R_{\mathrm{recoveries}}(p)$ and $R_{\mathrm{attacked}}(p)$, respectively. For the VAEP method [9] in the previous study, the value averaged by the playing time of each player was used. However, since the time each team played the game was almost the same, in this study each team is evaluated by the sum $S_{\mathrm{vaep}}(p)$ as the VAEP value. Also, $S_{\mathrm{scores}}(p)$ and $S_{\mathrm{concedes}}(p)$ are used in the analysis as separate evaluation values based on the probabilities of scores $P_{\mathrm{scores}}$ and concedes $P_{\mathrm{concedes}}$ (note that the VAEP value [9] is calculated based on the prediction of goals scored and conceded).
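The two definitions above can be illustrated with a minimal Python sketch. It assumes that the per-event value is the recovery probability minus C times the attacked probability and that the per-game team value is the mean over the team's events; the function names and the three probability pairs are hypothetical and are not taken from the dataset.

```python
from statistics import mean


def vdep_value(p_recovery, p_attacked, c):
    """Per-event value: recovery probability minus C-weighted attacked probability.

    Assumed form of Equation 1; not copied from the authors' code.
    """
    return p_recovery - c * p_attacked


def team_vdep(states, c):
    """Mean per-event value over (p_recovery, p_attacked) pairs of one team in one match."""
    return mean(vdep_value(p_rec, p_att, c) for p_rec, p_att in states)


# C taken as the frequency ratio of ball recoveries to effective attacks
# (counts reported in the Dataset section).
c = 9408 / 3701

# Hypothetical classifier outputs for three events of one team.
states = [(0.40, 0.10), (0.35, 0.15), (0.30, 0.20)]
r_vdep = team_vdep(states, c)
print(f"C = {c:.3f}, R_vdep = {r_vdep:.3f}")
```

A positive result means that, under this weighting, the expected recoveries outweigh the expected effective attacks for the listed events.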

Pre-processing and Feature Creation
The flow diagram of the analysis procedure is shown in Fig 1. In summary, we perform data pre-processing and feature creation, train the classifiers, predict with the classifiers, and compute VDEP. In this subsection, we describe pre-processing and feature creation. In the previous study [9], the time range of the input $S_i$ to the classifier covered the $i$th, $(i-1)$th, and $(i-2)$th events. In this study, since the effect of $s_{i-2}$ on the prediction performance was small in preliminary experiments, we used $s = [a, o]$ for the $i$th and $(i-1)$th events only. In the first classification for estimating $P_{\mathrm{recoveries}}(S_i)$, we assigned a positive label (= 1) to the game state $S_i$ if the defending team in the state $S_i$ recovered the ball in the subsequent $k$ events, and a negative label (= 0) if the ball was not recovered. Similarly, in the second classification for estimating $P_{\mathrm{attacked}}(S_i)$, we assigned a positive label (= 1) to the game state $S_i$ when an effective attack was made in the subsequent $k$ events. An illustrative example is shown in Fig 2A. In both classifications, $k$ is a parameter freely determined by the user. If $k$ is small, the prediction is short-term and reliable, has fewer positive labels, and is unambiguous to interpret; if $k$ is large, the prediction is long-term, has more positive labels, includes many factors, and is ambiguous to interpret. Since this trade-off is intrinsically difficult to resolve, we set $k = 5$ based on domain knowledge.
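The look-ahead labeling can be sketched as follows. This is an illustration only: it assumes that "the subsequent k events" means events i+1 to i+k (excluding event i itself), and the function name and the toy event sequence are hypothetical.

```python
def label_future_events(is_positive, k=5):
    """Return 1 for state i if any of events i+1 .. i+k is positive, else 0.

    `is_positive` is a per-event list of 0/1 flags (e.g., 1 where the
    defending team recovered the ball). The window definition is an assumption.
    """
    n = len(is_positive)
    return [int(any(is_positive[i + 1:i + 1 + k])) for i in range(n)]


# Toy sequence: a ball recovery occurs at event index 3.
recovered = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
labels_k2 = label_future_events(recovered, k=2)
labels_k5 = label_future_events(recovered, k=5)
print(labels_k2)
print(labels_k5)
```

With a larger k, more of the states preceding the recovery receive a positive label, which is the trade-off between short-term and long-term prediction described above.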
Fig 1. Flow diagram of the analysis procedure. The computation of our VDEP is composed of four steps: pre-processing and feature creation, training classifiers, prediction with the classifiers, and computing VDEP. First, we perform data pre-processing and feature creation for the classifier using the provided xy coordinates and event data. Second, we train the classifiers of ball recoveries and being attacked using xy coordinates in the training dataset. Third, the trained classifiers predict the ball recoveries and being attacked using xy coordinates in the test dataset. Finally, we compute VDEP using Equations 1 and 2. The existing VAEP [9] can be computed in a similar procedure.
Next, we describe the feature vector creation. We first created the 36-dimensional features of the $i$th and $(i-1)$th events. The feature $a_i$ near the ball in this study was constructed using the event and tracking data. Specifically, we used the types of events used in the previous study [9] (19 types: pass, cross, throw in, free kick, corner kick, trap, foul, tackle, interception, shot, PK, own goal, goalkeeper hand clear, goalkeeper catch, clearance, block, dribble, off-side, and goal kick; for details, see S1 Text). We also used the event ID (1 dim.), the start/end time and the duration of the event (3 dim.), the xy coordinates of the ball at the start and the end (4 dim.), the displacements of the ball from the start to the end (x, y, and the Euclidean norm: 3 dim.), and those from the previous event start (3 dim.), the distance and angle between the ball and the goal (2 dim.), and whether there was a change in offense or defense from the previous event (1 dim.). Moreover, in this study, the off-ball feature $o_i$ at the time that the event occurred was included in the state $s_i$. Specifically, we used the x and y coordinates of the positions of all players (22 players, xy coordinates) and the distance of each player from the ball (22 players), sorted in order of closeness to the ball. Finally, to reflect the opponent's attacking ability, we added the opponent team's season scores (1 dim.) to the feature. Therefore, the feature has 139 dimensions in total (36 × 2 + 22 × 3 + 1 = 139).
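As a rough illustration of the off-ball part of the feature vector, the sketch below builds the 22 × 3 block (x, y, and distance to the ball) sorted by closeness to the ball, and checks the dimension count stated above. The interleaved layout, the sorting of all 22 players in a single list, and all names are assumptions, since the text does not specify them.

```python
import math


def off_ball_features(players, ball):
    """Return [x, y, distance] for each player, closest to the ball first.

    `players` is a list of (x, y) tuples and `ball` is an (x, y) tuple.
    The flattened, interleaved layout is an assumption for illustration.
    """
    rows = [(x, y, math.hypot(x - ball[0], y - ball[1])) for x, y in players]
    rows.sort(key=lambda row: row[2])
    return [value for row in rows for value in row]


# Hypothetical positions: 22 players on a line, ball at the origin.
players = [(float(i), 0.0) for i in range(21, -1, -1)]
features = off_ball_features(players, ball=(0.0, 0.0))

# Two events of 36 on-ball dimensions, the off-ball block, and one team-level feature.
total_dims = 36 * 2 + len(features) + 1
print(len(features), total_dims)
```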
In the data used in this study, labeled with $k = 5$ as above, the total number of events for all teams was 97,335, with 35,286 positive cases of ball recovery and 13,353 positive cases of being attacked. In terms of goals scored and conceded for the calculation of the VAEP value [9], there were 753 positive cases of goals scored and 227 positive cases of goals conceded (the total number of events was the same, but we set $k = 10$ in accordance with the previous study [9]). These indicate that goals scored and conceded are rare events compared to ball recoveries and being attacked (for the above reasons, it may be meaningful to compare VAEP with $k = 10$ and our VDEP with $k = 5$). Therefore, the goals scored and conceded may not be correctly verified by the area under the receiver operating characteristic curve (AUC) and accuracy in this study with a smaller dataset (compared with the larger dataset in the previous work [9]), as described below.
Fig 2. Example of the VDEP evaluation of a conceded goal (blue: attacking team) and the flow of the events with the ball. In this scene, the VDEP values were positive in all events, suggesting that the defense was not so bad that the goal was conceded. The VDEP values decreased between Matsubara's pass and Erik's trap, and between Erik and Wada's trap and pass, suggesting that the goal was conceded because of a forward pass or because the ball holder was allowed to go free.

Prediction Model Implementation
We adopted XGBoost (eXtreme Gradient Boosting) [25], which was used in the previous study [9], as the classifier to predict ball recoveries and being attacked. Gradient tree boosting [26] is a popular technique that sequentially produces a model in the form of a linear combination of decision trees and performs well on a variety of learning problems with heterogeneous features, noisy data, and complex dependencies. XGBoost [25] is a variant of the gradient tree boosting model that can be computed in a faster and more scalable way. Note that other classifiers can be used in the same framework. Although the prediction model itself does not consider the time series structure, following the previous VAEP framework [9] and as described above, the prediction models reflect the history of the input (the $i$th and $(i-1)$th events) and that of the output (the subsequent $k$ events).
When calculating VDEP and VAEP values, we used a five-fold cross-validation procedure. Here we define the terms training, validation, and test (datasets). We train the machine learning model using the training dataset, validate the model performance using the validation dataset (sometimes for determining hyperparameters), and finally test the model performance using the test dataset. The benefit of such a procedure is that the model is evaluated on new test data not used during training. In our case, the validation data were not used and the hyperparameters were fixed to the defaults of the Python library "xgboost" (version 1.4.1). We did not use early stopping for the training. The "cross-validation procedure" used here tests on the test dataset in a way analogous to the usual cross-validation, so that all data can be analyzed even with a small dataset [27-29]. In cross-validation, the original sample is randomly partitioned into five equal-sized subsamples. Of the five subsamples, a single subsample is regarded as the test data for testing the model, and the remaining four subsamples are used as training (and validation) data. The cross-validation process is then repeated five times, with each of the five subsamples used exactly once as the test data. There seems to be no best practice for determining the number of folds, and researchers usually select a value such as 5 or 10 in advance. A higher number of folds means that each model is trained on a larger training set and tested on a smaller test fold, which can lead to a lower prediction error because the model can use more training data, but it takes more computational time and the smaller test data may evaluate the error inaccurately. In our case, the dataset was not large and contained five weeks of games; thus, we chose five-fold cross-validation.
In practice, we repeated five times the training of the classifiers using the data of four weeks (36 games) and the prediction using the data of the remaining week (nine games), so that the data of all five weeks were finally predicted and evaluated and all games could be analyzed. In this study, since we prioritized the analysis among teams, we needed to utilize the limited data (five games for each team), and we assumed that the performances in each week's games were independent. For future work, we need to verify our method using different datasets (e.g., obtained in different environments). All computations were performed using Python (version 3.6.8). In particular, we customized the published Python code of the VAEP method [9] (https://github.com/ML-KULeuven/socceraction). We recorded the computational time of our method (VDEP) and the existing method (VAEP) during the five-fold cross-validation (i.e., 36 games for training and nine games for testing).
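The week-wise split can be sketched in a few lines of Python. This is a minimal illustration of holding out one week at a time, not the authors' code; the function name is hypothetical, and the week numbers follow the Dataset section.

```python
def leave_one_week_out(weeks):
    """Yield (train_weeks, test_week) pairs, holding out each week exactly once."""
    for test_week in weeks:
        train_weeks = [week for week in weeks if week != test_week]
        yield train_weeks, test_week


# Weeks 30 to 34 of the season, nine games per week (45 games in total).
weeks = [30, 31, 32, 33, 34]
games_per_week = 9
folds = list(leave_one_week_out(weeks))

for train_weeks, test_week in folds:
    n_train = len(train_weeks) * games_per_week
    print(f"train on weeks {train_weeks} ({n_train} games), test on week {test_week}")
```

Within each fold, a classifier such as `xgboost.XGBClassifier()` with default hyperparameters could be fitted on the events of the training weeks and used to predict probabilities for the events of the held-out week.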

Evaluation and Statistical Analysis
To validate the classifiers, we used the F1 score in addition to the AUC and Brier score used in the previous study [9]. AUC is the area under the curve obtained by plotting the true positive rate against the false positive rate as the decision threshold varies. The true positive rate is defined as the ratio of true positives to the sum of true positives and false negatives. The false positive rate is defined as the ratio of false positives to the sum of false positives and true negatives. AUC is 0.5 for random prediction and 1 for perfect prediction. The Brier score is the mean squared error between the predicted probability and the actual outcome, where a smaller value indicates a better prediction. However, these measures and the (more intuitive) accuracy score may not be informative when there are far more negative than positive cases, as in this and previous studies. As an intuitive example, the accuracy for being attacked in VDEP and for scored in VAEP will be 1 − 13353/97335 ≈ 0.863 and 1 − 753/97335 ≈ 0.992, respectively, when all cases are predicted as negative. AUC and the Brier score have similar problems because these indices evaluate large numbers of true negatives. Since AUC also involves the false positive rate, a problem similar to that of accuracy and the Brier score can occur when there are few true positives and many true negatives. In this study, we therefore also used the F1 score to evaluate whether the true positives can be classified without considering the true negatives. The F1 score is expressed as $F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$, where Recall is equal to the true positive rate, and Precision is defined as the ratio of true positives to the sum of true positives and false positives. In this index, only true positives are evaluated, not true negatives.
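The effect of class imbalance on these scores can be checked with a short, self-contained Python sketch. It uses the standard definitions of accuracy, F1 score, and Brier score and the event counts for being attacked reported above; the all-negative predictor and the function names are illustrative assumptions.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(y_true, y_pred):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_pred)) / len(y_true)


# Event counts for "being attacked" (k = 5); the predictor always outputs negative.
n_events, n_positive = 97335, 13353
y_true = [1] * n_positive + [0] * (n_events - n_positive)
y_pred = [0] * n_events

baseline_accuracy = accuracy(y_true, y_pred)
baseline_f1 = f1_score(y_true, y_pred)
baseline_brier = brier_score(y_true, [0.0] * n_events)
print(round(baseline_accuracy, 3), baseline_f1, round(baseline_brier, 3))
```

The all-negative predictor reaches a high accuracy and a low Brier score while its F1 score is zero, which is why the F1 score is used here for the comparison of classifiers.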
To compare F1 scores among the various classifiers for testing our hypothesis (other AUC and Brier scores are shown only as references), since the hypothesis of homogeneity of variances between methods was not rejected with Levene's test, a one-way analysis of variance was performed. As a post-hoc comparison, Tukey's test was used within the factor where a significant effect in one-way analysis of variance was found. Furthermore, the contribution of the input variables to the prediction of the VDEP method was calculated by SHAP (SHapley Additive exPlanations) [30], which utilizes an interpretable approximate model of the original nonlinear prediction model.
For the evaluation of defense using the VDEP and VAEP values [9], we present examples to quantitatively and qualitatively evaluate a game and a season of a specific team. Next, we examined the relationships with the outcomes of actual games (goals scored, goals conceded, and winning points, where win, draw, and loss were assigned 3, 1, and 0 points, respectively) and the relationship with the team results throughout the season using Pearson's correlation coefficient among all 18 teams.
For all statistical analyses, p < 0.05 was considered significant. However, since the sample size was small (N = 18) in the correlation analysis, the r value indicating the magnitude of the correlation was also used as an effect size for evaluation. As described in a previous study [31], correlation coefficients less than 0.20 were interpreted as slight, almost negligible relationships; between 0.20 and 0.40 as low correlation; between 0.40 and 0.70 as moderate correlation; between 0.70 and 0.90 as high correlation; and greater than 0.90 as very high correlation. In this study, the correlation coefficients were rounded off at the third decimal place for interpretation. All statistical analyses were performed using the Python library SciPy (version 1.5.4).
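The correlation analysis and its interpretation scale can be sketched as follows. This pure-Python version is for illustration only (the analysis in the paper used SciPy); the function names are hypothetical, the coefficient is rounded to two decimals before interpretation, and the treatment of values exactly on a boundary is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def interpret_correlation(r):
    """Map |r|, rounded to two decimals, to the verbal scale described above."""
    magnitude = round(abs(r), 2)
    if magnitude < 0.20:
        return "slight, almost negligible"
    if magnitude < 0.40:
        return "low"
    if magnitude < 0.70:
        return "moderate"
    if magnitude < 0.90:
        return "high"
    return "very high"


# Coefficients reported in this paper, used here only to exercise the scale.
for r in (0.397, 0.392, -0.245, 0.953):
    print(r, interpret_correlation(r))
```

Rounding before interpretation explains why r = 0.397 is read as a moderate correlation while r = 0.392 is read as a low correlation.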

Verification of Classifiers
To validate the VDEP and VAEP [9] methods, we first investigated the prediction performances of their classifiers. As mentioned earlier, there are two classifiers in VDEP, for ball recoveries and for being attacked (similarly, the existing VAEP has two classifiers, for scores and for concedes). These classifiers predict the probabilities of ball recoveries ($P_{\mathrm{recoveries}}$) and being attacked ($P_{\mathrm{attacked}}$) in VDEP (similarly, the probabilities of scores $P_{\mathrm{scores}}$ and concedes $P_{\mathrm{concedes}}$ in VAEP). In Table 1, the classifiers of VDEP show better predictions than those of VAEP [9] (note that the output and the number of occurrences to be predicted are different). The AUCs of $P_{\mathrm{recoveries}}$ and $P_{\mathrm{attacked}}$ in VDEP were better than those of $P_{\mathrm{scores}}$ and $P_{\mathrm{concedes}}$ in VAEP, whereas the opposite held for the Brier scores. Specifically, the Brier score is directly affected by the larger number of true negatives (see the above accuracy example), and thus it may show a better result for VAEP than for our VDEP. On the other hand, AUC evaluates the true positive rate and is less biased than the Brier score; thus, it shows a better result for our VDEP than for VAEP. However, again, these indices may not be validly evaluated because they include a large number of true negatives in the evaluation (thus, we did not perform statistical analysis on these variables). Instead, the F1 score was calculated, and the statistical analysis identified a significant main effect among $P_{\mathrm{recoveries}}$, $P_{\mathrm{attacked}}$, and $P_{\mathrm{scores}}$ ($F = 144.40$, $p < 1.0 \times 10^{-6}$; $P_{\mathrm{concedes}}$ was excluded because its average was near zero). The post-hoc analysis shows that the F1 scores of VDEP ($P_{\mathrm{recoveries}}$, $P_{\mathrm{attacked}}$) were significantly higher than that of $P_{\mathrm{scores}}$ ($ps < 0.002$). This indicates that the VDEP method predicted true positives correctly, while VAEP did not. For details, the confusion matrices of the four classifiers are shown in S1 Fig.
Next, the contribution of the input variables to the prediction of the VDEP method was calculated by SHAP [30]. For $P_{\mathrm{recoveries}}$, in Fig 3, the distance to the ball of the defender closest to the ball had the highest contribution, followed by the events where there was an offensive or defensive change immediately beforehand. For $P_{\mathrm{attacked}}$, in Fig 4, the x-coordinate of the attacker closest to the ball (in the direction of the goal) and
Table 1. Evaluation of classifiers for the proposed and conventional methods.

Examples of Team Defense Evaluation
Evaluation of a defensive play
An advantage of the VDEP method is the ability to show the effectiveness of the formation of the defending team against the attacking team at a particular moment in the game. For example, when VDEP is applied to a conceded goal, it is easy to see where in the series of events the cause of the goal lies. As an example, consider the first goal in the match between Yokohama F. Marinos and FC Tokyo shown in Fig 2. A positive VDEP value can be interpreted as a good defense and a negative value as a bad defense. In this example, the VDEP values were positive in all events, indicating that the defense was not so bad that the goal was conceded. However, to be precise, the VDEP values decreased between Matsubara's pass and Erik's trap, and between Erik and Wada's trap and pass, suggesting that the goal was conceded because of a forward pass or because the ball holder was allowed to go free.

Evaluation of a game
Since it is sometimes difficult to score goals in soccer, the team that dominates the game does not always win. Therefore, to continuously strengthen the team, it is necessary to analyze the game regardless of the immediate outcomes. The VDEP method is expected to be used as a more stable evaluation index than wins and losses which are limited by contingent factors.
For example, in the match between Yokohama F. Marinos and FC Tokyo, Yokohama won the match by a score of 3 to 0. We examined the reasons for the unexpectedly large gap in the matchup of the two top teams (the numbers of shots taken by both teams were the same in the game). Although $R_{\mathrm{recoveries}}$ for Yokohama (0.371) was better than that for Tokyo (0.348), $R_{\mathrm{attacked}}$ and $R_{\mathrm{vdep}}$ for Tokyo (0.116 and 0.049) were better than those for Yokohama (0.159 and -0.040). These indicate that Tokyo's defense made it difficult for Yokohama to score goals. As in this game, there are cases where the evaluation results do not match the game outcome, e.g., when the quality of the opponent's shots was high even though the defensive evaluation was good (note that the proposed method did not reflect how likely an effective attack is to score). Thus, the use of the VDEP method to quantitatively evaluate the defense in each match will allow for a more detailed analysis than wins and scores.
Statistically, correlation analysis was performed between the outcome of the game and the proposed and existing indices (the analyzed data are given in S1 Data). In the case of $R_{\mathrm{vdep}}$, there was a moderate positive correlation with winning points ($r_{16} = 0.464$, $p = 0.050$) and a low positive correlation with goals scored ($r_{16} = 0.392$, $p = 0.106$). In the case of $S_{\mathrm{vaep}}$, there was a high positive correlation with winning points ($r_{16} = 0.830$, $p < 0.001$) and a very high positive correlation with goals scored ($r_{16} = 0.953$, $p < 0.001$). It is natural that $S_{\mathrm{vaep}}$ can sufficiently predict the number of goals scored in a match because it is based on the prediction of scores. Interestingly, even though $S_{\mathrm{vaep}}$ is also based on the prediction of conceded goals, it had a slight, almost negligible relationship with goals conceded ($r_{16} = -0.040$, $p > 0.05$). On the other hand, $R_{\mathrm{vdep}}$ had a low correlation with the goals conceded in the game ($r_{16} = -0.245$, $p > 0.05$).
Fig 3. The input variables related to the prediction of $P_{\mathrm{recoveries}}$, presented in the order of their contributions. Of the top 20 features, those at the top had greater contributions than those at the bottom. Each dot represents one event. The color represents the value of the feature (blue and red indicate low and high, respectively). The horizontal axis shows the impact on the prediction (strongly positive and negative impacts are plotted to the right and left, respectively). For example, when the value of type foul a1 is 1, the prediction is likely to be zero. In the variable names, a0 and a1 denote the analyzed event and the one previous event, respectively. The x, y, and p denote the x and y coordinates and the distance from the ball. The team 1 variable shows whether there was a change in offense or defense from the one previous event. The dx and movement variables are the x displacement and the Euclidean norm of the displacement of the ball from the start to the end. The type variables are those related to the event types.

Defensive evaluation of teams in multiple games
It is also possible to characterize and evaluate team defenses throughout a season using the VDEP values in multiple games. Fig 5 shows the average VDEP values for each team. For example, Yokohama was able to defend with a high probability of recovering the ball, which may be related to its high number of goals (see S1 Data). On the other hand, the probability of being attacked was also high, suggesting that the team adopted a high-risk, yet high-return, defensive tactic. Meanwhile, Hiroshima, which had the fewest goals conceded in the league (see S1 Data), shows a high probability of ball recovery and a low probability of being attacked, suggesting that these properties led to the small number of goals conceded.
Statistically, we performed a correlation analysis between each team's performance over the whole season and the evaluation indices (the data are given in S1 Data). R_vdep had a moderate positive correlation with winning points (r_16 = 0.397, p = 0.103) and low correlations with goals scored (r_16 = 0.342, p = 0.162) and goals conceded (r_16 = −0.291, p = 0.239). Meanwhile, S_vaep had a moderate positive correlation with goals scored (r_16 = 0.497, p = 0.034), but almost negligible relationships with winning points (r_16 = 0.177, p > 0.05) and goals conceded (r_16 = −0.098, p > 0.05). For VDEP, the correlation coefficients with the game performances and those with the entire season were similar, whereas for VAEP the associations differed.
Fig 5. Defensive evaluation of teams in multiple games. The vertical axis is R_attacked and the horizontal axis is R_recoveries. The vertical and horizontal black lines are the league averages of R_recoveries and R_attacked, respectively. The farther to the right a team is plotted, the more likely its defense is to recover the ball, and the lower a team is plotted, the less likely its defense is to concede. For example, Yokohama defended with a high probability of recovering the ball. On the other hand, its probability of being attacked was also high, suggesting that the team adopted a high-risk, high-return defensive tactic.

Discussion
In this study, we proposed a method to comprehensively evaluate a team's defense related to the team's performance, based on the prediction of ball recovery and being attacked (which occur more frequently than goals), using player actions and positional data of all players and the ball. First, we verified the proposed and existing indices based on the prediction performance. Second, we quantitatively analyzed the defensive evaluations of the proposed and existing methods. Finally, we discuss the limitations of the proposed methods and future perspectives.
VAEP [9] and the proposed VDEP evaluate players and teams on the assumption that the underlying classifiers predict well. To validate the classifiers, the previous study [9] used AUC and Brier scores. However, as mentioned above, these metrics may not be reliable for rare events because they are dominated by the large number of true negatives. Therefore, we computed the F1 score, and the results showed that the VDEP method predicted true positives correctly, whereas the VAEP method did not. This suggests that the VDEP method is a reliable method that can evaluate defensive performances based on better predictions.
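The effect of true negatives on these metrics can be illustrated with a small synthetic example. The sketch below uses made-up labels and probabilities (not data from this study) and an assumed decision threshold of 0.5: a baseline that assigns a low probability to every event is compared with a detector that finds most of the rare positives at the cost of some false alarms.

```python
def brier(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def f1(y_true, y_prob, threshold=0.5):
    """F1 score after thresholding the probabilities (threshold is an assumption)."""
    pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 1)
    fp = sum(1 for y, q in zip(y_true, pred) if y == 0 and q == 1)
    fn = sum(1 for y, q in zip(y_true, pred) if y == 1 and q == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Synthetic rare-event data: 10 positives among 1000 events.
y = [1] * 10 + [0] * 990

# Baseline: the same low probability for every event (never detects a positive).
p_baseline = [0.01] * 1000

# Detector: flags 8 of the 10 positives, plus 12 false alarms.
p_detector = [0.9] * 8 + [0.01] * 2 + [0.9] * 12 + [0.01] * 978

for name, p in [("baseline", p_baseline), ("detector", p_detector)]:
    print(f"{name}: Brier = {brier(y, p):.4f}, F1 = {f1(y, p):.3f}")
```

In this toy setting the baseline obtains the lower (better) Brier score although it never identifies a positive event, while the F1 score separates the two classifiers clearly. This is the sense in which a metric that ignores true negatives is more informative when the target event is rare.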
Regarding the team evaluations using the proposed and existing indices, the correlation analysis revealed a moderate positive correlation between the season outcome (winning points) and the proposed VDEP value, whereas there were strong positive correlations between the game outcome (winning points and goals scored) and the existing VAEP value [9]. Furthermore, for the VDEP value, the correlation coefficients with the analyzed game performances and those with the entire season were similar overall, whereas those of the VAEP value were very different. These results suggest that R_vdep could be a well-balanced indicator for evaluating both attacks (after the ball recovery) and the defense itself (prevention of being attacked and the ball recovery). On the other hand, the VAEP method [9] is based on the prediction of offensive play and shows no correlation with the goals conceded. We expect that using VDEP in addition to the various indicators used so far will contribute to the continuous strengthening of a team, regardless of immediate wins and losses, which can be influenced by chance factors.
There are several recommendations for future studies. The first is to increase the number of analyzed games for better prediction of longer-term game performances. The second is the determination of the weighting constant C in Equation 1 for ball recovery and being attacked. Although this study determined C based on the number of occurrences of both events, the constant should be determined in a way that better reflects the practical values of these events in soccer. The third is an evaluation of the proposed method. Since such a new evaluation index often has no ground truth (or gold standard), we cannot validate our method against ground truth. In future work, we need to construct a quantitative framework to evaluate such a method or to perform a subjective evaluation. The last is the evaluation of individual players. Since VDEP evaluates team defense, it is difficult to evaluate the performance of individuals. For example, future studies are necessary to compute the change in VDEP when a player moves in different directions.
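The sensitivity to the weighting constant C can be illustrated with a toy calculation. The sketch below assumes that Equation 1 has the form VDEP = P_recoveries − C × P_attacked; this form, the team labels, and all numerical values are hypothetical and serve only to show how the choice of C can change the ordering of teams.

```python
def vdep(p_recoveries: float, p_attacked: float, c: float) -> float:
    """Assumed form of Equation 1: recovery probability minus weighted attack probability."""
    return p_recoveries - c * p_attacked


# Hypothetical season averages (P_recoveries, P_attacked); not values from this study.
teams = {
    "Team A": (0.30, 0.20),  # high-risk, high-return profile
    "Team B": (0.24, 0.10),  # low-risk profile
}


def ranking(c: float):
    """Teams ordered from the highest to the lowest VDEP value for a given C."""
    return sorted(teams, key=lambda name: vdep(*teams[name], c), reverse=True)


for c in (0.5, 1.0):
    print(f"C = {c}: {ranking(c)}")
```

With the smaller weight the high-risk team ranks first, and with the larger weight the low-risk team does, so a principled choice of C matters for any comparison between teams with different defensive profiles.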