Audio-based performance evaluation of squash players

In competitive sports it is often very hard to quantify performance. Whether a player scores or overtakes an opponent may depend on only thousandths of a second or on millimeters. In racquet sports like tennis, table tennis and squash many events occur within a short time, and recording and analysing them can help reveal differences in performance. In this paper we show that it is possible to build a framework that utilizes characteristic sound patterns to classify the types of these events and to localize their positions precisely. From this basic information the shot types and the ball speed along the trajectories can be estimated. Comparing these estimates with the optimal speed and target allows the precision of a shot to be defined. The detailed shot statistics and precision information significantly enrich and improve the data available today. Feeding them back to the players and the coaches helps to describe playing performance objectively and to improve strategic skills. The framework is implemented, and its hardware and software components are installed and tested in a squash court.


Introduction
At present there are many talented athletes in competitive sports, and the differences between individual performances are often too small to spot. This creates a race that is already present in the practice period, so more and more coaches and players look for new means and aids to make the preparation for tournaments ever more effective. Many new technological achievements are available on the market. Small electronic devices are capable of measuring various metrics, including those relevant for sports, such as heart rate, blood temperature and blood pressure monitors, pedometers, speedometers and accelerometers, to name a few. Using such devices is more than necessary, since the results in a competition and the final scores may depend on fractions of a millimeter. Another reason to use measurement devices that yield objective performance metrics is that when athletes are under heavy load, with adrenaline in their veins, it is hard, if at all possible, for them to spot and fix their mistakes. In certain sports continuous or prompt feedback is definitely helpful, and squash is one of them. Squash is a very rapid ball and racquet game with typically 40-60 hit events per minute. The different shot classes are defined by the surfaces the ball interacts with during its flight. Some shot classes are very rare because they are tricky to deliver or occur only in circumstances where the rally may seem already lost. Detailed statistics of the various hits and shot patterns therefore reflect the quality of the players and are very important information for both coaches and squash players. However, these data and their statistical analysis are not available at present because of the pace of squash.
Given the speed of the game, human processing of events allows only the score to be registered in real time; recording the shot types and the detailed sequences of shots is practically impossible. One possible solution might be to analyse videos of the matches using image processing, as has been shown to work for tennis [1]. For squash, however, this approach remains difficult even with high speed and high resolution cameras, due to the small size of the ball and the view provided by the cameras. Traditionally cameras are placed behind the court, so the players often block the view of the ball during the match, which makes the reconstruction of ball trajectories a hard problem. Providing reliable statistics by this approach requires human processing and validation, so in the end a thorough analysis of a tournament costs many times the duration of the event in man-hours.
In this study we introduce a framework that uncovers this information based on the analysis of acoustic data. Playing squash produces characteristic sound patterns. The sound footprint of each rally is a projection of all the details about the strength and the position of the ball hitting various surfaces in the court. Naturally, this pattern, which maintains the natural order of the events, is contaminated by some additional noise. Recording the sound from several directions allows for inverting the problem and for making statistical statements about where an event took place in the play and what type of event it was. We focus on events generated by the ball hits, which serve as a basis for further analysis and for the reconstruction of shot patterns or ball trajectories. Note that the framework to be detailed can be applied to various other types of ball games.

Related work
Squash and soccer were the first sports to be analysed by means of analysis systems. Formal scientific support for squash emerged in the late 1960s. The current applications of performance analysis techniques in squash are investigated in depth in the book of Stafford et al. [2].
One test, developed by squash coach Geoffrey Hunt, is the "Hunt Squash Accuracy Test" (HSAT) [3], a reliable method used by coaches to assess shot hitting accuracy. The test is composed of 375 shots across 13 different types of squash strokes, and it is evaluated based on a total score expressed as the number of successful shots.
Recent technological advances have facilitated the development of sport analytical software such as the Dartfish video-based motion analysis system [4,5]. However, these systems still require a considerable amount of professional assistance.
To the best of our knowledge there is no previous research investigating the applicability of sound analysis techniques to squash performance analysis; therefore it is not possible to directly compare our system to existing solutions. In other application environments a wide literature can be found on real-time sound source localization, which is the topic most closely related to our work. The emerging application of camera pointing in video conferencing environments motivated many research papers in the field of visual speaker localization [6][7][8][9][10]. A linear-correction least-squares estimation procedure is proposed in [6][7][8]. The simulation results in [7] show that the bias level of this technique is around 30 cm. In the work of Tobias Gehrig et al. [10] a method is presented for speaker tracking using audio-visual features, namely time delay of arrival estimation on microphone array signals and face detection on multiple camera images. The sound source localization is based on a maximum likelihood approach. In the experimental results the authors measure a root mean square error of 57.2 cm for the audio-only solution and 49.9 cm for the audio-video approach. One conceptually simple solution for source localization is beamforming [11], where the source location is estimated by calculating the steered output power of a beamformer over a set of candidate locations. Although this concept has advantages in speech localization and enhancement, it is computationally expensive and its resolution is too low for our purposes.

The measurement equipment
This study is based on the analysis of sound waves generated during squash play. Squash is a game in which many different sources of sound are present, including the players themselves (their sighing or their shoes squeaking on the floor), the ball hitting surfaces (like the walls, the floor or the racquet) and also external sources (including the ovation of the spectators or sound generated in an adjacent court). Here we focus on audio events related to the ball.
When planning the experiments the following constraints had to be investigated and satisfied. The framework should be fast from a signal processing point of view, because the target information is most valuable when, in a competitive situation, it helps fine-tune tactical decisions made by the coach and/or the player. The cost of the equipment should be kept low, and the installation of the sensors requires a careful design to prevent them from disrupting the play. As the spatial localization of the ball is one of the fundamental goals, a lower bound on the sampling rate is enforced so that displaced sound sources can still be told apart.
In sample frame are in synchrony. The highest sampling rate of the sound card is used (96 kHz), so between two consecutive samples the front of a sound wave travels approximately 3.6 mm (taking the speed of sound as about 343 m/s).
According to their functionalities the software components fall into the following groups. Signal processing is done in the Analysis module, which includes the detection of the audio events, the classification and the filtering of the detections and, after matching the event detections of several channels, the localization of the sound source. While these steps of signal processing can be done in real time, a Storage module is also implemented so that the audio of important matches can be recorded. Recording the data helps to train the parameters of the classification algorithms, and it also enables a complete re-analysis of former data with different detectors and/or different classifiers. All output generated by the Analysis module is fed to the output queue. Hardware and software components are triggered and reconfigured via a web services API exposed by the Control interface. Finally, to be able to listen to what is going on in the remote court, a Monitoring interface provides a mixed, downsampled and compressed live stream across the web.

The ball impact detection
The localization and the classification of ball hits both require the precise identification of the beginning of the corresponding events in the audio streams. The detection of ball impact events is carried out for each audio channel independently and in parallel, which speeds up the overall performance of the framework significantly. Different detection algorithms of various complexities were investigated; two extreme cases are sketched here. The first model assumes that the background noise follows a normal distribution. An event is detected if new input samples deviate from the Gaussian distribution by more than a predefined threshold value. For each channel the mean and the variance estimates of a finite subset of the samples are continually updated according to Welford's algorithm [12].
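The first detector can be illustrated with a minimal sketch. It is not the implementation used in the framework: the class name, the threshold expressed in standard deviations and the warm-up length are hypothetical, and for brevity the running estimates cover all samples seen so far rather than a finite subset.

```python
import math


class WelfordGaussianDetector:
    """Flags samples that deviate strongly from a running Gaussian noise model."""

    def __init__(self, threshold=6.0, warmup=100):
        self.threshold = threshold  # hypothetical cut, in standard deviations
        self.warmup = warmup        # samples used only to bootstrap the estimates
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def _update(self, x):
        # Welford's numerically stable single-pass update.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0

    def step(self, x):
        """Return True if x looks like the onset of an event."""
        if self.count >= self.warmup:
            std = math.sqrt(self.variance())
            if std > 0.0 and abs(x - self.mean) > self.threshold * std:
                return True  # outlier: report it, keep it out of the noise model
        self._update(x)
        return False


def first_detection(samples, **kwargs):
    """Index of the first sample flagged as an event, or None."""
    detector = WelfordGaussianDetector(**kwargs)
    for index, x in enumerate(samples):
        if detector.step(x):
            return index
    return None
```

One detector instance per channel would be run independently, which matches the per-channel, parallel processing described above.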
The second method is an extension of the windowed Gaussian surprise detection by Schauerte and Stiefelhagen [13]. The algorithm tackles the problem by evaluating the relative entropy [14]. It is first applied in the frequency domain, and if there is a detection then a finer scale search is carried out in the time domain. The power spectrum of w-sized chunks of windowed data samples is calculated. Between detections the series of the power spectra is modelled by a w-dimensional Gaussian. The a priori parameters of the distribution are calculated from n elements in the past, and the a posteriori parameters are approximated by including the new power spectrum. A new detection takes place when the Kullback-Leibler divergence between the a priori and the a posteriori distributions exceeds a predefined threshold; primed parameters correspond to the a posteriori distribution. The time resolution at this stage is w, and to increase the precision a new search is carried out in the time domain by evaluating the Kullback-Leibler divergence for 1-d data. In order to bootstrap the a priori distribution parameters, n samples from the former windows are used.
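The surprise measure can be sketched as follows. Two simplifying assumptions are made here that are not stated in the text: the Gaussian is given a diagonal covariance, so that the divergence reduces to a sum over frequency bins, and the divergence is taken from the a posteriori to the a priori distribution. All function names are hypothetical.

```python
import math


def diagonal_gaussian_kl(mean_post, var_post, mean_prior, var_prior):
    """KL(posterior || prior) for Gaussians with diagonal covariance."""
    total = 0.0
    for mp, vp, m0, v0 in zip(mean_post, var_post, mean_prior, var_prior):
        total += math.log(v0 / vp) + (vp + (mp - m0) ** 2) / v0 - 1.0
    return 0.5 * total


def fit_diagonal_gaussian(spectra, floor=1e-12):
    """Per-bin mean and variance of a list of equally long power spectra."""
    n = len(spectra)
    dims = len(spectra[0])
    means = [sum(s[j] for s in spectra) / n for j in range(dims)]
    variances = [
        # The floor avoids division by zero for perfectly constant bins.
        max(sum((s[j] - means[j]) ** 2 for s in spectra) / n, floor)
        for j in range(dims)
    ]
    return means, variances


def surprise(history, new_spectrum):
    """Divergence between the models with and without the newest spectrum."""
    prior = fit_diagonal_gaussian(history)
    posterior = fit_diagonal_gaussian(history + [new_spectrum])
    return diagonal_gaussian_kl(posterior[0], posterior[1], prior[0], prior[1])
```

A detection would be reported whenever `surprise(history, new_spectrum)` exceeds the predefined threshold, with `history` holding the n most recent spectra.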

The localization of sound events
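In this section we lay down a probabilistic model to determine the time and location of an audio event. For a unique event we denote these unknowns by $t$ and $r_{ev}$, respectively. The location vector $r_{ev}$ is a 3-dimensional array of Cartesian coordinates $(x, y, z)$; however, the calculation presented here also applies to lower-dimensional setups. The inputs required to find the audio event are the locations $r^{mike}_i$ of the $N + 1$ detectors and the timestamps $\tau_i$ at which these synchronized detectors sense the event ($0 \le i \le N$).
The probability that microphone $i$ detects an event at $(r, t)$ is
$$p_i(\tau_i \mid r, t) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{(c\,t_i - r_i)^2}{2\sigma_i^2}\right],$$
where $c$ is the speed of sound, $t_i = \tau_i - t$ is the propagation delay and $r_i = \lVert r - r^{mike}_i \rVert$ is the distance between the sound source and the microphone. The uncertainty $\sigma_i$ depends on the characteristics of the microphone, which we will consider constant in the first approximation.
By introducing relative delays $\tilde t_i = t_i - t_0$ the joint probability of the relative delays detected is
$$p(\tilde t_1, \dots, \tilde t_N \mid r) = \int \mathrm{d}t_0 \prod_{i=0}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{\left(c\,(t_0 + \tilde t_i) - r_i\right)^2}{2\sigma_i^2}\right].$$
The formula can be rearranged as
$$p = \left(\prod_{i=0}^{N} \frac{1}{\sqrt{2\pi}\,\sigma_i}\right) \int \mathrm{d}t_0\, e^{-f(t_0)}, \qquad f(t_0) = \sum_{i=0}^{N} \frac{\left(c\,(t_0 + \tilde t_i) - r_i\right)^2}{2\sigma_i^2},$$
where $f$ is a quadratic function of $t_0$, and in the expression for $p$ the Gaussian integral follows:
$$\int \mathrm{d}t_0\, e^{-f(t_0)} = \sqrt{\frac{2\pi}{f''}}\; e^{-f(t_0^*)}, \qquad f'' = c^2 \sum_{i=0}^{N} \frac{1}{\sigma_i^2}.$$
The first order derivative $f'$ vanishes at
$$t_0^* = \frac{\sum_i \sigma_i^{-2}\,(r_i / c - \tilde t_i)}{\sum_i \sigma_i^{-2}},$$
where $f$ takes the value
$$f(t_0^*) = \frac{1}{2} \sum_{i=0}^{N} \frac{(d_i - \bar d)^2}{\sigma_i^2}, \qquad d_i = r_i - c\,\tilde t_i, \qquad \bar d = \frac{\sum_i \sigma_i^{-2} d_i}{\sum_i \sigma_i^{-2}}.$$
This formula can be interpreted as a variance formula, which can be rewritten as
$$f(t_0^*) = \frac{1}{2} \left[\sum_i \frac{d_i^2}{\sigma_i^2} - \frac{\left(\sum_i \sigma_i^{-2} d_i\right)^2}{\sum_i \sigma_i^{-2}}\right].$$
A good approximation of the audio event maximizes the likelihood $p$, which at the same time minimizes $f(t_0^*)$; thus we seek the solution of the equations $\nabla_r f(t_0^*) = 0$. In practice $f$ behaves well and its minimum can be found by the gradient descent method. Fig 2 shows a situation where the ball hit the front wall and 6 microphones detect this event error free. To show the behaviour of the function, $f$ is evaluated on the floor, on the front wall and on the right side wall. Finding the minimum of $f$ takes less than ten gradient steps.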
The likelihood based localization model is derived for a noiseless situation, assuming the perfect detection of samples in each channel. In a real environment, however, noise is present, and the error that shifts the detection shows up in the final result of the localization. In order to track this effect the method was investigated numerically as follows. 10000 points in the volume of the court are selected randomly, and the sound propagation to each of the six microphones is calculated. Next, Gaussian noise is added to the ideal detections in all channels, with increasing variation (σ = 1, 10, 50). In Fig 3 the noiseless case is compared to cases with increasing errors. The figure presents the cumulative distribution of the error, i.e. the difference between the randomly selected point and the location guessed by the model. Naturally, by increasing the detection error the error in the position guess is increasing, but the

Classification
It is the task of the classification module to distinguish between the different sound events according to their origin. Sound events are classified based on the type of the surface that the ball hit. This surface can be the wall, the racquet, the floor or the glass. When the sound does not fit any of these classes, like the squeaking of shoes, it is classified as a false event. The classification enhances the overall performance of the system in two ways. First, skipping the localization of false events speeds up the processing. Second, in doubtful situations, when the calculated location of the event falls near several possible surfaces, knowing the type of the surface that was hit can reinforce the localization. For example, a sound event localized a few centimetres above the floor could be generated by a racquet hit close to the floor or by the floor itself.
Classification utilizes feed-forward neural networks trained with backpropagation [15][16][17][18]. The training sets are composed of vectors belonging to 5461 audio events, which have been manually labelled. Based on these audio events two types of input were constructed for training.
In the first case temporal data is used directly. A vector element of the training set $T_1$ is the sequence of the samples around a detection, taken for each channel. Here the channel index is dropped, $d$ is a unique detection and $w$ sets the length of the vector. Given the sampling rate of 96 kHz and setting $w = 300$, the neural network is taught with 6.25 millisecond long data, which corresponds to 600 samples, that is $2w$.
The second feature set $T_2$ is built up of the power spectra $|F(\cdot)|^2$ of these sample vectors, where $F$ denotes the discrete Fourier transform.
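The two feature sets can be sketched as below. The window is assumed to span 2w samples centred on the detection, which is consistent with the quoted 6.25 ms at 96 kHz for w = 300; the exact index range and the function names are assumptions. A naive discrete Fourier transform is used to keep the sketch free of dependencies; a fast Fourier transform would be the practical choice.

```python
import cmath


def temporal_window(samples, d, w):
    """Return the 2*w samples centred on detection index d (one channel)."""
    if d - w < 0 or d + w > len(samples):
        raise ValueError("window does not fit inside the recording")
    return list(samples[d - w:d + w])


def power_spectrum(window):
    """Squared magnitude of the discrete Fourier transform (naive O(n^2) DFT)."""
    n = len(window)
    spectrum = []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, x in enumerate(window)
        )
        spectrum.append(abs(coeff) ** 2)
    return spectrum
```

A vector for the temporal feature set would then be `temporal_window(channel, d, 300)`, and the corresponding spectral vector `power_spectrum(temporal_window(channel, d, 300))`.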
A single neural network model in which all event classes are handled together performed poorly in our case. Therefore, separate discriminative neural network models were built for all four classes (racquet, wall, floor and glass impact) and for both of the training sets. It has also been investigated whether any of the input channels introduces a discrepancy. In order to discover this effect, models were built and trained for each individual channel, and another one handling the six channels together. Note that not all possible combinations of the models were trained, because some channels detected certain events poorly; for example, microphones near the front wall detected glass events very rarely.
In the training sets the class of interest was always under-represented. To balance the classifier the SMOTE [19] algorithm was used, which is a synthetic minority over-sampling technique. A new element is synthesized as follows. The difference between a feature vector from the positive class and one of its k nearest neighbours is computed. The difference is multiplied by a random number between 0 and 1 and added to the original feature vector. This technique forces the minority class to become more general, and as a result the class of interest becomes as well represented in the training data as the majority set.
Different network configurations were tried. For the direct temporal input a network with 20 hidden layers (10 neurons in each layer) performed the best, while for the spectral input a network with 10 hidden layers (10 neurons in each layer) is the best choice.
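The synthesis step can be sketched in a few lines. This is only an illustration of the interpolation described above, not the SMOTE implementation used in the study; the function name, the default k and the way base vectors are drawn are assumptions.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Synthesise n_new vectors by interpolating towards nearest neighbours."""
    rng = rng or random.Random()
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of the base vector within the minority class.
        others = [v for v in minority if v is not base]
        others.sort(key=lambda v: math.dist(base, v))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # uniform random number in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic
```

Every synthetic vector lies on the line segment between a minority sample and one of its neighbours, so the new elements fill in the region occupied by the minority class rather than duplicating existing samples.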

Analysis
In this section the datasets and the performance of each module of the framework are presented.

Datasets
In order to analyse the components of the framework implementing the proposed methods, two sets of audio and video records were used. Datasets are available at https://figshare.com. Audio 1 was recorded on the 18th of May 2016, when a squash player was asked to target specific areas of the wall. This measurement was necessary to significantly increase the number of the different hits in the training datasets $T_1$ and $T_2$, and it was also manually processed to be able to validate the operation of the detector and the localization components. Audio 2 resembles data in a real situation, as it contains a seven minute squash match recorded on the 8th of March 2016. Table 1 summarizes the details of these audio recordings.
Training the neural network models requires properly labelled datasets. After applying the ball impact detection algorithm to the audio records, the timestamps of the detected events were manually categorized as front wall event, racquet event, floor event or glass event. In the categorization procedure video files helped in doubtful situations. Every sudden sound effect that was detected by the algorithm but does not belong to these relevant classes was labelled as a false event. In Audio 1 prescribed audio events were generated and recorded, and it does not contain any false events. In contrast, Audio 2 was captured in a real situation and by nature it contains several false events.

Detection results
The performance of the detector is analysed by comparing the timestamp reported by the detector, $d_{detector}$, with the human reading, $d_{human}$. For Audio 1 the cumulative probability distribution of the time difference is shown for each channel in Fig 4, and the average error and its variance are shown in Table 2, grouped by the event types present in the dataset. One can observe that the detectors in channels ch4 and ch5 perform poorly for front wall and racquet events. When estimating the position, discarding one or both of these channels will enhance the precision of the localization. However, for floor events these two channels performed the best, along with channel ch2. For glass events the smallest deviations were measured on channels ch2, ch3 and ch5.
In Table 3 the error statistics for dataset Audio 2 are shown. Intensive events, like front wall impacts, can be detected precisely, whereas the detection of milder sounds like a floor or glass impact is less accurate.
The false discovery rate and the false negative rate of the detector were examined on Audio 2. A false positive is counted if the detector signals a false event, and false negatives are the missing detections. The results are summarised in Table 4.
The different settings of the microphones and the distinct acoustic properties of the squash court at the microphone positions are found to be the reasons for that phenomenon.
Eight-fold cross-validation [20] was used on the datasets to evaluate the performance of the classifiers. Three measures are investigated more closely: the accuracy, the precision and the recall. Accuracy (in Fig 5) is the ratio of correct classifications to the total number of cases examined, $(n_{tp} + n_{tn}) / n$. Precision (in Fig 6) is the fraction of correct classifications among the cases reported as relevant, $n_{tp} / (n_{tp} + n_{fp})$. Recall (in Fig 7) is the fraction of relevant instances that are retrieved, $n_{tp} / (n_{tp} + n_{fn})$.
Table 5 summarises the results of the best classifiers for each class. It can be seen that the classification of the front wall and the racquet events is reliable. However, the precision and the recall of floor and glass events are poor. The reason is that these classes are under-represented in the datasets. Whenever an unseen sample $x$ arrives, the best classifier of each class is applied to the new element. The class label $\hat{y}$ to which $x$ belongs is predicted by the following formula:
$$\hat{y} = \begin{cases} \ldots, & \exists k : f_k(x) > \mathrm{cut}_k \\ \text{false event}, & \text{otherwise} \end{cases}$$
where $C$ is the set of class labels without the class of false events, and $f_k(x)$, $\mathrm{cut}_k$ and $\mathrm{prec}_k$ are the confidence, the cutoff value and the precision of the best classifier in class $k$, respectively.
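The three measures and the combination of the per-class classifiers can be sketched as follows. The counts n_tp, n_tn, n_fp and n_fn are the numbers of true positives, true negatives, false positives and false negatives. The rule used to pick the winning class among those above their cutoff, namely the largest precision-weighted confidence, is an assumption made for this sketch; only the fallback to a false event is taken from the text. All function names are hypothetical.

```python
def accuracy(n_tp, n_tn, n_fp, n_fn):
    """Correct classifications divided by all cases examined."""
    return (n_tp + n_tn) / (n_tp + n_tn + n_fp + n_fn)


def precision(n_tp, n_fp):
    """Fraction of reported positives that are correct."""
    return n_tp / (n_tp + n_fp) if n_tp + n_fp else 0.0


def recall(n_tp, n_fn):
    """Fraction of relevant instances that are retrieved."""
    return n_tp / (n_tp + n_fn) if n_tp + n_fn else 0.0


def predict_label(confidences, cutoffs, precisions):
    """Combine per-class binary classifiers into a single label.

    ASSUMPTION: among the classes whose confidence exceeds the cutoff,
    the one with the largest precision-weighted confidence wins.
    """
    candidates = [k for k in confidences if confidences[k] > cutoffs[k]]
    if not candidates:
        return "false event"
    return max(candidates, key=lambda k: precisions[k] * confidences[k])
```

The three dictionaries passed to `predict_label` are keyed by class label (for example racquet, wall, floor and glass) and hold the confidence, the cutoff and the precision of the best classifier of each class.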

Localization results
Based on the geometry of the court and the placement of the microphones, and using the localization technique detailed in this study, the 3-d position of the source of the event can be estimated for each set of detection timestamps. Localization is still possible when not all source channels provide a detection of the event. Four or more corresponding timestamps will yield a 3-d estimate, whereas with three timestamps the localization of events constrained to a surface (e.g. planes like a wall or the floor) remains possible.
In Fig 9 the located events present in dataset Audio 1 are shown. In this measurement scenario the player was asked to hit different target areas on the front wall. It was a rapid exercise, as the ball was shot back at once. The ball hit the floor only a few times; most of the sound is composed of alternating racquet and front wall events. In Fig 10 the front wall events are shown. The target areas can be seen clearly, and it is also visible that the spots scatter a little more on the left. The reason could be that the player is right handed, or that this target area was hit later during the experiment, when the player showed tiredness.
Measuring the error of the localization method is not straightforward, because the ball hitting the main wall does not leave a mark where the impact happened, and there was no means to take pictures of these events. Taking advantage of the geometry of the front wall, an error metric can be defined for front wall events. The error δ is defined as the offset of the approximated location from the plane of the front wall. In Fig 11 the error histogram is shown. The mean of δ should vanish, and the smaller its variance, the better the framework located the events. From this exercise one can read that the standard deviation is σ(δ) < 3 cm, which is smaller than the size of the squash ball.
Another way to define the error relies on human readings of the events. In the dataset Audio 1 all of the sound events were marked by a human as well as by the detector algorithm. Localizing the events using both inputs, the direct position difference can be investigated. The mean difference between the positions is 11.8 cm and the standard deviation is 39.9 cm.
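The front wall error metric amounts to a few lines of arithmetic, sketched below. The coordinate convention, with the front wall in the plane where the second coordinate is zero, and the function names are assumptions made for this sketch.

```python
import statistics


def front_wall_offsets(positions, wall_coordinate=0.0, axis=1):
    """Signed offset of each estimated position from the front wall plane."""
    return [p[axis] - wall_coordinate for p in positions]


def offset_statistics(positions, wall_coordinate=0.0, axis=1):
    """Mean and (population) standard deviation of the offsets."""
    offsets = front_wall_offsets(positions, wall_coordinate, axis)
    return statistics.mean(offsets), statistics.pstdev(offsets)
```

A mean close to zero indicates that the estimates are not systematically shifted in front of or behind the wall, while the standard deviation measures their scatter around the plane.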

Discussion
Our results support that in sports where the relevant sound patterns are distinguishable, careful signal processing allows the localisation of shots. The described system is optimized for handling events, and as a consequence the real-time analysis of data is possible, which is important for giving instant feedback. The framework can be extended to provide higher level statistics of events, such as the evolution of shot types. From the wide range of possible applications we highlight three use cases. Firstly, during a match the players can learn about their precision within a short time and, if necessary, change their strategy. Secondly, during practice coaches can track the development of the players' hit accuracy. Thirdly, certain exercises can be defined that can be evaluated automatically and objectively, without the need for the coach to be present during the exercise.
In this study our framework was adapted to squash. In theory it can be extended to any other sport where the stereotypical events are associated with a specific sound pattern. In applications where typical patterns are present but the surroundings introduce a significant amount of noise, for example in tennis games played in the open, a solution could be to use additional microphones, possibly with special characteristics, to record the noise, allowing its contribution to be subtracted from all other input signals.

Ethics statement
In this study human participants were instructed to carry out specific squash exercises. Participants were informed beforehand about the fact that the sound would be recorded during the exercise. During the exercises the sound emerging mainly from the ball impacts was recorded. The recording itself does not contain any sensitive information. Along with the raw recordings no additional information about the participants is saved or published. Participants do not object to these recordings being made public.