The hot hand in the wild

Streaks of success have always fascinated people, and a lot of research has been conducted to identify whether the "hot hand" effect is real. While sports have provided an appropriate platform for studying this phenomenon, the majority of the existing literature examines scenarios in a vacuum, with results that might or might not be applicable in the wild. In this study, we build on the existing literature and develop a framework to quantify the extent to which success can come in streaks, beyond the stroke of chance, in a natural environment. Considering in-game basketball situations, our analysis provides statistical evidence that individual players do exhibit the hot hand to varying degrees, that is, individual players can consistently get into a streak of successful shots beyond random chance. However, as a whole, the average player exhibits shooting regression, that is, after consecutive makes he tends to perform below expectations. Even though our results are based on a sports setting, we believe that our study provides a path towards thinking of the hot hand beyond a laboratory-like, controlled environment. This is crucial if we want to use similar results to enhance our decision making and better understand short- and long-term outcomes of repeated decisions.


Introduction
Recently, a video showing Steph Curry, the Golden State Warriors point guard, making 105 consecutive three-pointers during a practice session went viral 1. Soon after, people tried to estimate the probability of this happening and concluded that, for the streak to have even a probability greater than 50% of occurring, Curry's true shooting percentage on that specific shot would have to be 99.5% 2! The video also reignited the discussion about whether the "hot hand" exists, with analysts using it as evidence that the hot hand is real 3,4.
From the classic work of Gilovich, Vallone and Tversky (GVT) 5 to the most recent major development from Miller and Sanjurjo 6, which reversed the GVT conclusions, the hot hand has occupied researchers in the social, statistical and computational sciences. Sports have had a prominent position in these studies since they provide a natural environment for studying streaks of success. Even though the vast majority of studies have focused on basketball (the original setting of the GVT study), other sports have seen their share of hot hand studies 7-10. Recently, similar studies have been conducted to understand streaks in creative and scientific careers 11.
Dorsey-Palmateer and Smith 7 point out an important, data-related shortcoming in the original GVT study, and in many that came after it: the treatment of different shots as identical. GVT themselves mentioned possible confounding factors (e.g., taking tougher shots, defensive adjustments etc.) when analyzing shot data from games. They overcame this problem by using free throw data, which can be thought of as identical attempts, and controlled shooting field experiments. This is in general the approach taken by the majority of the literature when studying the phenomenon in basketball, i.e., controlled shooting experiments or analysis of streaks in free throw attempts or three-point contests. Nevertheless, it should be evident that these settings do not accurately represent the context within which a basketball player performs. In fact, the aforementioned practice session video with Curry's 105 consecutive made three-point shots can also be thought of as a controlled environment, given that all shots are taken under the same conditions (ignoring the potential fatigue during his 100th shot compared to, say, his first one). Hence, it is not clear whether the conclusions from studies based on a controlled environment generalize to situations experienced in the real world. In this study, we eliminate the assumption of identical shots by utilizing a dataset of shots taken during NBA games. The rich metadata included allow us to account for the differences in shot quality between consecutive shots.
Quantifying the quality of a shot (or the shooting ability of a player) has drawn attention in the era of tracking data. Second Spectrum, the NBA's provider of tracking data, has developed its own proprietary model (quantified Shot Quality, qSQ) 12. Franks et al. 13 and Cervone et al. 14 used information from these tracking data to build a hierarchical logistic regression model for the success probability of a shot. Similar hierarchical models have been shown to discriminate better between the shooting abilities of players by shrinking empirical estimates toward more reasonable prior means 15. While these types of models make use of information at the time of shot release, more recent models use the full trajectory of the ball post-release 16,17. These models improve accuracy, at the cost of requiring ball trajectory information. However, estimating the quality of a shot, i.e., the shot make probability, introduces another challenge in the analysis. In particular, the shot make probability model will inevitably introduce an error in the estimation. We therefore also introduce a mechanism in our framework that accounts and adjusts for this error.
For our analysis, we build on and extend the seminal work by Miller and Sanjurjo 6. They showed that the GVT study, and others following it, suffer from the streak selection bias (explained in detail in the Methods section), which can lead to underestimation of the hot hand phenomenon. Combined with the aforementioned non-identical nature of the trials in our setting, existing approaches can lead to biased estimates of the hot hand effect from actual games. We develop a framework (details provided in the Methods) that can be applied to situations appearing in the wild. In particular, we extend the empirical test introduced by Miller and Sanjurjo 6 to cases where the individual trials of a sequence are not identically distributed. While we apply our test to data from basketball games, it should become evident that it is applicable to any binary sequence obtained through non-identical trials.

Materials and Methods
We start by providing the intuition behind the important observation from Miller and Sanjurjo 6. They pointed out that the GVT study, and many others with a similar analytical approach, suffered from the streak selection bias, which appears in all sample sizes but is particularly pronounced in small samples. Under the streak selection bias, if we take a finite series of makes (M) and misses (X) from a sequence of independent trials with constant success probability of 0.5, and we randomly choose one of the makes, then the probability that the next attempt is also a make is lower than 0.5. For example, if we consider a sequence of 20 shots with exactly 10 makes and 10 misses, then once we randomly select one of the makes, the probability that the next shot is a make is [...] ] perm. This null essentially states that the probability of observing a make following k consecutive makes in the data is the same as the one expected if we permuted the sequence randomly, and hence that there is no serial correlation between makes. For example, let us consider the sequence of shots shown in Figure 1. Let us further assume that all shots are from the same spot (e.g., practice right corner 3 shots such as the ones from Steph Curry). For k = 1, we have $\Pr[M \mid M]_{\text{data}}$ equal to 0.55. We randomly permuted the shots 500 times and obtained the distribution for $\Pr[M \mid M]_{\text{perm}}$ presented in inset (A). [...] which is what the GVT study used, starting with the assumption that the null hypothesis is true (i.e., there is no hot hand), we obtain a p-value of 0.1. This essentially translates to a probability of the observed sequence being a result of chance that is twice as large as under the permutation test, thus underestimating the phenomenon.
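The permutation idea described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the example sequence and the choice of 2,000 shuffles are all assumptions made here. Averaging the statistic over shuffles of a balanced 20-shot sequence also makes the streak selection bias visible, since the average falls below 0.5.

```python
import random

def prob_make_after_streak(shots, k=1):
    """Fraction of makes among shots that immediately follow k straight makes.

    `shots` is a list of 1 (make) and 0 (miss). Returns None when no shot
    in the sequence is preceded by k consecutive makes.
    """
    followers = [shots[i] for i in range(k, len(shots)) if all(shots[i - k:i])]
    return sum(followers) / len(followers) if followers else None

def permutation_test(shots, k=1, n_perm=500, seed=0):
    """One-sided permutation p-value for the null of no serial correlation.

    Hypothetical helper: shuffles the observed outcomes, recomputes the
    statistic each time, and reports the share of shuffles at least as
    large as the observed value.
    """
    observed = prob_make_after_streak(shots, k)
    if observed is None:
        raise ValueError("no shot follows k consecutive makes")
    rng = random.Random(seed)
    shuffled = list(shots)
    perm_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        stat = prob_make_after_streak(shuffled, k)
        if stat is not None:  # skip shuffles where the statistic is undefined
            perm_stats.append(stat)
    p_value = sum(s >= observed for s in perm_stats) / len(perm_stats)
    return observed, perm_stats, p_value

# Illustrative 20-shot sequence with exactly 10 makes and 10 misses.
example = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0]
observed, perm_stats, p_value = permutation_test(example, k=1, n_perm=2000)
mean_under_null = sum(perm_stats) / len(perm_stats)  # below 0.5: the bias
```

Comparing the observed value against `perm_stats` rather than against the nominal 0.5 is what removes the bias from the test.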
We extend the permutation test to account for the non-identical nature of shots in an actual game setting. In particular, from an actual game we observe a sequence of shots for a player, but not every shot has the same success probability. To estimate the make probability $p_i$ of a shot $s_i$ we use a dataset obtained from the SportVU optical tracking system for the 2013- [...] For our model, we build a feedforward neural network with 4 hidden layers. Each layer has 250 units and ReLU activation, i.e., $\mathrm{ReLU}(y) = \max(0, y)$. We use a validation set for early stopping during training. Our final model is evaluated on a held-out set and has an accuracy of 66%, which is on par with state-of-the-art shot make probability models 17.
We also evaluate the calibration of the output probabilities. Calibration is a measure of the accuracy of the predicted probabilities, rather than of the binary outcome alone. Simply put, if there are two models $M_1$ and $M_2$ that predict a shot will be made with probabilities 51% and 85% respectively, and the shot is eventually made, both have the same accuracy but they differ in the quality of their probabilities; $M_2$ is more certain and, on this shot, closer to the outcome than $M_1$. A typical way of evaluating the probability calibration of a classifier is through the reliability curve, that is, the plot of the predicted probabilities from the classifier versus the observed probabilities. In particular, we group the shots in our test set based on their predicted probability, and for each group we estimate the fraction of shots that were actually made. The latter constitutes the observed probability. Ideally, for a well-calibrated model these two probabilities should be equal. This essentially means that among all the shots for which the model predicted an x% probability of being made, around x% were actually made. Inset (B) of Figure 1 presents the reliability curve for our shot probability model as obtained through an (out-of-sample) test set. As we can see, our predicted probabilities closely follow the observed shot make probabilities.
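The reliability curve described above can be computed with simple binning. The sketch below is an assumption-laden illustration: the function name, the ten equal-width bins and the synthetic data are choices made here, since the text does not state how the shots were grouped.

```python
import random

def reliability_curve(probs, outcomes, n_bins=10):
    """Return (mean predicted prob, observed make rate, count) per non-empty bin.

    `probs` are predicted make probabilities in [0, 1]; `outcomes` are the
    matching 1/0 results. Equal-width bins are an assumption of this sketch.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    curve = []
    for items in bins:
        if items:
            mean_pred = sum(p for p, _ in items) / len(items)
            observed = sum(y for _, y in items) / len(items)
            curve.append((mean_pred, observed, len(items)))
    return curve

# Synthetic, perfectly calibrated predictions: each outcome is drawn with
# exactly the predicted probability, so the curve should hug the diagonal.
rng = random.Random(1)
demo_probs = [rng.random() for _ in range(20000)]
demo_outcomes = [1 if rng.random() < p else 0 for p in demo_probs]
demo_curve = reliability_curve(demo_probs, demo_outcomes)
```

For a well-calibrated model, the first two entries of each tuple are close to each other in every bin.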
We can now estimate the shot make probability for each shot in the data, which we use to test for the presence of the "hot hand". In order to make sure our predictions are out-of-sample, we perform "leave-one-season-out" (LOSO) training. That is, in order to make out-of-sample predictions for one season, we train the model on the rest of the seasons (in our case just a single other season). The neural network architecture used in each training run is the same, and the out-of-sample performance of each LOSO model in terms of accuracy and calibration is statistically indistinguishable from the overall performance discussed above. LOSO ensures that there is no data leakage from the model training to our analysis and the estimation of the hot hand effect. We use players with at least 1,000 shots during the two seasons. Given that there are 82 regular season games in each season, this means that we filter out players who took fewer than approximately 6 shots per game. This threshold was chosen in order to provide, on average, sequences from individual games that can be used to examine the hot hand hypothesis for values of k > 1. As we will see in our results, we are able to examine up to k = 4. Furthermore, for each player we only consider games for which we have information for all of their shots in that game. For example, in some cases the distance of the closest defender or the touch time is not provided. In this case we cannot estimate the shot make probability, and we filter out not only this specific shot but all the shots from the same game, since the sequence would essentially be broken.
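The data preparation steps above (leave-one-season-out splits and dropping whole games that contain a shot with missing features) could look roughly as follows. The record layout and the field names `season`, `player`, `game`, `defender_dist` and `touch_time` are hypothetical; the actual dataset schema is not given in the text.

```python
def leave_one_season_out(shots):
    """Yield (held_out_season, train_rows, test_rows) for each season present."""
    seasons = sorted({row["season"] for row in shots})
    for held_out in seasons:
        train = [r for r in shots if r["season"] != held_out]
        test = [r for r in shots if r["season"] == held_out]
        yield held_out, train, test

def drop_incomplete_games(shots, required=("defender_dist", "touch_time")):
    """Remove every shot of a (player, game) pair that has any shot with a
    missing required feature, because the shot sequence would be broken."""
    broken = {
        (r["player"], r["game"])
        for r in shots
        if any(r.get(field) is None for field in required)
    }
    return [r for r in shots if (r["player"], r["game"]) not in broken]

# Tiny illustrative dataset with made-up values.
rows = [
    {"season": 2013, "player": "A", "game": 1, "defender_dist": 4.2, "touch_time": 1.1},
    {"season": 2013, "player": "A", "game": 1, "defender_dist": None, "touch_time": 0.8},
    {"season": 2014, "player": "A", "game": 2, "defender_dist": 3.0, "touch_time": 2.0},
]
clean_rows = drop_incomplete_games(rows)
splits = list(leave_one_season_out(rows))
```

Each split trains on every season except the held-out one, so predictions for a season never come from a model that saw that season.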
The above process provides us with a sequence of n shots $S \in \{M, X\}^n$ and an associated shot make probability vector $P \in \mathbb{R}^n$, which we term the shot history vector, where $p_i = \Pr[s_i = M]$. Given that the shots are not identical, we cannot simply permute the outcomes of the shots. However, we can repeatedly simulate the heterogeneous Bernoulli process describing the shot sequence to obtain $\Pr[M \mid \underbrace{M \ldots M}_{k\ \text{times}}]_{\text{sim}}$ and examine whether there is enough statistical evidence to reject the null hypothesis of no hot hand. In this case the null states that the probability of observing a make following k consecutive makes in the data is the same as the one expected if we only considered the quality of the shot, i.e., independently of the outcome of the previous shots.
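A minimal sketch of this simulation is given below. It is one possible reading of the procedure, not the authors' code: it redraws every shot of every game from its own make probability and pools, across games, the shots that follow k straight makes. The function names and the per-game list layout are assumptions.

```python
import random

def streak_followers(shots, k):
    """Outcomes of shots that come right after k consecutive makes in one game."""
    return [shots[i] for i in range(k, len(shots)) if all(shots[i - k:i])]

def conditional_make_rate(games, k):
    """Pooled Pr[M | k straight makes] over per-game 1/0 sequences.

    Streaks never span two games. Returns None if no shot qualifies.
    """
    followers = [y for game in games for y in streak_followers(game, k)]
    return sum(followers) / len(followers) if followers else None

def simulate_conditional_rate(game_probs, k, n_sims=100, seed=0):
    """Sample the heterogeneous Bernoulli process n_sims times.

    `game_probs` holds one list of shot make probabilities per game. Each
    simulated shot is independent of the previous ones, which is exactly
    the null hypothesis of no hot hand.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_sims):
        sim_games = [[1 if rng.random() < p else 0 for p in probs]
                     for probs in game_probs]
        rate = conditional_make_rate(sim_games, k)
        if rate is not None:
            rates.append(rate)
    return rates
```

The observed rate from `conditional_make_rate` on the real outcomes can then be compared with the distribution returned by `simulate_conditional_rate` on the matching probability vectors.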
Despite the overall good performance of the model in terms of predicted probabilities (Figure 1(B)), every model is associated with an error. Therefore, any difference observed between $\Pr[M \mid \underbrace{M \ldots M}_{k\ \text{times}}]_{\text{sim}}$ and $\Pr[M \mid \underbrace{M \ldots M}_{k\ \text{times}}]_{\text{data}}$ will include this error as well. In order to adjust for this, we also simulate for each player a random sequence of shots taken by the player and compare his field goal percentage expected from the model, $\Pr[M]_{\text{sim}}$, with his actual field goal percentage, $\Pr[M]_{\text{data}}$. The difference $\Pr[M]_{\text{data}} - \Pr[M]_{\text{sim}}$ provides an estimate of the error introduced by the model for this specific player and can be used to adjust the estimate of the hot hand effect size.
By applying the test described above to every individual player, we can identify whether there is statistical evidence supporting the hypothesis that they are "streaky shooters" (i.e., they get the hot hand). However, given that we perform multiple tests, one per qualified player, the probability of obtaining a small number of false positive results at a given significance level α increases. As a robustness check, we can estimate the probability of a specific number of these tests returning positive results purely by chance. In particular, under the assumption, realistic in our case, that our tests are not correlated, we can use the Binomial distribution for a meta-test. With M tests, each of which has a probability α of leading to a false positive result, we can estimate the probability of observing at least r positive tests due to chance as $\sum_{p=r}^{M} \binom{M}{p} \alpha^{p} (1 - \alpha)^{M-p}$. If this probability is low, it increases our confidence that the hot hand instances observed are not (all) false positives. This is indeed the case in our setting (see Results section).
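The Binomial meta-test is a direct upper-tail sum and can be evaluated with the standard library. The function name below is an assumption; the formula is the one stated above, under the same assumption of uncorrelated tests.

```python
from math import comb

def prob_at_least(r, n_tests, alpha):
    """Probability of at least r false positives among n_tests independent
    tests, each with false positive probability alpha (Binomial upper tail)."""
    return sum(
        comb(n_tests, p) * alpha ** p * (1 - alpha) ** (n_tests - p)
        for p in range(r, n_tests + 1)
    )

# Figures reported in the Results section: 153 tests, at least 24 positives,
# 5% significance level.
tail = prob_at_least(24, 153, 0.05)
```

A small value of `tail` means that observing that many positive tests purely by chance is unlikely.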

Results
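For our results we start by estimating from game data the probability of a player π making a shot i, conditioned on the previous shot they attempted (in the same game) being made as well, Pr [...] 3, 4} (to simplify and make the notation more compact, in what follows we use $\Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{data}}$ to refer to this probability). We then compare this probability with the probability $\Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{sim}}$ we should have expected based on the quality of the shot(s) player π took, under the assumption that the shot sequence consists of independent but not identically distributed shots. The difference $e_{\pi,k} = \Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{data}} - \Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{sim}}$ is the hot hand effect size for player π for a streak of k consecutive makes. As described in the previous section, in order to estimate $\Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{sim}}$ we develop a shot make probability model based on a variety of contextual information (e.g., distance from the basket, distance of closest defender, shot type etc.). The sequence of shots is thus a non-homogeneous Bernoulli process B(n, P), where P is the shot history make probability vector for the sequence. Sampling this process 100 times for each player provides us with the distribution for $\Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{sim}}$. However, as mentioned above, there is an error associated with the shot make probability model, which affects the estimation of $\Pr[M_\pi \mid \underbrace{M_\pi \ldots M_\pi}_{k\ \text{times}}]_{\text{sim}}$. To adjust for this error we simulate a randomly selected set $S_{n,\pi}$ of player π's shots, regardless of the success or not of his previous attempt(s), and estimate the model error for π as $\varepsilon_\pi = \Pr[s_i = M]_{\text{data}} - \Pr[s_i = M]_{\text{sim}}$, $s_i \in S_{n,\pi}$. The adjusted hot hand effect size for player π is then $\hat{e}_{\pi,k} = e_{\pi,k} - \varepsilon_\pi$.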
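The error adjustment can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, and the random sample of unconditioned ("null") shots is taken as given rather than drawn here.

```python
import random

def model_error(null_outcomes, null_probs, n_sims=100, seed=0):
    """Estimate epsilon = Pr[s = M]_data - Pr[s = M]_sim on a sample of shots
    that is not conditioned on previous outcomes.

    `null_outcomes` are the observed 1/0 results and `null_probs` the model's
    make probabilities for the same shots.
    """
    rng = random.Random(seed)
    observed_rate = sum(null_outcomes) / len(null_outcomes)
    sim_rates = []
    for _ in range(n_sims):
        makes = sum(1 for p in null_probs if rng.random() < p)
        sim_rates.append(makes / len(null_probs))
    return observed_rate - sum(sim_rates) / len(sim_rates)

def adjusted_effect(rate_data, rate_sim, epsilon):
    """Adjusted hot hand effect size: (data - sim) after k makes, minus epsilon."""
    return (rate_data - rate_sim) - epsilon

# Made-up numbers: the model expects 50% on shots that went in 60% of the time,
# so it underestimates the shooter and epsilon comes out positive.
eps = model_error([1] * 60 + [0] * 40, [0.5] * 100)
```

A positive `eps` shrinks the raw effect size, which matches the observation that an underestimating model would otherwise inflate the hot hand effect.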
To reiterate, we use both seasons in our dataset, but when estimating the shot make probability history vectors we ensure these estimates are out-of-sample. Before looking at all the results, let us examine the data for a specific player, Kemba Walker (KW), as an illustration of the estimation process. We start by calculating the probability of KW making a shot he takes after making his last k attempts (during the same game), i.e., $\Pr[M_{KW} \mid \underbrace{M_{KW} \ldots M_{KW}}_{k\ \text{times}}]_{\text{data}}$. We then simulate his shot sequence after k consecutive makes 100 times to obtain his hot hand effect size $e_{KW,k}$. Table 1 shows the observed field goal percentages and those expected from the shot make probability model after k consecutive makes. As we can see, KW performs better than expected after consecutive made shots. We then adjust for the model error by simulating a random sample of KW's shots (these are null shots, i.e., they are not conditioned on the result of his previous shots) and comparing their observed and expected field goal percentages as well. For each value of k, the number of shots we sample is equal to the number of shots available for estimating $\Pr[M_{KW} \mid \underbrace{M_{KW} \ldots M_{KW}}_{k\ \text{times}}]_{\text{data}}$. Given that for larger values of k the sample size is significantly reduced, we want our model error estimate to account for the underlying uncertainty. For reference, for k = 1 our dataset includes 333 shots that KW took after a single make, while the sample size for the other values of k is 138, 62 and 30 shots respectively. While the estimates from the null shot samples are closest to the observed field goal percentage for k = 1, for all values of k the model underestimates Walker's probability of making a shot. Therefore, not adjusting for the model error would lead to overestimation of his hot hand effect. Finally, the last column of Table 1 presents the adjusted hot hand effect size for KW, where the p-values correspond to a one-sided t-test with $H_0: \hat{e}_{KW,k} = 0$, $H_1: \hat{e}_{KW,k} > 0$. As we can see, Kemba Walker appears to "heat up" more the more consecutive shots he makes.
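The one-sided test on the adjusted effect size can be illustrated as below. Two assumptions are made here and should be read as such: the test is applied to the adjusted effect sizes obtained from the individual simulation runs, and the p-value uses a normal approximation to the t-distribution (reasonable for about 100 runs) because the Python standard library has no t-distribution. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def one_sided_t(effects):
    """t statistic for H0: mean effect = 0 against H1: mean effect > 0.

    Returns (t, approximate p-value). The p-value is taken from the standard
    normal upper tail, an approximation to the t-distribution that is close
    when the number of simulated effects is large.
    """
    n = len(effects)
    t_stat = mean(effects) / (stdev(effects) / sqrt(n))
    p_approx = 1.0 - NormalDist().cdf(t_stat)
    return t_stat, p_approx

# Made-up adjusted effect sizes from 100 simulation runs, centred on +5%.
demo_effects = [0.05 + 0.01 * ((i % 5) - 2) for i in range(100)]
t_stat, p_approx = one_sided_t(demo_effects)
```

For small numbers of runs, an exact t-distribution (for example from a statistics package) would be the safer choice.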
We performed the same process for all 153 players that took at least 1,000 shots over the two seasons covered by our data. Table 2 presents the league-wide results from the Binomial meta-test described in the previous section (at the 5% significance level). As we can see, for all values of k our analysis identifies at least 24 players that exhibit the hot hand effect. The probability of getting at least as many false positives is less than 1 in 1,000,000. Simply put, we are confident that the majority of these are true positives. The average adjusted hot hand effect size, weighted by the sample size available for each player, is also shown in the same table. As we can see, it ranges from approximately 1.5% for k = 1 to about 5.8% for k = 4. In general, it appears that the effect size increases with the number of consecutive makes. However, we have to put these results into context. For k = 4 the average sequence length we have for each player is around 33 shots. Thus, our test can be underpowered and unable to identify small hot hand effects. Therefore, the average adjusted effect size among the players that exhibit the hot hand can be inflated.
We also calculate the average adjusted effect size over all 153 players, regardless of whether they provided evidence of the hot hand or not (last column of Table 2). As we can see, this effect size is negative! In essence, overall as a league, players regress to their mean/expected shooting percentages after consecutive makes. Our takeaway from these results is that while shooting regression appears to be the stronger force over the whole league, there are individual players that exhibit statistically significant streakiness with varied effect sizes.
In order to further examine the importance of removing the assumption of identically distributed trials from the analysis, we compute the simulated probabilities by simply permuting the shots taken by a player within a game. This is essentially the same test as the one introduced in 6 to account for the streak selection bias. When ignoring the shot quality, none of the qualified players exhibits the hot hand! This further supports our hypothesis that using statistical tests that rely on the identically distributed assumption can lead to severe underestimation of the phenomenon in the wild. We would like to emphasize that our objective with this experiment is not to compare the conclusions obtained from the two methods, but rather to show the importance of considering the probability of each individual trial when this is not constant. The permutation test described by Miller and Sanjurjo 6 is appropriate when examining a series of trials that are identically distributed (e.g., free throws or three-point contest shots). Furthermore, in their studies on controlled shooting (both actual three-point contests 19 and a shooting field experiment 20), Miller and Sanjurjo also examine whether a hot hand is activated across the different contest rounds. They achieve this by considering sequences of consecutive makes that might span different competition rounds. Stratifying the data into more granular groups was shown to reduce the statistical power of the permutation test, which in our case essentially means that we might be underestimating the phenomenon, since we are using one such granular group (i.e., individual games).

Discussion
We do not anticipate that our study will end the hot hand debate, nor is that our goal, but we believe that it will steer the discussion and analysis more towards the real-world situations where the phenomenon might arise, rather than controlled, lab environments. Our results from two seasons of shooting data indicate that overall the league is subject to shooting regression, i.e., players shoot below expectation after consecutive makes, thus regressing towards their shooting average. However, there are players that individually exhibit strong statistical evidence for the presence of the hot hand. An important piece of context for the hot hand analysis in actual game situations is that the presence of the hot hand does not necessarily have to do with what fans might have in mind when they talk about a "player getting hot". It can simply be the ability of specific players to hunt and exploit good matchups within a game, leading to a streak of successful shots.
While not the focus of our study, identifying the underlying mechanisms that might be responsible for the phenomenon is the natural next question to ask. For example, can we tailor our preparation in order to trigger such a streak? Is there a physical or mental mechanism that can (predictably) lead to a shooting streak in basketball? Is this streak a result of short-term neuroplasticity? Can focus and mental preparation potentially trigger this mechanism? Or is the majority of the effect due to players being able to exploit mismatches, as alluded to above? These questions, while not answerable with the observational data we have from success/failure sequences, are crucial if we are to gain actionable insights about streaks of success. Furthermore, as far-fetched as these hypotheses might sound, they can be plausible if streaks are not random permutations of binary outcomes. Understanding streak patterns (or the lack thereof) can have important implications for decision making in areas beyond sports, including investing, trading and purchasing behavior 21,22. Of course, we expect that these mechanisms, if they exist, will be different for different areas, and the appropriate context needs to be considered as well.
One of the main contributions of this study is the elimination of the assumption that trials within a sequence need to be identical. This is crucial, particularly if one wants to study streaks of success in a natural, not lab-like, environment. While there is literature that has examined streaks in real environments such as career trajectories and professional success 11, it still makes the implicit assumption that each career "trial" has the same probability of success. For example, when exploring the success of a director's movies, not all movies have the same probability of being hits. A variety of factors, ranging from cast and budget to the timing of the release 23, can factor into the success of a movie. The framework introduced in this report is applicable in a general setting, as long as there is a model for the probability of success of a trial given appropriate contextual information. More importantly, we include in our estimation an adjustment for the error associated with the model for the probability of success of a single trial.
It is worth noting that while the majority of the hot hand literature uses approaches based on the combinatorics of binary sequences (such as runs tests and, more recently, permutation tests), regression analysis has also been used to study the phenomenon. In particular, Bocskocsky et al. 24 also deal with the hot hand in NBA games and build a regression model to estimate the probability of scoring the next shot, including as a covariate the player's "heat". The latter is defined as the difference between the actual and the expected success rate over the past k shots. This means that a player can be "hot" (to some degree) even if he missed half of his last k shots. Many might argue that this is not a hot player on a streak, and while there is no right or wrong way to define what constitutes a hot hand, we choose to build a framework applicable to success/failure sequences, which is the typical setting in the hot hand literature. Furthermore, from a technical standpoint, an ordinary least squares regression was used in that study to model a binary variable, instead of the more appropriate logistic regression. This violates the constant variance assumption of linear regression. Consequently, the t-statistic might not truly follow a t-distribution, leading to biased estimation of the corresponding p-values. Finally, the regression model is also associated with an error, and the hot hand effect size needs to be adjusted based on the regression residuals as well. Focusing on the shot make probability model, the main difference from 24 is that we treat it as a classification problem, while Bocskocsky et al. 24 treat it as a linear regression. In terms of inputs the two models are very similar. Some variables used in 24, such as an indicator of whether the shooter was double-teamed or not, were not available in our dataset. These models can be improved by using information on the actual trajectory of the ball post-release. In particular, Daly-Grafstein and Bornn 17 make use of features such as the entry angle to the hoop and the entry location in the hoop, which improve prediction performance. However, these features are not available to us and, more importantly, even if they were, they would not be particularly useful for our objective. Our goal is to estimate the shot make probability at the moment of release. Of course, these features are extremely useful, as shown in 17, for estimating the shooting ability of players with fewer observations.
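The constant-variance point made above about fitting ordinary least squares to a binary outcome can be written out explicitly. The notation below ($y_i$, $x_i$, $\beta$, $u_i$, $p_i$) is introduced here for illustration only and does not appear in the cited study.

```latex
% Linear probability model fitted by OLS to a binary outcome y_i \in \{0, 1\}
y_i = x_i^\top \beta + u_i, \qquad \Pr[y_i = 1 \mid x_i] = x_i^\top \beta \equiv p_i

% The error variance depends on x_i through p_i, so it is not constant
\operatorname{Var}[u_i \mid x_i] = p_i (1 - p_i)

% Logistic regression models the log-odds instead and keeps p_i inside (0, 1)
\log \frac{p_i}{1 - p_i} = x_i^\top \beta
```

Because $p_i (1 - p_i)$ changes from shot to shot, the usual OLS standard errors, and the p-values built on them, are not reliable without a correction.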
In conclusion, while we are clearly not the first, or the last, to study the topic of the hot hand, we believe that our report contributes to the existing literature in the following ways. First, building on Miller and Sanjurjo's work 6, we introduce an empirical hypothesis test that can analyze binary sequences of independent but not identically distributed trials. Second, we account for model errors associated with the estimation of the success probability of a trial. More importantly, our results are focused on a setting that has not been widely examined before (i.e., actual game situations).

Figure 1. (A) The permutation test 6 removes the streak selection bias in small sequences. (B) In an actual game environment we need to consider the shot quality. Our shot make probability model outputs accurate out-of-sample probabilities.

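Table 1. Kemba Walker shows significant levels of streakiness, even after adjusting for errors in the shot make probability
Column headers: k; $\Pr[M \mid \underbrace{M \ldots M}_{k\ \text{times}}]_{\text{data}}$; $\Pr[M \mid \underbrace{M \ldots M}_{k\ \text{times}}]_{\text{sim}}$; $\Pr[s_i = M]_{\text{data}}$; $\Pr[s_i = M]_{\text{sim}}$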

Table 2. While overall the league exhibits shooting regression (last column), there are players that have the hot hand (HH) to different degrees (second column). Out of the 153 qualified players for our statistical tests, at least 24 players provided statistical evidence for streakiness at the 5% significance level for all values of k examined. The probability of obtaining at least 24 false positives at the 5% level is less than 1 in 1,000,000.