The unbiased estimation of the fraction of variance explained by a model

The correlation coefficient squared, r², is commonly used to validate quantitative models on neural data, yet it is biased by trial-to-trial variability: as trial-to-trial variability increases, measured correlation to a model's predictions decreases. As a result, models that perfectly explain neural tuning can appear to perform poorly. Many solutions to this problem have been proposed, but no consensus has been reached on which is the least biased estimator. Some currently used methods substantially overestimate model fit, and the utility of even the best performing methods is limited by the lack of confidence intervals and asymptotic analysis. We provide a new estimator, r̂²_ER, that outperforms all prior estimators in our testing, and we provide confidence intervals and asymptotic guarantees. We apply our estimator to a variety of neural data to validate its utility. We find that neural noise is often so great that confidence intervals of the estimator cover the entire possible range of values ([0, 1]), preventing meaningful evaluation of the quality of a model's predictions. This leads us to propose the use of the signal-to-noise ratio (SNR) as a quality metric for making quantitative comparisons across neural recordings. Analyzing a variety of neural data sets, we find that up to ∼40% of some state-of-the-art neural recordings do not pass even a liberal SNR criterion. Moving toward more reliable estimates of correlation, and quantitatively comparing quality across recording modalities and data sets, will be critical to accelerating progress in modeling biological phenomena.


Introduction
Building an understanding of the nervous system requires the quantification of model performance on neural data, and this often involves computing Pearson's correlation coefficient between model predictions and neural responses. Yet this typical estimator, r̂², is fundamentally confounded by the trial-to-trial variability of neural responses: a low r̂² could be the result of a poor model or of high neuronal variability.
One approach to this problem is to average over many repeated trials of the same stimulus in order to reduce the influence of trial-to-trial variability. With a finite number of trials, this approach will never wholly remove the influence of noise and its confounding effect; moreover, collecting additional trials is expensive. A more principled approach has been to account for trial-to-trial variability in the estimation of the fraction of explainable variance, or r². Most often, this takes the form of attempting to estimate what the r² would have been in the absence of trial-to-trial variability. Here we call this quantity r²_ER, the r² between the model prediction and the expected response (ER) of the neuron (i.e., the 'true' mean, or expected value, of the estimated tuning curve). While a variety of solutions have been proposed to estimate this quantity [1-10], they have not been quantitatively compared; thus, there is no basis for a consensus on which methods are appropriate or, more importantly, inappropriate. We find that several estimators still in recent use have large biases. The estimators that did have relatively small biases lack associated confidence intervals, so the degree of uncertainty in these sometimes highly variable estimates remains ambiguous. Finally, none of these methods has been analyzed asymptotically to give a theoretical guarantee of convergence to r²_ER, i.e., it has not been shown that they are consistent estimators.
To address these substantial problems, we introduce r̂²_ER, a simple analytic estimator of r²_ER, along with a method for generating α-level confidence intervals. We validate our estimator in simulation, prove that it is consistent, and provide head-to-head comparisons with prior methods. We then demonstrate the use of r̂²_ER and its confidence interval on two sets of neural data. We find many cases where the neuronal data are so noisy that estimates of r²_ER provide little inferential power about the quality of a model fit. This naturally leads to a useful metric of the quality of a neuronal recording, which we refer to as the signal-to-noise ratio (SNR) and which can be interpreted in terms of the number of trials needed to reliably detect tuning. Across a diverse set of neural recordings, we find that many neurons do not pass even a liberal criterion for providing meaningful insight into the quality of a model fit.
We demonstrate the estimator and its confidence intervals on direction tuning in area MT and on responses to natural images in area V4. Finally, we develop an estimator, SNR̂, based on the signal-to-noise ratio (SNR), as a metric to determine the inferential power of a given neuronal recording.

Bias of r̂² and its correction
Consider a typical scenario in sensory neuroscience where the responses of a neuron to m stimuli across n repeated trials of each stimulus have been collected, and the average of these responses, the estimated tuning curve (Fig 1, dashed green line), is compared to responses predicted by a model (red line). These responses could be spike counts from a neuron but could just as well be any other neural signal. Even if the m expected values of the neuronal response, μ_i (solid green trace), perfectly correlate with the model predictions, ν_i (red trace is scaled and shifted relative to green), the m sample averages, Y_i (dashed green trace), will deviate from their expected values owing to the sample mean's variability. Here, we quantify this variability using the variance, σ², of the distribution of responses from trial to trial (see Methods, "Assumptions and terminology for derivation of unbiased estimators"). We assume σ² is constant across responses to different stimuli, which can be achieved by applying a variance stabilizing transform to the data. The variance of the sample mean will thus be σ²/n for all stimuli. Owing to the variance of the sample mean, the reported r̂² can be appreciably less than 1 even though the r² between the underlying expected values of the neuronal response and the model is 1.
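The square-root transform is the standard variance-stabilizing choice for Poisson-like spike counts. A quick numerical check (an illustration of ours, not from the paper) shows that raw count variance tracks the mean rate, while the variance of the square-rooted counts sits near 0.25 for moderate-to-high rates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Variance of raw Poisson counts grows with the mean rate, but after the
# square-root transform the trial-to-trial variance is ~0.25 for moderate
# and high rates (the approximation degrades at very low rates).
stabilized = {}
for lam in (5.0, 20.0, 50.0):
    counts = rng.poisson(lam, size=200_000)
    stabilized[lam] = (counts.var(), np.sqrt(counts).var())
    print(lam, stabilized[lam])
```

This is why σ² ≈ 0.25 is used below as the trial-to-trial variance of a transformed Poisson response.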
The quantity we attempt to estimate in this paper is the r² between the model predictions (ν_i) and the expected neuronal responses (μ_i). We will call this quantity r²_ER, the fraction of variance of the 'expected response' explained by the model:

r²_ER = [Σ_i (ν_i − ν̄)(μ_i − μ̄)]² / [Σ_i (ν_i − ν̄)² · Σ_i (μ_i − μ̄)²]    (1)

PLOS COMPUTATIONAL BIOLOGY
We will show that the naive sample estimator, which uses Y_i in place of μ_i, has an expected value that can be well approximated as the ratio of the expected values of its numerator and denominator as follows (for asymptotic justification see Methods, "Inconsistency of r̂² in m"):

E[r̂²] ≈ [ (Σ_i (ν_i − ν̄)(μ_i − μ̄))² + (σ²/n) Σ_i (ν_i − ν̄)² ] / [ Σ_i (ν_i − ν̄)² ( Σ_i (μ_i − μ̄)² + (m − 1) σ²/n ) ]    (2)

While the terms on the left in the numerator and denominator are the same as in r²_ER, the terms on the right are proportional to the trial-to-trial variability (σ²) and cause r̂² to deviate from r²_ER. This is the essential problem: r̂² is biased away from r²_ER by terms proportional to the amount of variability, σ²/n, in the estimated responses. The strategy we take to solve this problem is straightforward: find unbiased estimators of these noise terms and subtract them from the numerator and denominator of Eq 2 for r̂², thus:

r̂²_ER = [ (Σ_i (ν_i − ν̄)(Y_i − Ȳ))² − (σ̂²/n) Σ_i (ν_i − ν̄)² ] / [ Σ_i (ν_i − ν̄)² ( Σ_i (Y_i − Ȳ)² − (m − 1) σ̂²/n ) ]    (3)

where σ̂² is an unbiased estimator of the trial-to-trial variability, after a variance stabilizing transform if necessary. Typically σ̂² is the pooled sample variance across repeats, but not necessarily; for example, if stimuli are shown only once (n = 1), an assumed value of trial-to-trial variability can be substituted for σ̂². The numerator and denominator of the fraction r̂²_ER are unbiased estimators of the numerator and denominator of r²_ER; this solution is therefore approximate, since the expected value of a ratio is not in general the ratio of the expected values of its numerator and denominator (see Methods, "Bias of r̂²_ER"). Yet we show in simulation that the approximation is very good for typical neural statistics, and we show analytically that, unlike r̂², our estimator r̂²_ER converges to the true r²_ER as the number of stimuli m → ∞ (see Methods, "Consistency of r̂²_ER in m"). We next evaluate this estimator in simulation.
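The corrected estimator translates directly into a few lines of code. The following is a minimal sketch of the construction described above (function and variable names are ours, for illustration; the pooled sample variance across repeats is assumed as σ̂²):

```python
import numpy as np

def r2_er_hat(Y, nu):
    """Sketch of the corrected estimator of r^2_ER.

    Y:  responses, shape (m stimuli, n repeats)
    nu: model predictions, length m
    The noise terms proportional to sigma^2/n are subtracted from the
    numerator and denominator of the naive r^2 before forming the ratio.
    """
    m, n = Y.shape
    sig2_hat = Y.var(axis=1, ddof=1).mean()     # pooled trial-to-trial variance
    nu_c = nu - nu.mean()                       # centered predictions
    y_c = Y.mean(axis=1) - Y.mean()             # centered sample tuning curve
    num = np.dot(nu_c, y_c) ** 2 - (sig2_hat / n) * np.sum(nu_c ** 2)
    den = np.sum(nu_c ** 2) * (np.sum(y_c ** 2) - (m - 1) * sig2_hat / n)
    return num / den
```

Note that, as discussed in the text, the estimate is deliberately not truncated to [0, 1], so individual estimates can exceed 1.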

Validation of r̂²_ER by simulation
To demonstrate the effectiveness and key properties of r̂²_ER, we ran a simulation with m = 362 stimuli, n = 4 repeats, and σ² = 0.25 (the trial-to-trial variance of a Poisson neuronal response after a variance-stabilizing transform; see Methods, "Assumptions and terminology for derivation of unbiased estimators"). This corresponds, for example, to a minimal experiment to characterize shape tuning in V4 neurons, which requires hundreds of unique shapes and takes on the order of 1 hour [2]. In the case where the model prediction (ν_i) and expected response (μ_i) were perfectly correlated (as in Fig 1) and SNR was moderate at 0.5, the distribution of the naive estimator, r̂², is centered well below 1 (Fig 2A, blue). Thus, the model appears to be a poor fit to data that it in fact generated, indicating that r̂² is a poor estimator of r²_ER. On the other hand, the distribution of our corrected estimator, r̂²_ER, is appropriately centered at 1 (Fig 2A, orange). Approximately 50% of the time our estimator exceeds 1, taking on values that are impossible for r²_ER ∈ [0, 1], but this is necessary to achieve unbiased estimates for high r²_ER because truncating the values would shift the mean below 1.
We evaluated the estimators r̂² and r̂²_ER at five values of r²_ER (0, 0.25, 0.5, 0.75, 1) and plotted the mean and 90% quantiles. Fig 2B shows that r̂² (blue line) grossly underestimates r²_ER (black line) at all levels except r²_ER = 0, whereas r̂²_ER (orange line) correctly estimates the true correlation r²_ER (orange and black lines overlap). Thus the estimator r̂²_ER performs favorably in this simulation. We ran similar simulations on the square root of Poisson distributed spike counts and found similar results for both low and high average firing rates. Next, we characterize the performance of r̂²_ER relative to r̂² in simulations that cover a wide range of the parameters m, n, and SNR.

Asymptotic properties of r̂²_ER and r̂²
We ran simulations to determine the bias and variance of r̂²_ER relative to r̂² as a function of the parameters SNR, n, and m. Fig 3A shows that as SNR increases, r̂² (blue) and r̂²_ER (orange) converge (r²_ER = 0.75, n = 4, m = 362). Thus, for neuronal recordings where variation in response strength across stimuli is large relative to trial-to-trial variability, these two estimators should have similar values. At low values of SNR, e.g., 0.1, r̂² has a large downward bias, whereas r̂²_ER has a small upward bias relative to its own variability and to the bias of r̂² (for the source of this bias see Methods, "Bias of r̂²_ER"). This small upward bias of r̂²_ER quickly diminishes as SNR increases, whereas the large negative bias of r̂² remains across a much wider range of SNR. The essential problem this simulation reveals is that if SNR varies widely from neuron to neuron, the bias in the naive estimate will cause apparent variation in r² across neurons that depends on SNR and not on the underlying tuning curve. Neuronal SNR is not typically under experimental control, making this problem difficult to avoid.
The number of repeats, n, is under the experimenter's control but is expensive to increase. Fig 3B shows how r̂² and r̂²_ER converge as n increases. Thus the bias in r̂² can be reduced by increasing the number of repeats, but achieving this at low SNR requires a very large number of repeats. An advantage of r̂²_ER is that, even for low n, it on average estimates the true correlation to the model (orange trace overlaps black trace, Fig 3B), providing a large gain in total trial efficiency for estimating the quality of model fit.
When increasing the number of stimuli, m, unlike the previous two cases, r̂² and r̂²_ER do not converge to the same value (Fig 3C). While the variability of both estimators decreases (90% quantiles narrow), it is clear in simulation that r̂² is not a consistent estimator of r²_ER in m, since it does not converge to r²_ER = 0.75. While there is a small upward bias of r̂²_ER for low m, this bias is reduced as m increases (see Methods, "Consistency of r̂²_ER in m").

Comparison to prior methods
Accounting for noise when interpreting the fit of models to neural data has been examined and applied in the neuroscientific literature for some time [1-10]. Several studies have followed the approach of attempting to estimate the upper bound on the quality of fit of a model given noise and then referencing the measure of fit to this quantity. Roddey et al. [1] estimate this upper bound by computing their measure of model fit, 'coherence', across split trials, then plotting the coherence of the data with the model predictions relative to the split-trial coherence. Yamins et al. [7] normalize r² by the split-trial correlation transformed by the Spearman-Brown prediction formula (averaged across randomly resampled subsets of trials); we will call this r̂²_norm-split-SB. Hsu et al. [4] also use split-half correlation (averaged across randomly resampled subsets of trials) to estimate an upper bound (CC_max) via a transformation they derive that attempts to estimate the correlation of the true mean with the firing rate of the neuron. For the purpose of comparison, we square this estimator and call it CC²_norm-split. Schoppe et al. [8] improve upon this method by giving an analytic form, thus removing the need for resampling. They do this by using the 'signal-power' (SP) estimate developed by Sahani and Linden [3]; thus we call their estimator CC²_norm-SP. Kindel et al. [10] take inspiration from Schoppe et al., except that to estimate CC_max they measure the correlation of responses from a Gaussian simulation (based on the sample mean and variance of the neural data) with the sample mean. We square their estimator and call it CC²_norm-PB (PB for parametric bootstrap). Pasupathy and Connor [2] estimate the fraction of total variance accounted for by trial-to-trial variability, intuitively the fraction of unexplainable variance, and then use it to normalize r̂². We call this estimator r̂²(1 − SE²/SS_total). With a similar motivation, Cadena et al.
[9] provide a metric they call "fraction explainable variance explained" (FEVE). They form the ratio of the mean squared prediction error over the total variance of the response (after subtracting an estimate of trial-to-trial variability from both) and subtract this ratio from one. While all of these methods may be intuitively appealing, the quantities to which they converge, and their relationships to r²_ER, are unclear. Unlike the above approaches, we follow a line of research [3,6] that explicitly attempts to construct an unbiased estimator of r² in the absence of noise (see Methods, "Prior analytic methods of estimating r²_ER"). Heretofore, many of the methods reviewed above had not been quantitatively validated, and none had been directly compared. We now compare all these methods with respect to estimating r²_ER. We exclude David and Gallant [5] from this comparison because their method depends on a large number of repeated trials, at which point the estimator's utility decreases.
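To make one family of prior methods concrete, the split-half plus Spearman-Brown construction can be sketched as follows (details, such as whether the reliability ceiling is squared before normalizing, differ across papers; this is our illustrative version, not the exact procedure of [7]):

```python
import numpy as np

def norm_split_sb(Y, pred, n_splits=100, seed=0):
    """Sketch of split-half normalization with the Spearman-Brown correction.

    Y:    responses, shape (m stimuli, n repeats)
    pred: model predictions, length m
    The model's r^2 to the mean response is divided by an estimated
    reliability ceiling from split-half correlations.
    """
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    r2_model = np.corrcoef(Y.mean(axis=1), pred)[0, 1] ** 2
    ceilings = []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        a = Y[:, idx[: n // 2]].mean(axis=1)   # tuning curve from one half
        b = Y[:, idx[n // 2:]].mean(axis=1)    # tuning curve from the other
        r_half = np.corrcoef(a, b)[0, 1]
        # Spearman-Brown: predicted reliability after doubling the trials
        ceilings.append(2 * r_half / (1 + r_half))
    return r2_model / np.mean(ceilings) ** 2
```

As the comparison below shows, constructions of this family can overestimate r²_ER, so the sketch is for illustration rather than endorsement.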
We quantified the ability of all methods to estimate r²_ER in a simulation with n = 4 trials and m = 362 stimuli (see Methods, "Simulation procedure"). We sort the estimators by their performance (Fig 4, y-axis); the top estimators achieve means close to the true value, with r̂²_ER performing slightly, but significantly, better. SPE_norm and CC²_norm-SP are numerically identical in their performance and their trial-to-trial results. On the other hand, r̂²(1 − SE²/SS_total) and r̂²_norm-split-SB both overestimate r²_ER, and the naive estimator r̂², as expected, yields an underestimate (mean = 0.50). In addition, CC²_norm-PB underestimates r²_ER. When the true r²_ER is 0.5, we find similar results, where r̂²(1 − SE²/SS_total) and r̂²_norm-split-SB produce overestimates (0.63 and 1.04 on average, respectively) and the mean r̂² is 0.25. Thus, serious caution should be taken when interpreting these last two estimators. We applied these estimators to neural data from V4 fit by a deep neural network (see Methods, "Electrophysiological data") and found a similar pattern of results: the top six estimators give similar estimates to each other, r̂²(1 − SE²/SS_total) and r̂²_norm-split-SB tend to be greater than these estimators, and the estimators r̂² and CC²_norm-PB are lower (Fig 4C). We conclude that r̂²_ER is as good as any available estimator of r²_ER, has a simple analytic form, and, in contrast to U, can be calculated without the sample variance, for example, when no repeats are collected and the variance must be assumed (see Discussion, "Relationship to prior methods"). None of the top five prior estimators we reviewed have associated confidence intervals, and thus we now provide confidence intervals for r̂²_ER.

Confidence intervals for r̂²_ER
In order to interpret point estimates such as r̂²_ER, it is important to be able to meaningfully quantify uncertainty about the estimate relative to the true parameter r²_ER. An α-level confidence interval (CI) provides an interval that will contain the true parameter α × 100% of the time for IID estimates. We considered three typical generic approaches to forming CIs for r̂²_ER: the non-parametric bootstrap, the parametric bootstrap, and BCa [11]. We found all three to be lacking because they did not achieve the desired α in simulations with ground truth. Motivated by these problems, we developed a novel Bayesian method. We first recount the issues we found with the bootstrap methods and then provide a basic account of the Bayesian method we use throughout the paper. For more detailed exposition, see Methods, "Quantifying uncertainty in the estimator".
The non-parametric bootstrap is a commonly used method to approximate CIs. In our case, it involves randomly resampling with replacement from the n trials in response to each of the m stimuli and then calculating r̂²_ER^(b) for the bootstrap sample. Repeating this many times allows the quantiles of the bootstrap distribution of r̂²_ER^(b) to be used as CIs. We applied this method across a simulated population of 3000 neurons with m = 40 and n = 4 and found it suffered from two problems. First, the CIs were not centered around r²_ER; specifically, the interval was too low (Fig 5A), with the upper and lower bounds of the interval (orange and blue traces, respectively) almost always falling below the true value (green). Second, as the true r²_ER increased from 0 to 1, CIs contained r²_ER at rates far lower than the desired α = 0.8 (Fig 6, cyan trace; open circles under the trace indicate a significant difference, p < 0.01, Bonferroni-corrected z-test). Thus, at practically all levels of correlation, the non-parametric bootstrap performs poorly. The problem is a result of the expected value of the empirical distribution (the sample mean) being typically much lower than r²_ER. To overcome this, we turned to the parametric bootstrap, where we could explicitly estimate r²_ER with our estimator r̂²_ER. This method approximates CIs by estimating the parameters of an assumed distribution from which samples are generated. In our case, it involves estimating σ², r²_ER, and the variance of the neuronal tuning curve, d² (see Results, "Signal-to-noise ratio as recording quality metric"), and then simulating observations from the distribution with these parameters to calculate r̂²_ER^(PB).
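For concreteness, the non-parametric percentile bootstrap described above can be sketched as follows (a minimal illustration; the helper names are ours, the estimator repeats the corrected form from earlier, and, as the text shows, this interval undercovers and serves only as a baseline):

```python
import numpy as np

def r2_er_hat(Y, nu):
    # corrected estimator of r^2_ER (repeated here so the sketch is self-contained)
    m, n = Y.shape
    sig2 = Y.var(axis=1, ddof=1).mean()
    nu_c = nu - nu.mean()
    y_c = Y.mean(axis=1) - Y.mean()
    num = np.dot(nu_c, y_c) ** 2 - (sig2 / n) * np.sum(nu_c ** 2)
    den = np.sum(nu_c ** 2) * (np.sum(y_c ** 2) - (m - 1) * sig2 / n)
    return num / den

def percentile_ci(Y, nu, alpha=0.8, n_boot=2000, seed=0):
    """Non-parametric percentile bootstrap CI: resample the n trials of each
    stimulus with replacement and take quantiles of the re-estimates."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    boot = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=(m, n))     # per-stimulus trial resample
        boot[b] = r2_er_hat(np.take_along_axis(Y, idx, axis=1), nu)
    tail = (1 - alpha) / 2
    return np.quantile(boot, [tail, 1 - tail])
```

The parametric bootstrap replaces the trial resampling with draws from a fitted distribution; the ECCI method described next improves on both.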

We again use the sample quantiles as CI estimates. Fig 5B shows that this overcomes the main failure of the non-parametric bootstrap, but the method tended to be too conservative for low r²_ER values (Fig 6, red trace below 0.8 at left) and too liberal for high values (red trace above 0.8 at right). Deviations such as these are well known for bootstrap percentile methods when the variance is a non-constant function of the mean and/or the distribution of the estimator is skewed [11]. A correction to the bootstrap, the bias-corrected and accelerated bootstrap (BCa), can help ameliorate these issues by implicitly approximating the skewness and the mean-variance relationship from bootstrap samples. We employed BCa with our parametric bootstrap and found that performance improved relative to the parametric bootstrap (Fig 6, green trace closer to the desired α than red for low r²_ER) but still deviated from the desired α for low and high r²_ER. We aimed to create a CI with better α-level performance. To do this, we assumed uninformative priors on the parameters σ² and d² so that, conditioned on estimates of these parameters, we can draw from the distribution of r̂²_ER | r²_ER for an arbitrary r²_ER (see Methods, "Confidence intervals for r̂²_ER"). This allows us to compute the highest true r²_ER that would have given the observed r̂²_ER or a lower value in α/2 proportion of IID samples. We take this as the high end, r²_ER(h), of our CI. Similarly, we determine the low end, r²_ER(l), of the CI as the lowest r²_ER that produces a value greater than or equal to r̂²_ER in α/2 of the samples. In Methods we give conditions under which this procedure will provide α-level CIs (see Methods, "Confidence intervals for r̂²_ER"). In our simulations, this method consistently achieves the desired α at all levels of r²_ER (Fig 6, orange trace). We use this CI method, which we call the estimate-centered credible interval (ECCI), throughout the rest of the paper.

Application of estimator to MT data
We have shown in simulation that the use of r̂² introduces ambiguity as to whether a low correlation value is the result of trial-to-trial variability or of a poor model, whereas r̂²_ER removes this ambiguity. Here we demonstrate with neural data how this, in tandem with confidence intervals, allows investigators to distinguish between units that systematically deviate from model predictions and those that simply have noisy responses. We re-analyzed data from single neurons in the visual cortical motion area MT responding to dot motion in eight equally spaced directions [12,13]. A classic model of these responses is a single-cycle sinusoid as a function of the direction of dot motion, with free parameters for phase, amplitude, and average firing rate. We chose this MT data set as the first example application because it has a high number of repeats (n = 10) and a low-dimensional model, making it simple to visually inspect whether the neuronal tuning curves agree with the model predictions.
An example MT neuron direction tuning curve (Fig 7A, orange trace) has an excellent sinusoidal fit (blue trace), as reflected in its estimated r̂²_ER = 1.0. Furthermore, the short confidence interval ([0.99, 1.0]) quantifies the lack of ambiguity about the quality of the fit. On the other hand, the tuning curve of a neuron with r̂²_ER = 0.05 (Fig 7B) shows a clear systematic deviation from the least-squares fit. Here the tuning curve is double-peaked and thus largely orthogonal to any single-cycle sinusoid. It is important to notice that this neuron has far lower SNR (2.8 here vs. 20 for the example in A), as quantified by our estimator SNR̂ (Eq 16), which corrects for trial-to-trial variability (described below and defined in Methods, "Estimators of correction terms"). Thus, without r̂²_ER, there would be plausible doubt about whether the correlation was lower because of noise or because of systematic deviation. Furthermore, with low SNR it would be plausible that the estimate itself is noisy (Fig 3A), but the short confidence interval ([0.01, 0.11]) unambiguously characterizes the fit as systematically poor.
In some cases, neurons show little tuning for direction and thus have very low SNR over a set of directional stimuli. This in turn can cause r̂²_ER to give wild estimates (Fig 7C, SNR̂ = 0.05, r̂²_ER = 1.81). If we truncate the value to the nearest possible r²_ER = 1, we might be tempted to interpret this as a well-fit, direction-selective neuron. But the CI covers most of the interval of possible values ([0.3, 1]), making it clear that little information about r²_ER can be gleaned from this data. Extreme r̂²_ER values can themselves indicate when the estimator is unreliable, but even a reasonable-seeming value, for example r̂²_ER = 0.59 (Fig 7D), can be unreliable when SNR̂ is low (0.12). In this case, the confidence interval covers the maximal range ([0, 1]), indicating that the point estimate is unreliable. Thus, r̂²_ER and its associated confidence interval quickly and unambiguously show how well the model fits the MT data, avoiding the tiresome and unreliable process of judging each of the 162 neurons' fits by eye.
While we have shown that, to a good approximation, r̂²_ER is unbiased and its expected value is largely invariant to SNR, this is definitely not the case for the variance of the estimator: at low SNR, the variability of these estimates will spread out the distribution, so differences in quantiles must be interpreted carefully. When estimating population dispersion, conclusions may be confounded by the SNR-dependence of the variability of r̂²_ER. Comparing the naive r̂² to our unbiased r̂²_ER (Fig 9), the high-SNR units (red points) lie close to the diagonal. Thus, for these units, one could exchange the two estimates and come to similar general conclusions about model fits. The utility of r̂²_ER is that it removes ambiguity about whether trial-to-trial variability may be spuriously pushing fits down (black points). The interpretation of the naive estimator r̂² remains ambiguous for any given unit until it can be confirmed that the unit does not suffer from this issue.
The MT data considered here have relatively few stimuli and many repeats, but other experimental paradigms involve a larger number of stimuli and, consequently, fewer repeats. Below we apply r̂²_ER in these more challenging conditions.


Application of estimator to V4 data
The primate mid-level visual cortical area V4 is known to have complex, high-dimensional selectivity for visual inputs. To rigorously assess models of neuronal responses in areas like V4, validation needs to be performed on responses to a large corpus of natural images to ensure that models capture ecologically valid selectivity [14,15]. Thus, the number of unique stimuli, m, will be large at the expense of having relatively few repeats, n, and SNR can be low because stimuli are not customized to the preferences of a given neuron. These are the challenging conditions under which r̂²_ER avoids the major confounds of r̂². Here we estimate r̂²_ER and associated 90% confidence intervals for a model that won the University of Washington V4 Neural Data Challenge by most accurately predicting single-unit (SUA) and multi-unit activity (MUA) for held-out stimuli (see Methods, "Electrophysiological data"). Plotting r̂²_ER against r̂² (Fig 10A) shows that the corrected estimates are higher than the naive estimates (points lie above the diagonal line). Using r̂²_ER here is important because it provides confidence that the poor fit quality is not a result of noise and that the best-performing model often did not explain more than 50% of the variance in the tuning curve.
While we have examined r̂²_ER for individual recordings, it can also be useful to estimate the average quality of model fit across a population of neurons. Since the individual estimates are unbiased, the group average is also an unbiased estimate of the population mean r²_ER. We computed such group means for the single-unit and multi-unit V4 recordings (Fig 10B) and found that the model performed significantly better at predicting the responses of multi-unit activity (Welch's t-test, p = 0.005; MUA mean = 0.57, SUA mean = 0.35). If instead the naive r̂² were used, this finding could have been dismissed as the result of MUA having higher SNR and thus naturally higher r̂². As it stands, this interesting observation can be followed up to potentially gain insight into the structure of selectivity across multiple units recorded nearby in V4.
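The group comparison above is a standard Welch's t-test, which allows unequal variances and sample sizes between the MUA and SUA populations. A minimal sketch with made-up stand-in values (the real inputs would be the per-unit r̂²_ER estimates):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
# Hypothetical stand-in values for illustration only; the real inputs would
# be the r^2_ER estimates for the multi-unit (MUA) and single-unit (SUA) groups.
mua = rng.normal(0.57, 0.15, size=30)
sua = rng.normal(0.35, 0.15, size=40)
# equal_var=False selects Welch's t-test (no equal-variance assumption)
t_stat, p_val = ttest_ind(mua, sua, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_val:.3g}")
```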
Finally, this V4 data set provides a good example of how using r̂²_ER can allow testing a larger stimulus space, as predicted by the simulations above (Fig 3B). Fig 11 shows that with r̂²_ER (solid lines), on average, two repeats are enough to estimate the true correlation, whereas the naive estimator requires more repeats (higher n) to converge. For example, for recording 1 (red), r̂²_ER (solid line) on average predicts the same quality of model fit for two or more stimulus repeats, whereas even after six repeats the naive r̂² has not converged.

Signal-to-noise ratio as recording quality metric
We have shown above that correcting for bias in r² is important, but it is also critical to recognize when recordings are so noisy that they are effectively useless for evaluating a model. Here we demonstrate the use of the signal-to-noise ratio (SNR) as a quality metric to help make this determination. We define the SNR for a neuronal tuning curve to be the ratio of the variation in the expected response across stimuli to the trial-to-trial variability across repeats:

SNR = d²/σ², where d² = (1/m) Σ_i (μ_i − μ̄)²    (5)

where μ_i is the expected response to the ith stimulus and μ̄ = (1/m) Σ_i μ_i. For experimental data, we do not know the μ_i in Eq 5, and rather than substituting the sample estimates, Y_i, which would give an inflated estimate, we use an equation that corrects for trial-to-trial noise (SNR̂, Eq 16, Methods). The removal of this bias in our SNR metric allows for direct comparisons between studies with different numbers of repeats and amounts of trial-to-trial variability. This is appropriate because SNR is not a function of n or m; rather, it can vary across neurons, sets of stimuli, and recording modalities, as we show below.
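Eq 16 itself appears in Methods; the sketch below shows the natural construction of such a bias-corrected SNR estimate under our assumptions: the pooled sample variance estimates σ², and the sampling noise of the mean, σ̂²/n, is subtracted from the variance of the sample tuning curve before forming the ratio (a sketch in the spirit of Eq 16, not necessarily its exact form):

```python
import numpy as np

def snr_hat(Y):
    """Bias-corrected SNR sketch for responses Y, shape (m stimuli, n repeats).

    The variance of the sample-mean tuning curve overestimates the signal
    variance d^2 by sigma^2/n, so that term is subtracted before dividing
    by the trial-to-trial variance.
    """
    m, n = Y.shape
    sig2 = Y.var(axis=1, ddof=1).mean()           # trial-to-trial variance
    d2 = Y.mean(axis=1).var(ddof=1) - sig2 / n    # corrected signal variance
    return d2 / sig2
```

Note that, like r̂²_ER, this estimate is not truncated, so untuned neurons can yield small negative values by chance.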
We examined a diverse collection of neural data sets (see Methods, "Electrophysiological data") and found wide variation in SNR̂ both within and across the data sets (Fig 12A). At the low end, calcium imaging data from cortical neurons in area VISp of mouse responding to gratings (pink trace, N = 40,520 neuronal ROIs, [16]) had a median SNR of 0.01, while at the high end, MT neurons responding to dot motion [12] had the highest median SNR (N = 162). A stimulus protocol nearly identical to that used for the VISp Ca²⁺ data (pink and gray traces for gratings and natural images, respectively) was used to collect the Allen Institute Neuropixels electrode data [17] (purple and brown traces, N = 2,015); however, the Ca²⁺ data had a substantially lower SNR̂ (0.01 and 0.02) compared to the electrode data (SNR̂ 0.12 and 0.19), suggesting that this difference relates to the recording modality.
In the case of spiking neurons, SNR can be improved by increasing the stimulus duration and thus the spike counting window. Under the generally optimistic assumption that the spike rate stays constant within the counting window, and assuming the counting window could have been lengthened given experimental constraints, we can normalize SNR̂ across the data sets to the value it would have taken had all spike count windows been 1 second long (Fig 12B). Under these assumptions, this normalization allows us to examine SNR differences across studies with the counting-window length removed as a factor. We find this reduces the differences in SNR̂ across the spiking data sets (the six right-most traces); thus, the outstanding SNR̂ of the MT data set could potentially have been achieved in the other experiments had their spike count windows been longer. Still, of the spiking data, the Allen Neuropixels data has the lowest medians, so additional efforts to ameliorate low SNR (via the number of trials or stimulus choice) could be applied. (Fig 12 caption: Thick lines: V4 data, n ≈ 5, m ≈ 350. Thin lines: Allen Inst. data, n ≈ 50, m ≈ 120. The Allen Inst. data has two recording modalities: extracellular action potentials (spikes) on Neuropixels probes (NP) and two-photon calcium imaging (Ca). Both were recorded for the same stimuli: natural scenes and gratings (see Methods, "Electrophysiological data"). (B) Distribution of SNR̂ after normalization with respect to the duration of the spike counting window (traces for the calcium signal are not included). The normalization assumes that the original average spike rate applies to a 1 s counting window; if firing rates tend to decay over time, this will produce overestimates for recordings shorter than 1 s and underestimates for recordings longer than 1 s. https://doi.org/10.1371/journal.pcbi.1009212.g012) Furthermore, the assumption of a constant spike rate will hold to

PLOS COMPUTATIONAL BIOLOGY
The unbiased estimation of the fraction of variance explained by a model different degrees: neural responses can peak shortly after stimulus onset and then return close to baseline. Thus, different experimental conditions call for different standards for number of trials and stimulus duration to adequately characterize a tuning curve.
To give the estimated SNR concrete meaning, we suggest interpreting it in terms of the number of stimuli and repeats (m and n) needed to reliably detect stimulus modulation in an F-test. Specifically, for a given m and n we computed the minimal SNR required to achieve a high probability (β = 0.99) of rejecting the null hypothesis that the mean response to all stimuli is the same (see Methods, SNR relation to F-test and number of trials, Eq 22). We plot a color map of this minimal SNR as a function of m and n (Fig 13), where the diagonal grey contour lines indicate a fixed total number of trials (mn) for different m:n ratios. In general, as the total number of trials increases (moving perpendicular to the grey diagonals toward the upper right), the SNR required for reliable tuning curve estimates decreases. The SNR threshold is lower when n is favored over m for the same number of total trials, i.e., the iso-contours of the SNR threshold have steeper slopes than the grey diagonals.
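This criterion can be computed directly from the singly non-central F-distribution (cf. Methods, Eq 22). A minimal sketch, assuming SciPy is available (the function name and search bounds are ours):

```python
from scipy.stats import f, ncf
from scipy.optimize import brentq

def min_snr(m, n, alpha=0.01, beta=0.99):
    """Smallest SNR at which the F-test for stimulus modulation
    (m stimuli, n repeats) rejects at level alpha with probability beta."""
    dfn, dfd = m - 1, m * (n - 1)
    crit = f.ppf(1 - alpha, dfn, dfd)  # alpha-level criterion under the null
    # power under the singly non-central F with noncentrality m*n*SNR, minus beta
    gap = lambda snr: ncf.sf(crit, dfn, dfd, m * n * snr) - beta
    return brentq(gap, 1e-9, 1e3)      # bracketing root search
```

For example, a design with many total trials (m = 350, n = 5) yields a far lower SNR threshold than a small design (m = 8, n = 15), matching the qualitative pattern in Fig 13.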
On this map, we can locate points corresponding, roughly, to the m and n of the data sets in Fig 12. The three V4 data sets have about the same number of stimuli and repeats (arrow marked "V4", Fig 13) and thus require SNR ≈ 0.1 or greater, implying that from 3% to 23% of the V4 data does not pass the criterion (Fig 12A, red and green traces, respectively, define the endpoints). The MT data has the fewest total trials and thus the highest threshold, SNR ≈ 0.5, which leaves 10% of the neurons with poorly estimated tuning curves. If more stimuli had been used at the expense of fewer repeats, say n = 2 and m = 40, then only a quarter of the neurons would have exceeded the increased threshold of SNR > 1. The Allen Ca²⁺ and spike data sets both had similar m and n. Relative to the other data sets they had far more total trials and a greater number of repeats, thus the SNR criterion is substantially lower (SNR > 0.01). Still, for the Ca²⁺ data, ≈37% of the grating and ≈25% of the natural image data did not have reliable tuning (Fig 12A, thin pink and grey traces). The Allen spiking data, on the other hand, had much higher SNR, and thus more trials could have been spent on expanding the stimulus set and fewer on repeated presentation (Fig 12A, thin brown and purple traces).

We have shown that SNR can be employed as a simple metric with a concrete interpretation to judge data quality across different organisms, recording modalities, and brain regions, for the purpose of making comparative analyses and setting aside data that has little or no power. The expected distribution of SNR, based on prior data, can be taken into account when choosing n and m to achieve a criterion level of statistical power for an experiment. If SNR is high, recording time can be reduced by keeping n and m low, or a larger stimulus set can be explored by increasing m at the expense of n.

Summary
We have investigated the estimation of the correlation between a model prediction and the expected response of a neuron. Although it has long been established that trial-to-trial variability causes the classic estimator, Pearson's r̂², to underestimate correlation, there has been no direct comparison of prior methods to account for this confound. We found that some methods grossly overestimate correlation in high-noise conditions, and we built upon the best-performing method to derive a more generally applicable estimator, r̂²_ER, that performs as well as or better than prior methods. We analytically validated r̂²_ER by determining that it is a consistent estimator in the number of stimuli. We found in simulation that it had a small upward bias, but this was only appreciable at very high noise levels. None of the prior methods that we examined had validated confidence intervals, so we developed confidence intervals for r̂²_ER. Motivated by the failure of generic bootstrap methods to achieve satisfactory coverage, we developed a confidence interval method that outperformed them.
Applying our estimator to neural data, we demonstrated its essential value. In the case of MT recordings, it was able to unambiguously distinguish neurons for which a sinusoidal model was a good fit from those for which it was a poor fit specifically because of systematic deviation and not because of noise. The associated confidence intervals allowed the systematic identification of noisy recordings that served no practical use in assessing the fit of the model. Poor model fits caused by noise vs. those caused by systematic differences in selectivity have very different interpretations, yet the traditional r̂² does not differentiate them while r̂²_ER does. Application of the estimator to the winning UW Neural Data Challenge model, a deep neural network (DNN), provides the only validated assessment of state-of-the-art predictive model performance in V4. The estimator, along with its CIs, identified neurons that were challenging for the DNN and perhaps require a different modeling approach. It also validated the existence of single units that had nearly 50% of their variance explained, indicating that the DNN functionally captured a substantial part of what these units encode across natural images and thus could provide real insight into naturalistic V4 single-unit encoding. On a practical level, we showed how the estimator allows gains in trial efficiency, since it converges more rapidly than r̂² (Figs 3B and 11). This is important when many stimuli are needed to validate models of high-dimensional neural tuning.
Our tests on experimental data revealed that some neurons had confidence intervals covering the entire range of possible values, motivating us to propose the signal-to-noise ratio (SNR) as a metric of neural recording quality in the context of model fitting. We provide an unbiased estimator of SNR (Eq 16) and a practical interpretation: for a given number of stimuli and repeats, the SNR should be sufficient to reliably detect stimulus-driven response modulation on the basis of an F-test (Eq 22). Examining a variety of data sets, we found differences with respect to how the numbers of stimuli vs. repeats (m vs. n) were balanced, revealing how adjustments can be made on the basis of SNR to improve experimental efficiency. We also found large differences in SNR across data sets that are likely related to recording modality (e.g., Ca2+ imaging vs. electrode recording), which could be important for selecting appropriate experimental approaches and for guiding the refinement of current techniques to improve SNR.

Interpretation of r̂²_ER
We have introduced an estimator, r̂²_ER, and confidence intervals for the correlation between the true tuning curve of a neuron (its expected value across stimuli) and model predictions. In the context of sensory neurophysiology, we believe it is reasonable to think of r²_ER as reflecting solely how well a model explains a sensory representation. We justify this by the fact that r²_ER is solely a function of E[Response|Stimulus] and thus solely a function of the stimuli. We note two caveats: (1) non-sensory signals can influence sensory responses, e.g., eye movements, which may be stimulus dependent, and (2) E[Response|Stimulus] is not the only component of the sensory response, e.g., variability can also be stimulus dependent [18].

Relationship of r̂²_ER to Û
We have taken a similar approach to Haefner and Cumming [6]; correspondingly, their estimator gives nearly identical results to ours in simulation (less than 0.0001% power unexplained at SNR = 1, vs. >1% for the other estimators), though we provide an important generalization. Their formula requires calculation of the sample variance because their derivation relies on the F-distribution formed by taking the ratio of the sum of squares of the model residual over the sample variance (see Methods, Û). This is problematic if stimuli are never repeated in an experiment (for example, in free-viewing experiments): then one must assume the trial-to-trial variability a priori, either from previous experimental measurements or by asserting a theoretical mean-variance relationship (e.g., the square root of Poisson distributed spiking gives σ² = 1/4). In another respect, Haefner and Cumming's estimator is more general than the r̂²_ER we have presented: r̂²_ER measures the variance of the mean-centered data explained by a single covariate, whereas Û measures the variance explained by a linear combination of up to m covariates. In particular, Û accounts for the decrease in degrees of freedom available to the noise as more covariates are added. We also provide the more general version of our estimator for the case of variance explained by a linear model (Eq 23), with the advantage discussed in the previous paragraph.

SNR
We found that differences in SNR can be substantial and widely varying across neurons, data sets, and recording methods. Given the rise of large-scale recordings and sharing of neuronal data, we believe unbiased estimates of SNR should be reported so that researchers can quickly judge whether a data set has sufficient statistical power, or whether its power is on par with that of data sets from potentially comparable studies. We provide a concrete criterion by which to interpret SNR: the statistical power to detect stimulus-driven response modulation. Strikingly, in our small sample of data sets, many neurons do not pass this criterion, suggesting that the adoption of a standard criterion for data quality, such as our SNR metric, could have a major impact in practice. Furthermore, guided by the metric, the experimentalist can take steps to improve SNR by increasing stimulus duration and the associated spike counting window, or by customizing stimuli to the preferences of a neuron. Alternatively, the deleterious effects of low SNR can be ameliorated by favoring repeats over number of stimuli (Fig 13).
One conceptual interpretation of the SNR metric we introduced is that it quantifies, on the time scale of the spike count window, the variance in the responses of the neuron attributable to its tuning curve relative to the trial-to-trial variability about that tuning curve. For example, on the time scale of 1 second, a large fraction of spike-based recordings had SNR > 1, indicating that more variance was caused by the stimulus than by other sources (Fig 12B, blue, orange, and green traces: median SNR > 1). Still, an appreciable number of neurons were dominated by their trial-to-trial variability. Whether this is the result of stimulus choice, and would perhaps differ in a more natural context, is an open question. Recent theoretical and experimental work has argued that weakly tuned and untuned neurons can contribute to sensory encoding [19][20][21]. The corrected estimate of SNR we provide (Eq 16), along with naturalistic stimulation, can help to identify such neurons.

Further work
Small improvements to our r̂²_ER estimator could be made by decreasing its bias in the case of very low SNR (see Methods, "Bias of r̂²_ER"). At very low SNR, a single neuronal recording has little inferential power, but across a population of neurons, estimates of the average correlation to a model's predictions can have low enough variance to provide useful inference. Yet at very low SNR an appreciable bias begins to appear that will remain in the population average. We showed this bias is a function of the covariance between the numerator (Ĉ^ER_m) and the inverse of the denominator (1/V̂^ER_m) (Eq 17). This covariance can be removed by using separate subsets of the data for estimation of the numerator and denominator. To reduce the influence of Jensen's gap, further work could attempt to directly estimate and correct for its value. In addition, analytic results on how this small-sample bias varies as a function of the critical parameters m, n, and SNR would aid its interpretation.
In the derivation of our estimator we assume the m responses across which the model predictions are evaluated are independent. Thus the estimator in its current form would not be appropriate for evaluating models that make predictions across spike counts in adjacent time bins. In future work we plan to extend our estimator to the case of correlated responses.
Here we have derived an estimator for the case where deterministic model predictions are correlated to a noisy signal. Often, one noisy signal is correlated to another, for example when judging the similarity of tuning curves from two neurons (termed signal correlation). We have extended the methods described here to the neuron-to-neuron case and will describe this in a forthcoming publication.
A subtle but important point about our estimator is that it assumes the stimuli are fixed: it estimates r²_ER for the exact stimuli shown. An investigator may be interested in the quality of a model across a large corpus of natural images, of which only a small fraction can be included in a recording session. In this case, one collects responses for a random sample of images, fits the model to some (the training set), and tests the model on others (the test set). The random test set will account for over-fitting, and using r̂²_ER will account for noise in the neural responses in the evaluation on the test set. Crucially, though, this does not account for the variability in the parameters of the model induced by the random training sample. Intuitively, estimated model parameters will vary across image sets even in the absence of trial-to-trial variability. The correct interpretation of r̂²_ER in this case is that it estimates how well a model can perform given finite noisy training data on noiseless test-set data, not the best the model could possibly perform given infinite training data. Indeed, with more neural responses and less noise, model test-set performance would improve. David and Gallant [5] explored this issue, calling it 'estimation noise', and provided an extrapolation method for estimating the fit of a linear model given unlimited stimuli. That estimator was not evaluated in terms of its bias or variance, and no analytic solutions that directly remove the bias of finite training data have been proposed. Both are valuable directions to pursue: the former to build confidence in the current method and the latter for potential gains in trial efficiency. A data-driven re-sampling approach may be unavoidable when evaluating complex models, such as a deep neural network or a biophysical model, where the relationship between the number of training samples and model performance is analytically intractable.

Simulation procedure
To simulate model-to-neuron fits, the square root of the neural response to the ith of m stimuli on the jth of n trials is modeled as an independent normally distributed random variable,

$$Y_{i,j} \sim \mathcal{N}(\mu_i, \sigma^2),$$

where the variance σ² is the same across all Y_{i,j}. The mean response of the neuron to the ith stimulus (its tuning curve), μ_i, is constructed so that the fraction of its variance explained by the model predictions equals a chosen r²_ER. From this model we draw n responses for each of the m stimuli and apply our estimator to this sample. We repeat this across many IID simulations to accumulate reliable statistics.
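This generative model is easy to sketch numerically. The following is an illustrative implementation (ours, not the authors' code); the construction of μ from the model predictions at a target r²_ER is one convenient choice:

```python
import numpy as np

def simulate_responses(m=80, n=4, r2_er=0.8, d2=1.0, sigma2=1.0, seed=0):
    """Draw an n x m matrix Y with Y_ij ~ N(mu_i, sigma2), where the tuning
    curve mu has dynamic range d2 and squared correlation r2_er with nu."""
    rng = np.random.default_rng(seed)
    nu = rng.standard_normal(m)            # model predictions
    u = nu - nu.mean()
    u /= np.linalg.norm(u)                 # centered unit vector along nu
    w = rng.standard_normal(m)
    w = w - w.mean() - (w @ u) * u         # centered, orthogonal to u
    w /= np.linalg.norm(w)
    mu = np.sqrt(m * d2) * (np.sqrt(r2_er) * u + np.sqrt(1 - r2_er) * w)
    Y = mu + np.sqrt(sigma2) * rng.standard_normal((n, m))
    return Y, nu

def naive_r2(Y, nu):
    """Pearson r^2 between trial-averaged responses and model predictions."""
    return np.corrcoef(Y.mean(axis=0), nu)[0, 1] ** 2
```

Averaged over many simulated neurons with r2_er = 0.8, d2 = 1, sigma2 = 1, and n = 4, the naive r̂² comes out well below 0.8, illustrating the attenuation caused by trial-to-trial noise.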

Assumptions and terminology for derivation of unbiased estimators
Below we derive an unbiased estimator of the fraction of variance explained when a known signal is fit to noisy neural responses. For this derivation, we assume the responses have undergone a variance-stabilizing transform such that trial-to-trial variability is the same across all stimuli. For example, if the neural responses are Poisson distributed, Y_{i,j} ~ Poisson(λ_i), where Y_{i,j} is the response to the jth repeat of the ith stimulus, which has expected response λ_i, then a variance-stabilizing transform is the square root. The expected value of the transformed response, Y*_{i,j} = √(Y_{i,j}), still increases with λ_i, whereas the variance is now approximately constant. To improve the estimate of the mean response, n repeats of each stimulus are collected. Invoking the central limit theorem, we can make the approximation

$$\bar{Y}_i = \frac{1}{n}\sum_{j=1}^{n} Y^*_{i,j} \;\dot\sim\; \mathcal{N}\!\left(\mu_i, \frac{\sigma^2}{n}\right),$$

where the average across the n repeats is approximately normally distributed with variance decreasing with n. The assumption of a Poisson-distributed neural response is not always accurate. A more general mean-to-variance relationship,

$$\mathrm{Var}[Y_{i,j}] = a\,\lambda_i^{\,b},$$

can be approximately stabilized to 1 by

$$Y^*_{i,j} = \frac{Y_{i,j}^{\,1-b/2}}{\sqrt{a}\,\left(1-\tfrac{b}{2}\right)} \qquad (b \neq 2).$$

A square root will stabilize any linear mean-to-variance relationship (b = 1), but an unknown slope, a, requires that this parameter be estimated. In the case of the linear relationship, this simply requires taking a square root and then averaging the estimated variance, which is constant, across all stimuli. To account for more diverse mean-to-variance relationships, the Box-Cox technique can be used to find an appropriate exponent by which to transform the data [22]. For the derivation below, we assume that variance-stabilized responses to n repeats have been averaged for each of m stimuli to yield the mean response to the ith stimulus, Ȳ_i ~ N(μ_i, σ²/n), where σ² is the trial-to-trial variability and μ_i the ith expected value.
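A quick numerical check (illustrative, not from the paper) confirms that the square-root transform holds the variance of Poisson counts near 1/4 across a wide range of rates:

```python
import numpy as np

rng = np.random.default_rng(1)
for lam in [5.0, 10.0, 20.0, 40.0]:
    y = rng.poisson(lam, 200_000)   # Poisson responses: variance grows with the mean
    z = np.sqrt(y)                  # square-root transform
    # Var[sqrt(Y)] stays near 1/4 regardless of lam
    print(f"lam={lam:5.1f}  var(sqrt(Y))={z.var():.3f}")
```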

Unbiased estimation of r²
Given a set of mean neural responses, Ȳ_i, and model predictions, ν_i, the naive estimator, r̂², is calculated as follows:

$$\hat{r}^2 = \frac{\left[\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\bar{Y}_i-\bar{\bar{Y}})\right]^2}{\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2\,\sum_{i=1}^{m}(\bar{Y}_i-\bar{\bar{Y}})^2}. \quad (7)$$

Our goal is to find an estimator such that

$$E\left[\hat{r}^2_{ER}\right] \approx r^2_{ER} = \frac{\left[\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right]^2}{\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2\,\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2},$$

where r²_ER is the correlation in the absence of noise, i.e., the fraction of variance in the expected response (ER) of the neuron, μ_i, explained by the model prediction, ν. Our strategy will be to remove the bias in the numerator and denominator separately and then reform the ratio of these unbiased estimators to obtain an approximately unbiased estimator.
Unbiased estimate of numerator. The numerator of Eq 7, which we call Ĉ_m, is a weighted sum of normal random variables that is then squared; thus it has a scaled non-central chi-squared distribution:

$$\hat{C}_m \sim \frac{\sigma^2}{n}\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2 \cdot \chi^2_1(\lambda), \qquad \lambda = \frac{\left[\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right]^2}{\frac{\sigma^2}{n}\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2},$$

and since $E[\chi^2_k(\lambda)] = \lambda + k$, its expectation is:

$$E\left[\hat{C}_m\right] = \left[\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right]^2 + \frac{\sigma^2}{n}\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2.$$

In the final line, the term on the left is the desired numerator and the term on the right is the bias contributed by σ². To form our estimator, Ĉ^ER_m, for the numerator of Eq 15, we simply subtract an unbiased estimator of this bias term from the numerator of the naive estimator:

$$\hat{C}^{ER}_m = \hat{C}_m - \frac{\hat{\sigma}^2}{n}\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2, \quad (11)$$

where σ̂² is typically the sample variance, s², estimated from the data, but it can be any unbiased estimator, even an assumed constant. For example, if stimuli are not repeated (i.e., n = 1) and one is willing to assume that responses are Poisson distributed, then the square root of these responses gives σ² = 1/4, and thus one can substitute σ̂² = 1/4. The case for the denominator is similar.
Unbiased estimate of denominator. The denominator of Eq 7, which we call V̂_m, is a weighted sum of squared normal random variables and thus also follows a scaled non-central chi-squared distribution, with expectation

$$E\left[\hat{V}_m\right] = \sum_{i=1}^{m}(\nu_i-\bar{\nu})^2\left[\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2 + (m-1)\frac{\sigma^2}{n}\right].$$

Similarly to the numerator, the first term is the desired denominator, and the second term is the bias. Thus, we subtract an unbiased estimate of this second term from the naive denominator:

$$\hat{V}^{ER}_m = \hat{V}_m - (m-1)\frac{\hat{\sigma}^2}{n}\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2. \quad (14)$$

Taking the ratio of these two unbiased estimators (Eqs 11 and 14) we have:

$$\hat{r}^2_{ER} = \frac{\hat{C}^{ER}_m}{\hat{V}^{ER}_m}. \quad (15)$$

This equation can be further simplified by scaling the model predictions such that $\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2 = 1$.
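For concreteness, the pieces above can be assembled in a few lines of code. This is an illustrative sketch (our variable names, not the authors' reference implementation), assuming an n × m matrix of variance-stabilized responses and using the pooled sample variance for σ̂²:

```python
import numpy as np

def r2_er_hat(Y, nu):
    """Estimate r^2_ER (Eq 15) from an n x m matrix Y of variance-stabilized
    responses (trials x stimuli) and m model predictions nu."""
    n, m = Y.shape
    ybar = Y.mean(axis=0)                                # mean response per stimulus
    sig2 = Y.var(axis=0, ddof=1).mean()                  # pooled trial-to-trial variance
    vc = nu - nu.mean()
    yc = ybar - ybar.mean()
    c_er = (vc @ yc) ** 2 - (sig2 / n) * (vc @ vc)       # unbiased numerator (Eq 11)
    v_er = (vc @ vc) * ((yc @ yc) - (m - 1) * sig2 / n)  # unbiased denominator (Eq 14)
    return c_er / v_er
```

In simulation, the naive Pearson r̂² computed from the same data is attenuated toward zero, while r̂²_ER centers near the true r²_ER.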
Estimators of correction terms. Two important parameters, $d^2 = \frac{1}{m}\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2$ and σ², are unknown. Below we provide unbiased estimators of each. An unbiased estimate of the sample variance for trials of the ith stimulus is

$$s_i^2 = \frac{1}{n-1}\sum_{j=1}^{n}\left(Y_{i,j}-\bar{Y}_{i,\cdot}\right)^2,$$

where the dot in the subscript of Ȳ_{i,·} indicates the mean over repeats. Assuming the variance is the same across stimuli, we can average over i for a global estimate:

$$\hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m}s_i^2.$$

Throughout the paper we use this as our estimate of trial-to-trial variability, σ̂².
For d² we have the naive estimate

$$\frac{1}{m}\sum_{i=1}^{m}\left(\bar{Y}_{i,\cdot}-\bar{\bar{Y}}\right)^2,$$

which would be inflated by trial-to-trial variability, so as an unbiased estimator we use

$$\hat{d}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(\bar{Y}_{i,\cdot}-\bar{\bar{Y}}\right)^2 - \frac{m-1}{mn}\hat{\sigma}^2.$$

We use this estimator to correct the estimate of SNR (Eq 5) for trial-to-trial variability as follows:

$$\widehat{\mathrm{SNR}} = \frac{\hat{d}^2}{\hat{\sigma}^2}. \quad (16)$$
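A corresponding sketch of the corrected SNR estimate (again illustrative, with our own variable names):

```python
import numpy as np

def snr_hat(Y):
    """Corrected SNR estimate d2_hat / sig2_hat (Eq 16) from an n x m
    trial-by-stimulus response matrix."""
    n, m = Y.shape
    sig2 = Y.var(axis=0, ddof=1).mean()        # trial-to-trial variance (sigma^2 hat)
    ybar = Y.mean(axis=0)
    d2_naive = ybar.var()                      # inflated by trial-to-trial variability
    d2 = d2_naive - (m - 1) / m * sig2 / n     # bias-corrected dynamic range (d^2 hat)
    return d2 / sig2
```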

Bias of r̂²_ER
To remove the bias of Pearson's r̂², we followed the approach of subtracting off its effect in the numerator and denominator. Prior work has not examined a potential problem with this approach: the expectation of a non-linear transformation of a set of random variables is not necessarily the transformation of their expected values. In this particular case, the expectation of the ratio is not necessarily the ratio of the expectations. Thus, even though we have removed the bias in the numerator and denominator, it does not follow that their ratio is unbiased. Calculating the expectation of the ratio, we see the conditions under which it will be unbiased:

$$E\left[\hat{r}^2_{ER}\right] = E\left[\hat{C}^{ER}_m\right]E\!\left[\frac{1}{\hat{V}^{ER}_m}\right] + \mathrm{Cov}\!\left(\hat{C}^{ER}_m,\ \frac{1}{\hat{V}^{ER}_m}\right). \quad (17)$$

If r̂²_ER is not unbiased for r²_ER, what recommends it over the naive r̂²? While we mainly focused on how, in simulation, it has a lower bias for typical ranges of parameters (Fig 3), it also has a theoretical justification. As we saw in simulation, as the number of stimuli, m, increases, its bias diminishes while that of r̂² does not (Fig 3C). Convergence to the parameter of interest, otherwise known as consistency, gives a theoretical justification for an estimator. Below we show that r̂²_ER is consistent for r²_ER while r̂² is not. We note that the covariance in Eq 17 can be removed by using separate subsets of the data for the estimation of Ĉ^ER_m and V̂^ER_m. This leaves the inflation by Jensen's inequality, $E[1/\hat{V}^{ER}_m] \ge 1/E[\hat{V}^{ER}_m]$, which could be estimated and corrected for via a simulation-based method such as the parametric bootstrap (see Discussion, "Further work").

Consistency of r̂²_ER in m.
We aim to show that r̂²_ER is consistent for r²_ER in m; more formally, for any ε > 0,

$$\lim_{m\to\infty} P\left(\left|\hat{r}^2_{ER} - r^2_{ER}\right| \ge \epsilon\right) = 0.$$

We make use of the continuous mapping theorem, which guarantees that if a random vector $X_m \xrightarrow{p} c$, then for a continuous transformation g, $g(X_m) \xrightarrow{p} g(c)$, where the random vector is almost surely different from any discontinuity points. Taking our random vector to be $[\hat{C}^{ER}_m, \hat{V}^{ER}_m]^T$, and our continuous transformation to be $g([\hat{C}^{ER}_m, \hat{V}^{ER}_m]^T) = \hat{C}^{ER}_m/\hat{V}^{ER}_m = \hat{r}^2_{ER}$ (assuming the expectation of the denominator is non-zero), it suffices to show that Ĉ^ER_m and V̂^ER_m are themselves consistent estimators of the numerator and denominator of r²_ER. First, we have already shown that Ĉ^ER_m and V̂^ER_m are unbiased. Next, we must show that their variance decreases with m, and then via Chebyshev's inequality we can show their convergence to their expectations. Here we consider the case where σ̂² = s², the sample variance.
Since the model predictions (ν_i) are fixed for the purpose of the proof, we assume the dot product between model predictions and neural responses scales linearly with m,

$$\frac{1}{m}\left[\sum_{i=1}^{m}(\nu_i-\bar{\nu})(\mu_i-\bar{\mu})\right]^2 \to \text{a positive constant},$$

as does the dynamic range of the neuron,

$$\frac{1}{m}\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2 = d^2,$$

and we scale the numerator and denominator of r̂²_ER by 1/m, which makes no change to their ratio.
The numerator, $\frac{1}{m}\hat{C}^{ER}_m$, has variance equal to the sum of the variances of its two terms (they are independent, since under normality the sample variance is independent of the sample means). Since $\mathrm{Var}[\chi^2_k(\lambda)] = 2k + 4\lambda$, the variance of the first term, after the 1/m scaling, is proportional to 1/m; the second term, $\frac{m-1}{nm}s^2$, has variance that also shrinks in proportion to 1/m, because s² pools m(n − 1) degrees of freedom. The denominator, $\frac{1}{m}\hat{V}^{ER}_m$, likewise has variance equal to the sum of the variances of its two terms (by independence), each proportional to 1/m. For both $\mathrm{Var}[\frac{1}{m}\hat{V}^{ER}_m]$ and $\mathrm{Var}[\frac{1}{m}\hat{C}^{ER}_m]$, everything but m is constant; therefore, we can find an m that brings the variance below any given ε. So by Chebyshev's inequality we have, as m → ∞,

$$\frac{1}{m}\hat{C}^{ER}_m \xrightarrow{p} \lim_{m\to\infty}\frac{1}{m}E\left[\hat{C}^{ER}_m\right],$$

and similarly,

$$\frac{1}{m}\hat{V}^{ER}_m \xrightarrow{p} \lim_{m\to\infty}\frac{1}{m}E\left[\hat{V}^{ER}_m\right].$$

Thus by the continuous mapping theorem:

$$\hat{r}^2_{ER} \xrightarrow{p} r^2_{ER}.$$

In contrast, we show below that the naive estimator is not consistent and provide insight into when the difference between r̂² and r̂²_ER is large.

Inconsistency of r̂² in m. Similarly to the previous derivation, we can take the numerator and denominator of r̂² (Eqs 9 and 12), scale by 1/m, find their expected values, and in turn find the asymptotic value of r̂². Here, though, we simplify by setting the model to be unit length, $\sum_{i=1}^{m}(\nu_i-\bar{\nu})^2 = 1$, which gives

$$\hat{r}^2 \xrightarrow{p} \frac{d^2}{d^2 + \sigma^2/n}\, r^2_{ER}.$$

This result shows that r̂² is not a consistent estimator in m of r²_ER: its asymptotic value is attenuated by the factor d²/(d² + σ²/n), which is far from 1 when trial-to-trial variability is large relative to the dynamic range.

Confidence intervals
Here we develop and prove a method that provides α-level confidence intervals for the estimator r̂²_ER. We considered the typical parametric and non-parametric bootstrap approaches but found that they were not reliable for typical ranges of parameters (see Results, "Confidence intervals for r̂²_ER").
Our approach hinges upon finding the lowest r²_ER whose sampling distribution would give an estimate greater than the observed r̂²*_ER with probability α/2, calling this r²_ER(l), and finding the highest r²_ER that would give an estimate less than the observed r̂²*_ER with probability α/2, calling this r²_ER(h). The interval [r²_ER(l), r²_ER(h)] then serves as our α-level confidence interval (see Fig 14 for a graphical explanation). We use a Bayesian framework to sample from the probability distribution f(r̂²_ER | r²_ER), parameterized by the observed neural statistics s² and d̂², allowing us to find [r²_ER(l), r²_ER(h)] under assumed uninformative priors on σ² and d² (see "Computing confidence intervals", below).
Proof of α-level confidence intervals. Here we justify this procedure for the case of r²_ER(h) (r²_ER(l) is similar). Our two main assumptions are that the cumulative distribution F(r̂²_ER | r²_ER) is stochastically increasing in r²_ER, and that for any observed r̂²*_ER and any a ∈ (0, 1) we can find an r²_ER such that F(r̂²*_ER | r²_ER) = a. We now consider two mutually exclusive possibilities. First, with probability 1 − α/2, the observed estimate satisfies F(r̂²*_ER | r²_ER) ≥ α/2; by stochastic monotonicity, this is exactly the event that r²_ER ≤ r²_ER(h), so the upper limit of the confidence interval contains the true value.

[Fig 14 caption: Under the assumption that the family of CDFs of X is stochastically increasing in μ, the event that T(x) ≥ α/2 corresponds to the event that μ ≤ μ_U; thus the upper limit of the confidence interval contains the true value of μ. In graphical terms, if the black horizontal dashed line is above the purple, then it is guaranteed that the purple vertical dashed line is to the right of the black. Thus these two events have the same probability: Pr(μ ≤ μ_U) = Pr(α/2 ≤ T(X)) = 1 − α/2. Generic symbols are used for illustration; in the notation of the proof, X = r̂²_ER and μ = r²_ER.]

Second, under repeated sampling, with the remaining probability α/2, the upper limit of our confidence interval, r²_ER(h), does not contain r²_ER. The proof for the lower end of the confidence interval, r²_ER(l), is similar. The probability of the mutually exclusive events that either r²_ER > r²_ER(h) or r²_ER < r²_ER(l) is the sum of the probabilities of the two events, α. See Fig 14 for a graphical explanation of this proof.
For simplicity of the proof, we assumed that it was possible to find F(r̂²*_ER | r²_ER(h)) = α/2, which is not necessarily the case because r²_ER(h) ∈ [0, 1] is bounded but r̂²_ER is not. If F(r̂²*_ER | r²_ER = 1) > α/2 or F(r̂²*_ER | r²_ER = 0) < α/2, then there is no r²_ER(h) that will achieve α/2. Under the condition F(r̂²*_ER | r²_ER = 1) > α/2, we simply set r²_ER(h) = 1, and since F(r̂²*_ER | r²_ER = 1) > α/2 implies F(r̂²*_ER | r²_ER ∈ [0, 1]) > α/2, the confidence interval will contain the true value. Under the condition F(r̂²*_ER | r²_ER = 0) < α/2, we set r²_ER(h) = 0, but we must make the confidence interval, though normally inclusive, non-inclusive. Intuitively, this is because if r²_ER = 0, then the upper end of the confidence interval would always contain the true value, and we would be restricted to α = 1. Making the CI non-inclusive avoids this problem and causes no problem when the true r²_ER is greater than zero.

To sample from f(r̂²_ER | s², d̂², r²_ER), we assume that σ² and d² follow an uninformative non-negative uniform prior (U[0, 1]), and given their observed estimates s² and d̂², we obtain samples from the posterior distribution of σ² and d² via the Metropolis-Hastings sampling method (for details see "Bayesian model and simulation"). For a chosen r²_ER (e.g., a candidate for r²_ER(h)), we sample from f(r̂²_ER | s², d̂², r²_ER) by drawing samples of σ² and d² from the posterior distribution f(σ², d² | s², d̂²) while r²_ER is fixed to the desired value. For each such sample we then draw observations Y and predictions ν_i from the model described in Eq 6 and finally calculate r̂²_ER, yielding a sample from f(r̂²_ER | s², d̂², r²_ER).

Computing confidence intervals. We use a simple iterative bracketing algorithm to narrow down the range of candidate values for the ends of our confidence interval. For example, to estimate r²_ER(h) within [0, 1], we first evaluate the highest possible value: 1.
We sample N = 2,500 draws from f(r̂²_ER | s², d̂², r²_ER(c) = 1) to find the proportion, p̂, of those less than or equal to r̂²*_ER. We then calculate a z-statistic to test whether this proportion is significantly different from the desired α/2:

$$z = \frac{\hat{p} - \alpha/2}{\sqrt{\frac{\alpha}{2}\left(1-\frac{\alpha}{2}\right)/N}}.$$

At some desired significance level (here p < 0.01), we either do not reject the null and accept r²_ER(h) = r²_ER(c), or we reject the null. In the latter case, if z is positive we determine that r²_ER(h) must be higher, whereas if z is negative it must be lower. In the case where r²_ER(c) = 1 and z is positive, there are no higher possible values of r²_ER(h), and thus we accept r²_ER(h) = r²_ER(c). Otherwise, on the next step we choose a new candidate by sampling r²_ER(c) ~ U[0, 1] and evaluating the result: if we reject the null and z is positive, our new interval is [r²_ER(c), 1]; if z is negative, [0, r²_ER(c)]. If we do not reject the null, we accept r²_ER(h) = r²_ER(c). We continue this bracketing until we do not reject the null or a pre-determined number of splits has passed (here we use 100). The accuracy of this algorithm increases with the number of splits and simulation samples.
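The acceptance test at each bracketing step can be sketched as follows (illustrative; the helper name is ours):

```python
import numpy as np

def candidate_test(r2_samples, r2_obs, alpha=0.01):
    """z-test of whether the proportion of simulated r2_ER-hat values at or
    below the observed estimate differs from the target alpha/2."""
    r2_samples = np.asarray(r2_samples)
    N = r2_samples.size
    p_hat = np.mean(r2_samples <= r2_obs)
    target = alpha / 2.0
    # standard one-proportion z-statistic against the target alpha/2
    z = (p_hat - target) / np.sqrt(target * (1.0 - target) / N)
    return z  # |z| small: accept candidate; z > 0: raise candidate; z < 0: lower it
```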
Confidence interval validation. We used simulations to evaluate our confidence interval method under the sampling distribution f(r̂²_ER | s², d̂², r²_ER). Conceptually, this is the distribution of r̂²_ER after data has been collected and the sample variance and sample dynamic range calculated, when we wish to assess the data's fit to a model with unknown but fixed r²_ER. To demonstrate that our method contains the unknown r²_ER at the desired α, our procedure is as follows. For a chosen n, m, σ², d², and r²_ER, sample an n × m data matrix (Y) and calculate its sample variance s² and dynamic range d̂². Then, using the Metropolis-Hastings algorithm (see Methods, "Bayesian model and simulation"), draw 5,000 samples from the posterior distribution f(σ², d² | s², d̂²). Next, we simulate the distribution of r̂²_ER for each of these parameter samples by drawing from f(r̂²_ER | σ², d², r²_ER). For each of these draws, we construct confidence intervals and then calculate the proportion of times that the confidence intervals contain the true r²_ER. This proportion estimates the true α level of the confidence interval method.

Bayesian model and simulation. We sample from the posterior of two parameters, σ² and d², whose associated sufficient statistics are the sample variance, s², and the sample dynamic range, d̂². The posterior is proportional to the product of the likelihoods of these two statistics, where the equality is derived by recognizing that the sample variance (s²) and dynamic range (d̂²) are independent and setting the prior to be uniform non-negative. The estimates s² and d̂² are fixed, calculated from the data, and our goal is to look up the distribution of the parameters given these fixed values. We use the Metropolis-Hastings algorithm to draw from the desired posterior, computing an acceptance ratio a for each proposed pair of candidates. If a > 1, we accept the candidates as our new current samples; if a < 1, we draw u ~ U[0, 1] and accept the candidates if u < a, otherwise retaining the current samples. Throughout the paper we run the chain for 5,000 iterations and then randomly sample with replacement from it.

SNR relation to F-test and number of trials
Our goal is to find, for a given SNR and number of repeats, the number of stimuli needed to reliably detect tuning under an F-test. To calculate the F-statistic for testing whether there is variation in the expected responses across stimuli (i.e., stimulus selectivity), we form the ratio

$$F = \frac{n\sum_{i=1}^{m}\left(\bar{Y}_{i,\cdot}-\bar{\bar{Y}}\right)^2 / (m-1)}{\sum_{i=1}^{m}\sum_{j=1}^{n}\left(Y_{i,j}-\bar{Y}_{i,\cdot}\right)^2 / \left(m(n-1)\right)},$$

where for clarity we indicate dimensions averaged over with a dot. The numerator calculates the amount of variance explained by the stimuli and the denominator the amount of variance unexplained by the stimuli. The numerator follows a scaled non-central χ² distribution with m − 1 degrees of freedom and non-centrality parameter

$$\lambda = \frac{n\sum_{i=1}^{m}(\mu_i-\bar{\mu})^2}{\sigma^2} = mn\,\mathrm{SNR},$$

where the final equality comes from the definition of SNR (Eq 5). The denominator follows a central χ² distribution with m(n − 1) degrees of freedom. Thus, taking the ratio, we have a singly non-central F-distribution:

$$F \sim F_{m-1,\,m(n-1)}\!\left(mn\,\mathrm{SNR}\right).$$

To test for significant tuning, we set an α-level criterion, c_F(α), under the null hypothesis that the observed F-statistic is from a central F-distribution:

$$P\left[F_{m-1,\,m(n-1)} \ge c_{F(\alpha)}\right] = \alpha.$$

Finally, given m and n, we can find the SNR where, for some high probability β,

$$P\left[F_{m-1,\,m(n-1)}\!\left(mn\,\mathrm{SNR}\right) > c_{F(\alpha)}\right] = \beta. \quad (22)$$

We set β = 0.99 and α = 0.01 and numerically solve for the SNR.
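The F-statistic above can be computed directly from an n × m response matrix; a sketch assuming SciPy is available (function name ours):

```python
import numpy as np
from scipy.stats import f

def tuning_f_test(Y, alpha=0.01):
    """F-test for stimulus selectivity on an n x m trial-by-stimulus matrix.
    Returns the F-statistic and whether tuning is significant at level alpha."""
    n, m = Y.shape
    ybar = Y.mean(axis=0)
    ms_stim = n * ((ybar - ybar.mean()) ** 2).sum() / (m - 1)  # explained by stimuli
    ms_err = ((Y - ybar) ** 2).sum() / (m * (n - 1))           # unexplained by stimuli
    F = ms_stim / ms_err
    return F, f.sf(F, m - 1, m * (n - 1)) < alpha
```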

Electrophysiological data
We reanalyzed a variety of neuronal data from previous studies. This includes three experiments in area V4 and one in MT of the awake, fixating rhesus monkey (Macaca mulatta), as well as spiking and two-photon imaging in awake mouse VISp. Experimental protocols for all studies are described in detail in the original publications. From Pasupathy and Connor [2], we examined responses of 109 V4 neurons to a set of 362 shapes. There were typically 3-5 repeats of each stimulus, but we used only the 96 cells that had at least 4 repeats for all stimuli. We used the spike count for each trial during the 500 ms stimulus presentation.
From Popovkina et al. [23], we examined responses of 43 V4 neurons (7 from one monkey, 36 from another) to filled stimuli (drawn from the same set of shapes used for the previous study) and to outline stimuli that were identical except that the fill was set to the background color. Stimulus color and luminance were customized to elicit a robust response from the recorded neuron. Spikes were counted over the 300 ms duration of each stimulus presentation.
From the 2019 UW V4 Neural Data Challenge, we examined single-unit (SUA) and multi-unit (MUA) data from 7 V4 recordings. Up to 601 images were shown, with between 3 and 20 repeats of each image. The images were drawn semi-randomly from the 2012 ILSVRC validation set of images [24]: an 80 × 80 pixel patch was sampled and a soft window applied (circular Gaussian, SD 16 pixels, applied to the alpha channel). Images were shown for 300 ms with 250 ms between images. The model we analyze was the winner of the Neural Data Challenge (out of 32 competitors) on held-out data from the 14 sets of V4 responses to natural images.
From Zohary et al. [12,13], we examined responses from 81 pairs of MT neurons recorded from three awake rhesus monkeys (Macaca mulatta) viewing dynamic random dots (stimuli described in Britten et al. [25]). The optimal speed of the drifting dots was found for one of the two neurons being recorded. Eight different directions of motion, at 45° increments, were each repeated 10-20 times. Monkeys performed a two-alternative forced-choice motion direction discrimination task during the experiment. Spikes were counted in the 2 s window of stimulus presentation. The experimenters were rigorous in recording only from pairs of neurons whose spike waveforms were strikingly different.
From the Allen Institute for Brain Science (AIBS) mouse database [16], we examined calcium fluorescence data. Fluorescence of mouse visual cortex neurons expressing GCaMP6f was measured via two-photon imaging through a cranial window. We analyzed signals recorded in response to natural scenes and static gratings, each presented for 0.25 s with no inter-stimulus interval and 50 repeats in random order. The natural scene stimulus consisted of 118 natural images from a variety of databases. The static grating stimulus consisted of a full-field static sinusoidal grating at a single contrast (80%). Gratings were presented at 6 orientations, 5 spatial frequencies (0.02, 0.04, 0.08, 0.16, 0.32 cycles/degree), and 4 phases (0, 0.25, 0.5, 0.75). For every trial, ΔF/F was estimated using the average fluorescence of the preceding second (4 image presentations) as baseline: we analyzed the average change in fluorescence during the 0.5 s period after the start of the image presentation relative to the 1 s before. We also examined SUA data from the AIBS mouse Neuropixels data set [17], which was recorded in response to the same stimuli as the calcium data. Spike counting windows were 0.25 s.
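A minimal sketch of the per-trial ΔF/F computation described above, assuming a uniformly sampled fluorescence trace and mean-over-window estimates; the function and its window parameters are illustrative, not the AIBS pipeline.

```python
def delta_f_over_f(trace, t_start, fs, response_dur=0.5, baseline_dur=1.0):
    """Per-trial dF/F: mean fluorescence in the response window after stimulus
    onset, relative to the mean over the preceding baseline window.

    trace: 1-D list of fluorescence samples
    t_start: stimulus-onset time (s)
    fs: sampling rate (Hz)
    """
    i0 = int(t_start * fs)
    baseline = trace[i0 - int(baseline_dur * fs): i0]   # preceding 1 s
    response = trace[i0: i0 + int(response_dur * fs)]    # 0.5 s after onset
    f0 = sum(baseline) / len(baseline)
    f = sum(response) / len(response)
    return (f - f0) / f0

# A flat baseline of 100 jumping to 110 at t = 2 s (fs = 10 Hz) gives dF/F = 0.1
trace = [100.0] * 20 + [110.0] * 10
dff = delta_f_over_f(trace, t_start=2.0, fs=10)
```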

Relation of $\hat{r}^2_{\mathrm{ER}}$ to prior estimators
Our estimator, $\hat{r}^2_{\mathrm{ER}}$, is derived via the strategy of Haefner and Cumming [6] to unbias the numerator and denominator of the coefficient of determination. Haefner and Cumming in turn cite Sahani and Linden [3] as the predecessor to their method. Sahani and Linden constructed an unbiased estimator of the variation in the expected response of the neuron (i.e., the tuning curve), which they called 'signal power'. With it they normalize an estimator of explained variation that is unbiased with respect to noise only under conditions they did not specify (1-parameter regression of model predictions; see Methods, "Derivation of normalized signal power explained (SPE_norm)"). By not specifying how a given model prediction should be fit to the neural data before estimating the quality of the fit, Sahani and Linden introduced potential problems into their estimator. This was recognized by Schoppe et al. [8], who pointed out that the estimator is sensitive to differences in mean and amplitude between the model predictions and the neural data. Consequently, the estimator can give large negative values because the squared error between model predictions and neural responses is unbounded. This criticism, while technically correct, is easily overcome by regressing (with an intercept term) the given model predictions onto the neural data before computing normalized SPE. Schoppe et al., motivated by the problems they found in SPE, focused on simplifying the CC_norm of Hsu et al. [4] to not require re-sampling, by making use of the signal power estimator developed by Sahani and Linden. They derived a simple estimator whose square, termed here $\mathrm{CC}^2_{\text{norm-SP}}$, we find is essentially numerically equivalent to SPE_norm of Sahani and Linden in the case of one-parameter regression.
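The fix suggested above — regressing the model predictions, with an intercept term, onto the neural data before normalizing — has a simple closed form; the helper name below is ours, and the sketch assumes non-constant predictions.

```python
def fit_predictions(nu, ybar):
    """Least-squares regression (with intercept) of model predictions nu onto
    trial-averaged responses ybar. Removes any mean/amplitude mismatch between
    predictions and data before a normalized fit metric is computed."""
    m = len(nu)
    nu_mean = sum(nu) / m
    y_mean = sum(ybar) / m
    sxy = sum((x - nu_mean) * (y - y_mean) for x, y in zip(nu, ybar))
    sxx = sum((x - nu_mean) ** 2 for x in nu)  # assumes nu is not constant
    b = sxy / sxx
    a = y_mean - b * nu_mean
    return [a + b * x for x in nu]  # rescaled, recentered predictions

# Predictions that differ from the data only in offset and gain are fit exactly
nu = [0.0, 1.0, 2.0, 3.0]
ybar = [3.0, 5.0, 7.0, 9.0]          # = 2 * nu + 3
fitted = fit_predictions(nu, ybar)
```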
For the purpose of comparison, below we write out the exact formulas and approximate expected values of two prior methods [3,6] that are closely related to $\hat{r}^2_{\mathrm{ER}}$, in the notation we use throughout our paper. For all estimators, we assume responses to $m$ stimuli with $n$ repeats whose variance has been stabilized. The response to the $i$th stimulus on the $j$th repeat is $Y_{i,j} \sim N(\mu_i, \sigma^2)$, where $\sigma^2$ is the trial-to-trial variability and $\mu_i$ the expected response to the $i$th stimulus after variance stabilization. The predictions are fixed for the $m$ stimuli, the predicted expected value for the $i$th stimulus is $\nu_i$, and we assume the predictions have been fit by a linear model with $d$ degrees of freedom. When averaging data across trials our notation is $\bar{Y}_{i,\cdot} = \frac{1}{n}\sum_{j=1}^{n} Y_{i,j}$, and across stimuli, $\bar{Y}_{\cdot,j} = \frac{1}{m}\sum_{i=1}^{m} Y_{i,j}$.

Derivation of normalized signal power explained (SPE_norm). The original description of SPE [3] did not specify, when calculating sample estimates of 'power' (better known as variance), whether to normalize by $m-1$ (the unbiased estimate) or $m$ (the MLE). Nor was it specified whether an average across trials was used when calculating the difference between the variance of the measured response and that of the residual. We thus use the formula as described by Schoppe et al. [8], who provide code to unambiguously calculate SPE, albeit in a manner that differs from their text: in their text, $TP = \frac{1}{n-1}\sum_{j=1}^{n} \frac{1}{m-1}\sum_{i=1}^{m} (Y_{i,j} - \bar{Y}_{\cdot,j})^2$ (similar to Eq 5 of Sahani and Linden), but in their code it is implemented as $TP = \frac{1}{n}\sum_{j=1}^{n} \frac{1}{m-1}\sum_{i=1}^{m} (Y_{i,j} - \bar{Y}_{\cdot,j})^2$. In the context of the quantities being estimated, the latter makes more sense (see the calculation of the expected value of the denominator below).
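For concreteness, here is a sketch of the code-style total power $TP$ (the $1/n$ version) and the resulting Sahani-Linden-style signal power estimate, assuming an $m \times n$ response matrix; variable names are ours, and the estimate targets the across-stimulus variance of the expected response.

```python
def sample_var(xs):
    """Unbiased sample variance (normalized by len(xs) - 1)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

def signal_power(Y):
    """Signal power estimate from an m x n response matrix Y (stimuli x repeats):
    SP = (n * Var_i(mean response) - TP) / (n - 1), where TP is the per-trial
    across-stimulus variance averaged over trials (the 1/n, code-style version).
    In expectation, n*(signal + noise/n) - (signal + noise) = (n-1)*signal."""
    m, n = len(Y), len(Y[0])
    avg_resp = [sum(row) / n for row in Y]  # trial-averaged tuning curve
    tp = sum(sample_var([Y[i][j] for i in range(m)]) for j in range(n)) / n
    return (n * sample_var(avg_resp) - tp) / (n - 1)

# Noiseless responses: the signal power equals the tuning curve's sample variance
mu = [0.0, 1.0, 2.0, 3.0]
Y = [[mu_i] * 3 for mu_i in mu]
sp = signal_power(Y)
```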
For comparison to the derivation in Schoppe et al., their notation equates to ours as follows: $N = n$, $T = m$, $R = Y_{i,j}$, $y = \bar{Y}_{i,\cdot}$, $\hat{y} = \nu_i$, and $\mathrm{Var}(y) = \frac{1}{m-1}\sum_{i=1}^{m} (\bar{Y}_{i,\cdot} - \bar{Y}_{\cdot,\cdot})^2$. In our notation, their estimator is

$$\mathrm{SPE}_{\mathrm{norm}} = \frac{\mathrm{Var}(y) - \mathrm{Var}(y - \hat{y})}{\mathrm{SP}},$$

where SP is the Sahani-Linden signal power estimate. Taking the expectations of the numerator and denominator shows that the numerator is unbiased only if $d = 1$ (i.e., the model has only 1 term). Below we describe how Haefner and Cumming developed an estimator that accounts for degrees of freedom more generally.

Derivation of U. Haefner and Cumming [6] explicitly attempt to remove the bias of the coefficient of determination. Their unbiased estimator is derived by dividing the numerator and denominator of the coefficient of determination by the sample trial-to-trial variability, $\hat{s}^2 = \frac{1}{m(n-1)}\sum_{i=1}^{m}\sum_{j=1}^{n} (Y_{i,j} - \bar{Y}_{i,\cdot})^2$, correcting each for its expected bias, and forming the ratio to obtain the Haefner and Cumming estimator.

Extension of $\hat{r}^2_{\mathrm{ER}}$ to fit of linear model
The derivation of Haefner and Cumming via the non-central F-distribution was not necessary: the expectations of the numerator and denominator are straightforward to calculate as non-central χ² random variables. While our $\hat{r}^2_{\mathrm{ER}}$ is explicitly meant to be the analogue of Pearson's $r^2$, we rederive the Haefner and Cumming formula along the lines of $\hat{r}^2_{\mathrm{ER}}$ for measuring the fit of a lin-