Inferring decoding strategies for multiple correlated neural populations

Studies of neuron-behaviour correlation and causal manipulation have long been used separately to understand the neural basis of perception. Yet these approaches sometimes lead to drastically conflicting conclusions about the functional role of brain areas. Theories that focus only on choice-related neuronal activity cannot reconcile those findings without additional experiments involving large-scale recordings to measure interneuronal correlations. By expanding current theories of neural coding and incorporating results from inactivation experiments, we demonstrate here that it is possible to infer decoding weights of different brain areas at a coarse scale without precise knowledge of the correlation structure. We apply this technique to neural data collected from two different cortical areas in macaque monkeys trained to perform a heading discrimination task. We identify two opposing decoding schemes, each consistent with data depending on the nature of correlated noise. Our theory makes specific testable predictions to distinguish these scenarios experimentally without requiring measurement of the underlying noise correlations.


INTRODUCTION
Although much is known about how single neurons encode information about stimuli, how neurons contribute to percepts is less well understood 1 . The latter, called the "decoding problem", seeks to identify how the brain uses the information contained in neuronal activity. Although some studies have sought to understand principled ways to decode population responses in the presence of correlated noise [2][3][4][5][6][7][8][9][10][11][12] , the rules by which the brain actually integrates information across noisy neurons remain unclear.
Neuroscientists have traditionally investigated this question using two distinct approaches: causal or correlational. In causal approaches, experimenters selectively activate or inactivate brain regions of interest, and measure resulting perceptual or behavioural changes. In correlational approaches, experimenters measure correlations between behavioural choices and neuronal activity, typically quantified by 'choice probability' (reviewed in Ref. 13 ) or, more straightforwardly, by 'choice correlation' (CC) 14,15 . If CCs reflect a functional link between neurons and behaviour, one would expect brain areas with greater CCs to contribute more strongly to behaviour. This naïve view is contradicted by recent results that reveal a striking dissociation between the magnitude of CCs and the effects of inactivation across brain systems in rodents 16,17 and primates 18,19 . In hindsight, this apparent disagreement is not all that surprising because the two techniques, on their own, yield results whose interpretation is fraught with major difficulties.
For instance, the CC of a neuron depends not only on its direct influence on behaviour but also on the influence of all the other neurons with which it is correlated. As an extreme example, a neuron that is not decoded at all could be correlated with one that is, and thus exhibit choice-related activity 9 . Recent theoretical results show that it is possible, in principle, to use knowledge of noise correlations to extract decoding weights from CCs 14 . However, directly measuring the correlational structures that matter for decoding may be extremely difficult 20 . This problem is compounded by the fact that behaviourally relevant information may be distributed across neurons in multiple brain areas, so neuronal CCs in one area may depend on activity in other areas. Moreover, in causal approaches, inactivation of one brain area could lead to a dynamic recalibration of decoding weights from other areas. Therefore, changes in behavioural thresholds following inactivation may not be commensurate with the contribution of the area.
When analysed in conjunction, however, results from correlational and causal studies may together provide constraints that can be used to precisely determine the relative contributions of the brain areas involved. In this work, we extend recent theories 14,15,20 and propose a general framework for inferring decoding weights of neurons across multiple brain areas using CCs and changes in behavioural threshold following inactivation. The two quantities together provide a direct estimate of the relative contributions of different areas without needing to precisely measure the correlation structure. We demonstrate our technique by applying it to data from macaque monkeys trained to perform a heading discrimination task. In this task, there is a known discrepancy 18,[21][22][23] between CCs and the effects of inactivating two brain areas: although neurons in the ventral intraparietal (VIP) area were found to be substantially better predictors of the animal's choices than dorsal medial superior temporal (MSTd) neurons, performance is impaired by inactivating MSTd but not VIP. We use our framework to extract key properties of the decoder that can account for these counter-intuitive results. To our surprise, we find that, depending on the structure of correlated noise, experimental data are consistent with two opposing schemes that attribute either too much or too little weight to one of the areas.

We model the animal's stimulus estimate as a scaled combination of estimates derived from the two populations:

ŝ = a_x ŝ_x + a_y ŝ_y    (1)

where ŝ_x = w_x^T (r_x − f_x(s_0)) and ŝ_y = w_y^T (r_y − f_y(s_0)) denote unbiased estimates derived from neurons in subpopulations x and y respectively. Thus the problem of decoding multiple populations can be viewed as one of scaling and combining estimates from individual populations. Note that this is equivalent to a single linear decoder of both populations together using w = (a_x w_x, a_y w_y).
The form of equation (1) has two advantages: (i) it is easy to identify and compare the relative contributions of the two areas to behaviour through the ratio a_x/a_y, and (ii) one can dissociate how the weight patterns (w_x and w_y) and their scales (a_x and a_y) affect the output of the decoder.
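As a quick sanity check of equation (1), the following sketch (with arbitrary, hypothetical responses and weights; none of these values come from the paper) confirms that scaling and summing the two per-population estimates is algebraically identical to a single linear decoder with concatenated weights w = (a_x w_x, a_y w_y):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 6, 4                     # sizes of populations x and y (arbitrary)

# hypothetical responses, already expressed relative to the reference tuning,
# i.e. r below stands for r - f(s0) in equation (1)
rx = rng.normal(size=nx)
ry = rng.normal(size=ny)

wx = rng.normal(size=nx)          # weight patterns w_x, w_y (arbitrary)
wy = rng.normal(size=ny)
ax, ay = 0.7, 0.3                 # scaling factors a_x, a_y

# equation (1): scale and combine the per-population estimates
shat = ax * (wx @ rx) + ay * (wy @ ry)

# equivalent single linear decoder with concatenated weights
w = np.concatenate([ax * wx, ay * wy])
r = np.concatenate([rx, ry])
assert np.isclose(shat, w @ r)
```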
This mathematical separation is also appealing because it provides a common framework to synthesize results from experiments conducted at two fundamentally different levels of granularity. One class of experiments involves making fine measurements, such as the correlation between trial-by-trial fluctuations in the activity r_k of an individual neuron k and the animal's decision (Figure 1a). The second class of experiments studies causation by measuring the behavioural effects of inactivating certain candidate brain areas. For perceptual discrimination tasks, this is done by comparing coarse measures such as the animal's discrimination thresholds before (θ) and after (θ_-x and θ_-y) inactivating population x or y (Figure 1b).

Figure 1. Experimental strategies. (a) An illustration of a feedforward network with linear readout. The decoder linearly combines the activity r of neurons in populations x and y with weights w to produce an estimate ŝ of the stimulus. The activity r_k of an individual neuron is correlated with ŝ, which is quantified by either the choice probability CP_k or the closely related choice correlation C_k. In an optimal system, the weights w generate choice correlations that satisfy equation 2.1. (b) In inactivation experiments, the neurons of each population are inactivated and the resulting changes in behavioural threshold are recorded.
We would like to use these experimental measurements to identify the relative behavioural contributions of two brain areas. Therefore we will present a technique to infer neuronal weights in two brain areas, focusing primarily on how to extract the scaling factors, a_x and a_y, of the brain areas rather than the fine structure, w_x and w_y, of the decoding weights. We first present some results that allow us to examine the pattern of choice correlations of neurons in both areas to characterize the degree of suboptimality in decoding. We will then show how to combine choice correlations with inactivation results to obtain quantitative estimates of the relative scaling of readout weights in those areas.

Analysis of choice correlations
The choice correlation of a neuron k is the correlation coefficient, across repeated trials with the same stimulus s, between its response r_k and the animal's estimate of the stimulus ŝ: C_k = Corr(ŝ, r_k | s). It has recently been shown that readout weights are optimal only if neuronal choice correlations all satisfy the following relation 15 (Supplementary note S1):

C_k,opt = θ/θ_k    (2.1)

where C_k,opt is the choice correlation of neuron k expected from optimal decoding, θ_k is the discrimination threshold of neuron k, and θ is the behavioural discrimination threshold. Therefore if neurons from both areas satisfy the above equation, this gives us strong evidence that the neuronal weights, and consequently their relative scales a = (a_x, a_y), are optimal. As we will see later, the exact values of a can then be directly extracted from the behavioural thresholds θ_-x and θ_-y following inactivation of those areas.
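Equation 2.1 can be verified numerically. The sketch below uses a toy Gaussian population with an arbitrary positive-definite covariance (a stand-in, not the paper's data), builds the locally optimal linear readout w ∝ Σ⁻¹f′, and checks that every neuron's choice correlation has magnitude θ/θ_k:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20
fp = rng.normal(size=N)                  # tuning-curve slopes f'(s0), arbitrary
A = rng.normal(size=(N, N))
Sigma = A @ A.T + N * np.eye(N)          # generic positive-definite noise covariance

# locally optimal linear readout: w ∝ Σ⁻¹ f', normalized so the estimate is unbiased
w = np.linalg.solve(Sigma, fp)
w /= w @ fp                              # now w·f' = 1, so E[ŝ] = s

theta = np.sqrt(w @ Sigma @ w)           # behavioural threshold (std of ŝ, d' = 1)
theta_k = np.sqrt(np.diag(Sigma)) / np.abs(fp)   # single-neuron thresholds

# choice correlations of the optimal decoder: C_k = Cov(r_k, ŝ) / (σ_k θ)
C = (Sigma @ w) / (np.sqrt(np.diag(Sigma)) * theta)
assert np.allclose(np.abs(C), theta / theta_k)   # equation 2.1, up to sign
```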
The pattern of choice correlations generated by a generic suboptimal decoder is more complicated, as it depends explicitly on the structure of the noise covariance 14. For a population of N neurons, the covariance Σ describes the noise power along N orthogonal noise modes. Each of these modes contributes to the overall choice correlation according to (Supplementary note S2):

C_k = Σ_{i=1}^{N} f_i C_k,opt^(i)    (2.2)

In this expression we have decomposed the optimal pattern of choice correlations C_k,opt into components C_k,opt^(i) originating from the different noise modes of Σ, with Σ_{i=1}^{N} C_k,opt^(i) = C_k,opt. The multipliers f_i reflect the extent of suboptimality. When decoding weights are optimal, every multiplier f_i = 1, so the above equation reduces to equation 2.1.
In principle, it is very difficult to estimate all of the multipliers f_i because the components C_k,opt^(i) depend on the individual noise modes of Σ (Methods M1, equation 4). Directly measuring Σ is a notoriously challenging task 20 that involves simultaneously recording the activity of a large population of neurons, and is nearly impossible for certain areas due to the geometry of the brain. Even if such recordings are carried out, it would be impossible to get an accurate assessment of the fine structure of the covariance with limited data, due to errors arising from finite measurement density 24. Fortunately, since neuronal choice correlations are measurably large, one can infer decoding weights with reasonable precision by estimating only the few leading multipliers, which depend on the most dominant modes of the covariance. This is because if the correlated noise modes with small variance were to dominate the decoder, then only a tiny fraction of each neuron's variations would propagate to the decision, leading to immeasurably small choice correlations 15 (Figure S1). It is possible to determine properties of the leading modes of the covariance without large-scale recordings, and we will consider two ways of doing so, producing two different noise models: extensive information and limited information.
Extensive information model
A common way to measure important components of the covariance structure is through pairwise recordings. Noise covariance measured between pairs of neurons can be modeled as a function of their response properties, such as the difference in their preferred stimulus or the similarity of their tuning functions, to obtain empirical models of noise. One such model is limited-range noise correlations [25][26][27][28][29][30] , so called because they are proportional to signal correlation and are thereby limited in range to pairs with similar tuning. We use this model to approximate a full noise covariance for all neurons in the population 31,32 . A characteristic feature of extensive information models is that the amount of information in the neural activity is very large because it grows with population size [33][34][35] , hence the name. The amount of information extracted by a decoder restricted to the subspace spanned by the few dominant components of the covariance cannot be greater than the information available in that subspace. For a model with extensive information, this subset would be a tiny fraction of the total information available in the population. Although this restriction is motivated by the large magnitude of neuronal choice correlations, the choice of this noise model is only justified under the assumption that the brain is radically suboptimal.
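A minimal illustration of the extensive information regime, assuming Gaussian tuning curves and an exponential limited-range correlation kernel (both hypothetical stand-ins for the empirical model, not the paper's fitted parameters): linear Fisher information f′ᵀΣ⁻¹f′ keeps growing as neurons are added.

```python
import numpy as np

def fisher_info(N, c_max=0.3, L=0.5):
    """Linear Fisher information f'ᵀ Σ⁻¹ f' for N neurons with
    limited-range noise correlations c_ij = c_max * exp(-|φ_i - φ_j| / L)."""
    phi = np.linspace(-np.pi / 2, np.pi / 2, N)   # preferred headings
    g, sig_t = 10.0, 0.5                          # gain and tuning width (arbitrary)
    f = g * np.exp(-phi**2 / (2 * sig_t**2))      # mean rates at s = 0
    fp = f * phi / sig_t**2                       # tuning slopes f'(0)
    var = f + 1.0                                 # Poisson-like variance (+1 avoids zeros)
    R = c_max * np.exp(-np.abs(phi[:, None] - phi[None, :]) / L)
    np.fill_diagonal(R, 1.0)                      # correlation matrix, ones on diagonal
    Sigma = np.sqrt(var)[:, None] * R * np.sqrt(var)[None, :]
    return fp @ np.linalg.solve(Sigma, fp)

info = [fisher_info(N) for N in (32, 64, 128, 256)]
# information keeps growing with population size: the "extensive" regime
assert info[0] < info[1] < info[2] < info[3]
```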

Limited information model
Extensive information models are based on measurements of neural populations but, as we mentioned above, current recordings are not sufficient to measure or even infer the covariance matrix in vivo. It is therefore possible that information in cortex is not extensive. Indeed, the extensive information model conflicts with the fact that cortical neurons receive their inputs from a smaller population of neurons. The cortex must then inherit not only the input signal but also any noise in that input. This generates information-limiting correlations 15,20 in cortex, a form of correlated noise that looks exactly like the signal and thus cannot be averaged away by adding more cortical neurons. Since inferring the brain's decoding weights from choice-related activity depends on the noise covariance, we also consider the consequences of information-limiting correlations.
For fine discrimination between two neighboring stimuli s and s + δs, the signal is given by the change in mean population responses, f(s + δs) − f(s) ≈ δs f′(s). Information-limiting correlations for this task thus fluctuate along the direction f′, generating a covariance containing differential correlations 20, that is, a covariance component proportional to f′f′^T. The constant of proportionality, which we denote ε, represents the variance of the information-limiting correlations. With increasing population size, both the signal and this noise component grow identically, resulting in no further improvement in signal-to-noise ratio, and thus no improvement in discriminability. In general, ε could be very small, and hence information-limiting correlations may be very hard to detect with limited data, as they are easily swamped by noise arising from other sources. Nevertheless, this noise has enormous implications for decoding large populations because it limits the total information to 1/ε.
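The 1/ε bound can be made concrete with the Sherman–Morrison identity: adding the rank-one component ε f′f′ᵀ to any base covariance maps the information F₀ to F₀/(1 + εF₀), which can never exceed 1/ε. A sketch with an independent-noise base covariance, chosen purely for simplicity (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.01                                   # variance ε of the information-limiting noise

def infos(N):
    fp = rng.normal(size=N)                  # tuning slopes f'(s0), arbitrary
    Sigma0 = np.eye(N)                       # independent base noise: info is extensive
    Sigma = Sigma0 + eps * np.outer(fp, fp)  # add differential correlations ε f'f'ᵀ
    F0 = fp @ np.linalg.solve(Sigma0, fp)    # information without the limiting component
    F = fp @ np.linalg.solve(Sigma, fp)      # information with it
    return F0, F

for N in (50, 100, 400, 1600):
    F0, F = infos(N)
    assert np.isclose(F, F0 / (1 + eps * F0))  # Sherman-Morrison identity
    assert F < 1 / eps                         # total information is bounded by 1/ε
```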
When dealing with two populations x and y, one has to keep in mind that although they may together receive limited information, they need not inherit it from exactly the same upstream neurons. Therefore we construct a more general model allowing the two populations to receive both distinct and shared information. The covariance between two neurons in this more general model is still proportional to the product of the derivatives of their tuning curves. However, the constant of proportionality varies depending on whether the pair of neurons are both from population x (ε_xx), both from y (ε_yy), or from different populations (ε_xy) (Methods M9, equation 8). For a large population with this noise structure, the total information content within the x and y subpopulations alone is by construction equal to 1/ε_xx and 1/ε_yy respectively. The information in both populations together is limited as well, once again by the f′f′^T component of the covariance. Depending on ε_xy, the two subpopulations may contain completely redundant, independent, or synergistic information 36,37. In case the two populations receive information from the same source, ε_xx = ε_yy = ε_xy, yielding the familiar form of information-limiting correlations 15,20: Σ_lim = Σ + ε f′f′^T.

Correlations that limit information within a single neural population introduce redundancy. As a consequence, many different decoding weights can extract essentially the same information. The system is then robust to some suboptimal decoding, which makes it easier to achieve near-optimal behavioural performance 15. In the noise model for two populations described above, this is also true for each population individually. We can generalize this robustness in our framework by considering separate decoders of each population that produce estimates, ŝ_x and ŝ_y, that are near-optimal for their corresponding areas.
Importantly, however, these estimates may have different variances, and may even covary, so they need to be properly combined to produce a good single estimate according to equation 1. While information-limiting correlations within each area would make the system generally robust to the choice of weight patterns w_x or w_y, suboptimality could yet arise from an incorrect scaling (a_x and a_y) of the individual near-optimal estimates. This is because after the dimensionality reduction from large redundant populations down to a single unbiased estimate per population, there is no redundancy left: just one degree of freedom remains for the decoder, so different ways of combining the estimates are not equivalent.
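For two correlated unbiased estimates, the correct scaling is the classical minimum-variance combination. The closed form for a_x below is the standard textbook result under the constraint a_x + a_y = 1, not a formula quoted from this paper, and the variances are hypothetical:

```python
import numpy as np

# hypothetical variances and covariance of the two per-area estimates ŝ_x, ŝ_y
e_xx, e_yy, e_xy = 1.0, 4.0, 0.5

def combined_var(ax):
    """Variance of ŝ = a_x ŝ_x + a_y ŝ_y with a_x + a_y = 1 (keeps ŝ unbiased)."""
    ay = 1.0 - ax
    return ax**2 * e_xx + 2 * ax * ay * e_xy + ay**2 * e_yy

# classical minimum-variance combination of two correlated unbiased estimates
ax_opt = (e_yy - e_xy) / (e_xx + e_yy - 2 * e_xy)

# no other scaling does better than ax_opt
assert all(combined_var(ax_opt) <= combined_var(a) + 1e-12
           for a in np.linspace(-1, 2, 301))
```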
If the brain indeed combines activity from different areas suboptimally in this manner, then simplifying equation 2.2 in the presence of information-limiting correlations gives choice correlations within each area that are not equal to the optimal choice correlations, but are proportional to them (Supplementary note S5):

C_k = f C_k,opt = f θ/θ_k    (2.3)

Under these conditions, choice correlations in different areas x and y may have different multipliers f, say f_x and f_y, which depend on the scaling of the two brain areas and on the covariance between the two estimates derived from them. These multipliers f_x and f_y can be directly identified by regressing measured choice correlations against θ/θ_k, the choice correlations predicted for optimal decoding.
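The regression step can be sketched as follows, with synthetic choice correlations standing in for data (the multipliers 1.6 and 0.4, the sample sizes, and the noise level are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def multiplier(C_measured, C_opt):
    """Least-squares slope through the origin of measured CCs vs optimal CCs θ/θ_k."""
    return (C_opt @ C_measured) / (C_opt @ C_opt)

# synthetic example: area x decoded with multiplier 1.6, area y with 0.4
C_opt_x = 0.3 * rng.random(60)                      # hypothetical θ/θ_k for area-x neurons
C_opt_y = 0.3 * rng.random(60)
C_x = 1.6 * C_opt_x + 0.02 * rng.normal(size=60)    # "measured" CCs with a little noise
C_y = 0.4 * C_opt_y + 0.02 * rng.normal(size=60)

fx, fy = multiplier(C_x, C_opt_x), multiplier(C_y, C_opt_y)
assert abs(fx - 1.6) < 0.2 and abs(fy - 0.4) < 0.2  # recovers the true multipliers
```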

Combining choice correlations and inactivation effects to infer decoding weights
In the previous section, we showed how to reduce the fine structure of choice correlations down to one number for each population: f_x and f_y. We will now show how these multipliers can be used, together with the behavioural thresholds θ_-x and θ_-y following inactivation of areas x and y, respectively, to infer the relative scaling of their weights, a_x and a_y. Inactivating an area is equivalent to setting the scaling of weights in that area to zero, so from equation 1, the animal's total estimate ŝ would be equal to either ŝ_x or ŝ_y, depending on which area is inactivated. The resultant behavioural threshold would simply reflect the variance of the remaining estimate, which is equal to the magnitude of the dominant decoded noise within the active area, so θ_-x² ≈ ε_yy and θ_-y² ≈ ε_xx.
If populations x and y are uncorrelated (ε_xy = 0), then the ratio of weight scalings can be factorized into a product of ratios (Supplementary note S6):

a_x/a_y = (f_x/f_y) · (θ_-x²/θ_-y²)    (3.1)

where the two independent factors represent outcomes of correlational and causal studies respectively. If the readout is optimal, then the multipliers f_x and f_y are both equal to one, so a_x/a_y = θ_-x²/θ_-y². This is consistent with the general belief that the behavioural effects of inactivating a brain area must be commensurate with its contribution to the behaviour. A departure from optimality could break this relationship, so the effects of causal manipulation may not match the relative roles of the brain areas (Figure S2). Even in purely feedforward networks, the magnitude of neuronal choice correlations need not match the effects of inactivation. Thus, disagreements between the two experimental outcomes should not be entirely surprising and do not undermine the functional significance of either.
In fact, equation 3.1 reveals how one can combine choice correlations and behavioural thresholds to infer the contributions of two uncorrelated areas. But if the areas are correlated, one must explicitly account for the magnitude of the correlation between areas, ε_xy, and the ratio of scales no longer factorizes:

a_x/a_y = [(f_x/f_y)(θ_-x²/θ_-y²) − ρ] / [1 − (f_x/f_y) ρ]    (3.2)

where ρ = ε_xy/ε_xx is the magnitude of correlated noise between the two populations' estimates relative to the variance of estimates from x alone. Note that one can also use equations 3.1 and 3.2 to compute the optimal weight scaling factors simply by setting both f_x and f_y to 1. Therefore we can use these equations not only to determine the relative weights of brain areas but also to evaluate precisely how suboptimal those weights are.
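A sketch of this inference step, using hypothetical multipliers and post-inactivation thresholds. The correlated-areas expression coded here is a reconstruction consistent with ρ = ε_xy/ε_xx and with equation 3.1 in the ρ = 0 limit; it is an assumption to be checked against the supplementary derivation, not a quotation of equation 3.2:

```python
# behavioural thresholds after inactivating x or y (hypothetical values, in degrees)
th_nx, th_ny = 2.0, 1.2
fx, fy = 1.6, 0.4          # multipliers inferred from choice correlations (hypothetical)

# uncorrelated populations (equation 3.1): the ratio factorizes
ratio_uncorr = (fx / fy) * (th_nx**2 / th_ny**2)

def ratio_corr(rho):
    """Scaling ratio a_x/a_y with interareal correlation rho = ε_xy/ε_xx
    (reconstructed form; reduces to equation 3.1 when rho = 0)."""
    F, T = fx / fy, th_nx**2 / th_ny**2
    return (F * T - rho) / (1 - F * rho)

assert abs(ratio_corr(0.0) - ratio_uncorr) < 1e-12   # ρ = 0 recovers equation 3.1
```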

Application to data
We now use the techniques developed so far to infer the relative contributions of two brain areas in macaque monkeys to heading discrimination. Data were collected from monkeys trained to discriminate their direction of self-motion in the horizontal plane (Figure 2a) using vestibular (inertial motion) and/or visual (optic flow) cues (Methods M4; see also refs. 21,23). At the end of each trial, the animal reported whether its perceived heading ŝ was leftward (ŝ < 0°) or rightward (ŝ > 0°) relative to straight ahead.

Discrepancy between correlation and causal studies
Responses of single neurons were recorded from either area MSTd (monkeys A and C; n=129) or area VIP (monkeys C and U; n=88) during the heading discrimination task (Methods M5). Basic aspects of these responses were analyzed and reported in earlier work 21,23. Briefly, it was found that neurons in VIP had substantially greater choice correlations (CCs) than those in MSTd (Figure 2b, left) for both the vestibular and visual conditions. This difference in CCs between areas could not be attributed to differences in neuronal thresholds θ_k (Figure 2b, middle), defined as the stimulus magnitude that can be discriminated correctly 68% of the time (d′ = 1) from neuron k's response r_k (Methods M6; Figure S3). Based on its greater CCs, one might expect that VIP plays a more important role in heading discrimination than MSTd. In striking contrast to this expectation, a recent study showed that there was no significant change in heading thresholds following VIP inactivation for either the visual or vestibular stimulus conditions 18 (Figure 2b, right, blue; monkeys B and J). On the other hand, inactivation of MSTd using a nearly identical experimental protocol led to substantial deficits in heading discrimination performance 22 (Figure 2b, right, red; monkeys C, J, and S). The neural and inactivation studies in VIP used non-overlapping subject pools, so the observed dissociation between CCs and inactivation effects could potentially reflect idiosyncrasies of the subjects' brains. To rule this out, we repeated the inactivation experiment in another monkey by specifically targeting muscimol injections to sites in area VIP that were previously found to contain neurons with high CCs, and obtained similar results (Figure S4). These findings reveal a striking dissociation between choice correlations and the effects of causal manipulation: VIP has much greater CCs than MSTd, yet inactivating VIP does not impair performance.
One may be tempted to simply conclude that VIP does not contribute to heading perception. We will now show that this is not necessarily true. Depending on the structure of correlated noise and the decoding strategy, neurons in both areas may be read out in a manner that is entirely consistent with the observed effects of inactivation.

Test for Optimality
We first asked whether the above results could be explained simply by the brain allocating weights optimally to the two areas.
To answer this, we tested whether neuronal choice correlations satisfied equation 2.1. Binary discrimination experiments typically do not measure choice correlations C_k = Corr(r_k, ŝ | s = s_0) directly because they do not have access to the animal's continuous stimulus estimate ŝ; they only track the animal's binary choice. Instead they measure a related quantity known as choice probability, defined as the probability that a rightward choice is associated with an increase in the response of neuron k: CP_k = P(r_k⁺ > r_k⁻), where r_k± ∼ P(r_k | sgn ŝ = ±1) is the response of neuron k when the animal chooses ±1. Therefore we first transformed the measured choice probabilities into choice correlations using a known relation 14 before further analyses (Methods M7). Equivalently, one could measure the correlation Corr(r_k, sgn ŝ | s = s_0) between the neural response and the binary choice, which Ref. 15 showed is ≈ 0.8 C_k. Note that the above definition gives choice correlations that can be positive or negative, depending on whether rightward choices are associated with an increase or decrease in the neuronal response. Therefore we adjusted equation 2.1 to generate predictions for optimal CCs that accounted for this convention (Methods M7). Whereas the experimentally measured choice correlations (C_k) of neurons in MSTd are well described by the optimal predictions (C_k,opt) for both the vestibular and visual conditions, those of VIP neurons are systematically greater (Figure 3), demonstrating that our data are inconsistent with optimal decoding. This observation was consistent across all monkeys (Supplementary Figure S4a).

Figure 3 (caption fragment): Measured choice correlations versus optimal predictions for MSTd (blue) and VIP (red), in the vestibular (left) and visual (right) conditions. Solid lines correspond to the best linear fit. Vestibular data replotted from Ref. 15 with a different sign convention.
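The ≈0.8 factor relating the binary-choice correlation to C_k follows from Gaussian assumptions (for a standard normal z, E[|z|] = √(2/π) ≈ 0.798). A quick simulation, with an arbitrary true choice correlation of 0.5:

```python
import numpy as np

rng = np.random.default_rng(4)
n, c = 500_000, 0.5                        # trials and true choice correlation C_k

# bivariate Gaussian: neural response r_k and continuous estimate ŝ with Corr = c
shat = rng.normal(size=n)
r = c * shat + np.sqrt(1 - c**2) * rng.normal(size=n)

# correlation of the response with the binary choice sgn(ŝ)
cc_binary = np.corrcoef(r, np.sign(shat))[0, 1]

# for Gaussians this equals √(2/π)·C ≈ 0.8·C, the factor quoted in the text
assert abs(cc_binary / c - np.sqrt(2 / np.pi)) < 0.01
```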
Note that, if VIP is decoded suboptimally, then the overall decoding, based on both VIP and MSTd, is suboptimal as well, because the decoder failed to use all the information available in the neurons across both populations.
This leads to two questions: First, how much information is lost by suboptimal decoding? Second, how is this information lost? To get precise answers, we will now determine how the brain weights activity in MSTd and VIP to perform heading discrimination.

Inferring readout weights
Throughout this section, we use subscripts M and V to denote MSTd and VIP instead of the generic subscripts x and y used to describe the methods. For clarity, we will restrict our focus to the vestibular condition but results for the visual condition are presented in the supplementary notes. In order to determine decoding weights, we constructed two kinds of covariance structures that implied either extensive or limited information as explained earlier.
In the extensive information case, we modeled noise covariance using data from pairwise recordings within MSTd and VIP reported previously 21,29. Those experiments established that the noise correlation between neurons in these areas tends to increase linearly with the similarity of their tuning functions, or signal correlation (Methods M8, equation 7.1). This relationship between noise and signal correlations has a substantially steeper slope in VIP than in MSTd (MSTd: slope = 0.19 ± 0.08; VIP: slope = 0.70 ± 0.16; Figure S5). We used these empirical relationships to extrapolate noise correlations between all pairs of independently recorded neurons within each of the two populations, using only their tuning curves, and assuming that any stimulus-dependent changes in correlation were negligible. Since correlations between the VIP and MSTd populations were not measured experimentally, we explored different correlation matrices (Methods M8, equation 7.2).
In the limited information case, we added correlations that limited the total information content across the two populations (Methods M9, equation 8). For this latter case, we relied on behavioural thresholds before and after inactivation, together with choice correlations, to determine the magnitudes of noise within (ε_MM and ε_VV) and between (ε_MV) areas (Methods M9). In both cases, we constructed covariances for many different population sizes N by sampling equal numbers of neurons from both areas with replacement. The choice of distributing neurons equally between the two areas was made only for convenience and has no bearing on the result, as explained later.

Figure 4a shows example covariance matrices for both the extensive and limited information models for a population of 128 neurons. The two structures look visually similar because the additional fluctuations caused by information-limiting correlations are quite subtle. Nevertheless, there is a huge difference between the two models in terms of their information content (Figure 4b). The extensive model has information that grows linearly with N, implying that these brain areas have enough information to support behavioural thresholds that are orders of magnitude better than what is typically observed. However, when information-limiting correlations are added, information saturates rapidly, suggesting that behavioural thresholds may not be much lower than population thresholds even if the decoding weights are fine-tuned for best performance. We will now infer the scaling factors a_M and a_V of the decoding weights using both noise models and examine their implications.

Extensive information model
We have already seen that the pattern of choice correlations is not consistent with optimal decoding of MSTd and VIP. In fact, for the extensive information model, optimal decoding would lead to extremely small CCs by suppressing response components that lie along the leading noise modes, as they carry very little information (Figure S6a). Ironically, the magnitude of CCs found in our data could only have emerged if the response fluctuations along those leading modes substantially influenced the animal's choice (Figure S6b). This means that the decoder must be largely confined to the subspace spanned by those modes. We therefore restricted our focus to the two leading eigenvectors e_1 and e_2 of the covariance matrix. When the two populations are uncorrelated, these vectors lie exclusively within the subspaces spanned by neurons in MSTd and VIP respectively (Figure 5a). In our case, the vectors e_1 and e_2 corresponded to e_V and e_M. Although decoding only this subspace is not optimal with respect to the total information content in the two areas, a decoder could still be optimal within that subspace. To test this, we estimated the multipliers from the measured choice correlations (Table 1), applied the exact rather than approximate form of equation 3.1, and obtained a scaling ratio a_M/a_V = 0.8 ± 0.1.
To test whether the inferred scaling was meaningful, we compared behavioural thresholds implied by the resulting decoding scheme against experimental findings of inactivation. The threshold prior to inactivation is related to the variance of the estimator whose decoding weights ! are along the direction specified by 0 b g b + 0 c g c . Inactivating either area is equivalent to setting the corresponding scaling to zero so postinactivation thresholds are given by the variance along the leading noise mode specific to the active area (g b or g c ). We computed pre and post-inactivation thresholds and found they were qualitatively consistent with experimental results: for large populations, MSTd inactivation is predicted to produce a large increase in threshold (Figure 5c, red vs black) whereas VIP inactivation is predicted to have little or no effect (Figure 5c, blue vs black; see Figure S7 for visual condition). This correspondence to experimental inactivation results is remarkable because the procedure to deduce scalings 0 b and 0 c was not constrained in any way by behavioural data, but rather informed entirely by neuronal measurements. We also confirmed that the threshold expected from optimal scalings ( Table 1) was smaller than that produced by inferred weights (Figure 5c, green vs black) implying that the brain indeed weighted the two areas suboptimally.  Figure 4b, the behavioural threshold predicted by the inferred weights (black) saturates at a population size of about 100 neurons. The green line indicates the performance of an optimal decoder within the two-dimensional subspace. Inactivating VIP is correctly predicted to have no effect on behavioural performance for large h (blue), while MSTd inactivation increases the threshold (red). (d) A schematic of the inferred decoding solution projected onto the first principal component of noise in VIP and MSTd. The solid colored lines correspond to the readout directions for the four cases shown in (c). 
The long diagonal black line is the projection of the mean population responses for headings from -9° to +9°, and the two gray ellipses correspond to the noise distributions at heading directions of ±2°. The colored Gaussians correspond to the projections of this signal and noise onto each of the four readout directions, and the overlap between these Gaussians corresponds to the probability of discrimination errors. (e) The percentage of available information read out by the inferred decoder (the decoding efficiency) decreases with population size, because the decoded information saturates while the total information is extensive. (f) Correlations between MSTd and VIP were not measured experimentally. We modeled these correlations according to the same linear trend that on average described correlations within each population, but with different slopes, yielding different interareal correlations parametrized by the relative slope W (Methods M8, Equation 7.2). This slope is bounded above by its maximum allowable value W_max, set by the geometric mean of the slopes for MSTd and VIP. (g) For each value of W, we used the resultant covariance and CCs to infer the decoder, and plotted its behavioural thresholds. Thresholds are shown for a population of 256 neurons, by which point performance had saturated to its asymptotic value for all W. Shaded regions in (c), (e), and (g) represent ±1 SEM.
The above findings are explained graphically in Figure 5d by projecting the relevant quantities (tuning curve derivative f′, noise covariance Σ, decoding weights w) onto the subspace of the first two principal components (g_b and g_c) of the noise covariance Σ. The colored lines indicate different readout directions, determined by the scalings (v_b and v_c) of the weights for the two populations. A ratio v_b/v_c > 1 corresponds to greater weight on the estimate derived from MSTd activity, and the associated readout direction lies closer to the principal component of MSTd. The response distributions are depicted as gray ellipses (isoprobability contours) for the two stimuli to be discriminated. The discrimination threshold for different decoders can be obtained simply by projecting these ellipses onto the readout direction of the specified decoder and examining the overlap between the projections. Within this subspace, the ratio v_b/v_c of the decoder inferred from CCs was much smaller than the optimal ratio (Table 1), meaning that MSTd was given too little weight. Consequently, the response distributions overlap more along the direction corresponding to the decoder inferred from neuronal CCs (black) than along the optimal direction in that subspace (green); the outputs are therefore less discriminable, and the decoding is suboptimal. VIP inactivation (v_c = 0) corresponds to decoding only from MSTd (blue). This happens to produce no deficit because the overlap of the response distributions is similar to that along the original decoder direction. On the other hand, inactivating MSTd (v_b = 0) corresponds to decoding only from VIP (red), where the two response distributions overlap more, leading to a larger threshold.
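The projection argument above can be made concrete in a few lines. The sketch below (all numbers hypothetical, not taken from the data) computes the discrimination threshold of a linear readout w within the two-dimensional subspace as θ = √(wᵀΣw)/|wᵀf′|, the standard deviation of the unbiased linear estimate, and confirms that reading out only the noisier area yields the worst threshold:

```python
import numpy as np

def threshold(w, Sigma, f_prime):
    """Discrimination threshold of a linear readout w:
    theta = sqrt(w' Sigma w) / |w' f'| (SD of the unbiased estimate)."""
    w = np.asarray(w, float)
    return np.sqrt(w @ Sigma @ w) / abs(w @ f_prime)

# Toy 2-D subspace spanned by the leading noise modes of MSTd and VIP
# (hypothetical values): VIP's mode (second coordinate) is noisier.
Sigma = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
f_prime = np.array([1.0, 1.0])            # signal along both modes

w_opt = np.linalg.solve(Sigma, f_prime)   # optimal weights ~ Sigma^(-1) f'
w_mstd = np.array([1.0, 0.0])             # VIP inactivated (v_c = 0)
w_vip = np.array([0.0, 1.0])              # MSTd inactivated (v_b = 0)

# The optimal direction gives the lowest threshold; reading out only the
# noisier area (VIP) gives the highest, mirroring Figure 5d.
assert threshold(w_opt, Sigma, f_prime) < threshold(w_mstd, Sigma, f_prime)
assert threshold(w_mstd, Sigma, f_prime) < threshold(w_vip, Sigma, f_prime)
```

Projecting the ellipses onto a readout direction and measuring overlap, as in Figure 5d, is exactly this computation: the projected variance is wᵀΣw and the projected signal separation is wᵀf′.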

Table 1. Model parameters and predicted changes in CCs following inactivation for the two covariance models, shown as median ± central quartile range. Columns: extensive information model† and limited information model; rows list the model parameters and the predicted multiplicative change in CCs following inactivation. (†Values correspond to the decoder inferred using a rank-two approximation of the covariance.)

It is important to keep in mind that decoding the noisiest two-dimensional subspace, which throws away all signal components in the remaining low-noise N-2 response dimensions, is a much more severe suboptimality than misweighting the two areas' signals within that restricted subspace, which loses less than half the information (Figure 5c). As illustrated in Figure 5e, the fraction of available information recovered by this decoder, the decoding efficiency η, drops precipitously with the number of neurons (η ≈ 2.5/N). Moreover, for this model, a steeper relationship between signal and noise correlations leads to greater CCs. This is because the model is only consistent with suboptimal decoding that fails to remove the strong noise correlations; these noise correlations are decoded to drive the choice, and thus correlate neurons not only with each other but also with the choice. Thus, in the extensive information model, high CCs are a consequence of decoding a restricted subspace of neural activity, a radically suboptimal strategy for the brain.
Behavioural predictions of this model were robust to assumptions about the exact size of the decoded subspace (Figure S8), but were found to depend on the magnitude of noise correlations between the VIP and MSTd populations. Since interareal correlations were not measured, we systematically varied their strength by changing W (Figure 5f) and used Equation 3.2 to infer weight scalings for each case, which we then used to generate behavioural predictions. Predictions for one example value of these correlations are shown in Figure S9. Behavioural predictions progressively worsened as the interareal noise correlations grew: for this model, even weak but nonzero interareal correlations imply that inactivating area VIP should improve behavioural performance (Figure 5g).
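To see why interareal correlations can make VIP inactivation beneficial under this model, one can evaluate the threshold expression of Methods M3 (Equation 5) for a decoder that overweights VIP. All numbers below are hypothetical, chosen only to illustrate the effect:

```python
import numpy as np

def pre_threshold(v_b, v_c, L_bb, L_cc, L_bc):
    """Pre-inactivation threshold (Methods M3, Equation 5): decoder
    variance within the two leading noise modes of MSTd (b) and VIP (c)."""
    return np.sqrt(v_b**2 * L_bb + v_c**2 * L_cc + 2.0 * v_b * v_c * L_bc)

# Hypothetical noise magnitudes: VIP's leading mode (L_cc) noisier than MSTd's.
L_bb, L_cc = 1.0, 2.0
v_b, v_c = 0.45, 0.55               # decoder overweighting VIP (v_b + v_c = 1)
theta_no_vip = np.sqrt(L_bb)        # VIP inactivated: only MSTd noise remains

for W in [0.0, 0.25, 0.5]:
    L_bc = W * np.sqrt(L_bb * L_cc)  # interareal correlation strength
    theta = pre_threshold(v_b, v_c, L_bb, L_cc, L_bc)
    effect = "improves" if theta > theta_no_vip else "does not improve"
    print(f"W={W:.2f}: pre={theta:.3f} vs VIP-inactivated={theta_no_vip:.3f} "
          f"-> inactivating VIP {effect} performance")
```

With these particular numbers the crossover occurs between W = 0.25 and W = 0.5; the exact crossover point depends on how strongly VIP is overweighted, which is why even weak correlations suffice under the decoder actually inferred from the CCs.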

Limited information model
In the presence of information-limiting correlations, choice correlations must be proportional to the ratio of behavioural to neuronal thresholds (Equation 2.3). This was indeed the case in both MSTd and VIP, as we showed in Figure 3. Those slopes correspond to the multipliers F_b and F_c for this model, and were found to differ between the two areas (Table 1). As we noted earlier, unlike the leading modes of noise in the extensive information model, the magnitudes of information-limiting correlations (L_bb, L_cc, and L_bc) are difficult to measure. Nevertheless, we can deduce them from behaviour, because behavioural precision is ultimately limited by these correlations. Briefly, using behavioural thresholds after inactivation of each area, along with F_b and F_c derived from choice correlations as additional constraints, we can simultaneously infer the magnitude of information-limiting correlations within each area (L_bb and L_cc), the correlated component of the noise (L_bc), and the weight scalings (v_b and v_c) (Methods M9). A model based on these inferred parameters correctly predicted that the behavioural threshold before inactivation would not differ significantly from the threshold following VIP inactivation (Figure 6a; see Figure S10 for the visual condition). This was because the scaling of weights in MSTd was much larger than in VIP according to this model (v_b ≫ v_c, Table 1), so inactivating VIP had little impact on the output of the decoder and left behaviour nearly unaffected. Unlike the decoder inferred for the extensive information model, the efficiency η of this decoder did not depend on the size of the population being decoded (Figure 6b; vestibular: η = 79±13%), because neurons in this model carry largely redundant information.
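The core of this inference can be sketched numerically. The full procedure (Methods M9) jointly fits the CC slopes F_b and F_c as well; the minimal version below uses only the post-inactivation thresholds and an assumed scaling ratio, with all threshold values hypothetical rather than experimental:

```python
import numpy as np

# Hypothetical thresholds (deg): pre-inactivation, after MSTd inactivation
# (-b), and after VIP inactivation (-c).
theta_pre, theta_minus_b, theta_minus_c = 1.0, 1.8, 1.05

# Post-inactivation thresholds directly give the information-limiting
# noise magnitudes within each area (Methods M3).
L_cc = theta_minus_b**2     # MSTd off: only VIP's limiting noise remains
L_bb = theta_minus_c**2     # VIP off: only MSTd's limiting noise remains

# With unbiased decoding (v_b + v_c = 1) and a scaling ratio v_b >> v_c,
# Equation (5) can then be solved for the correlated component L_bc.
v_b = 0.9                   # assumed scaling; the paper infers this jointly
v_c = 1.0 - v_b
L_bc = (theta_pre**2 - v_b**2 * L_bb - v_c**2 * L_cc) / (2.0 * v_b * v_c)

# A valid covariance requires |L_bc| <= sqrt(L_bb * L_cc).
assert 0.0 < L_bc < np.sqrt(L_bb * L_cc)
```

The positive L_bc recovered here plays the role described in the text: shared information-limiting noise lets VIP show large CCs even though the readout is dominated by MSTd.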
All analyses above were performed on neural data in the central 400 ms of the trials, following earlier work. However, our conclusions are robust to the specific time (Figure S11) and duration (Figure S12) of the analysis window. Additionally, although we extrapolated our data to larger populations by resampling from a set of about 100 neurons recorded from each area, our results are not attributable to the limited size of the recording (Figure S13). We also extended our model to account for the fact that the two brain areas may have been only partially inactivated by muscimol, and found that our conclusions hold under a wide range of partial inactivations (Supplementary note S8; Figure S14). Finally, we assumed that inactivation leaves responses in the non-inactivated area unaffected, as would be the case in a purely feedforward network model. While an exhaustive treatment of recurrent networks is beyond the scope of this work, we find that our conclusions can still hold when this assumption is compromised by recurrent connections between MSTd and VIP (Supplementary note S9; Figure S15).

Comparison of the two decoding strategies
We inferred decoding weights in the presence of two fundamentally different types of noise, corresponding to the extensive information model and the limited information model. Both decoders could account for the behavioural effects of selectively inactivating either MSTd or VIP, albeit with very different readout schemes. In the extensive information model, neurons in area VIP were weighted more heavily than optimal, and vice versa in the presence of information-limiting noise (Table 1, Figure 7a). Why do the two models have such different weightings? Both noise models have larger noise in VIP than in MSTd, but they differ in the correlations between the two areas. In the extensive information model, the interareal correlations must be nearly zero to be consistent with behavioural data (Figure 5g and Figure S9), and the neuronal weights in VIP must be high to account for the high CCs. In the limited information model, the significant interareal correlations explain the large CCs in VIP, even with a readout mostly confined to MSTd.
How could such fundamentally different strategies lead to the same behavioural consequences? For a given noise model, an optimal decoder achieves the lowest possible behavioural threshold by scaling the weights of neurons in the two areas according to a particular optimal ratio v_b/v_c. Ratios either smaller or larger than this optimum increase the behavioural threshold, producing a U-shaped performance curve. Under certain precise conditions, complete inactivation of one area can land the decoder at a point on the other side of the optimum with exactly the same threshold, leaving behavioural performance unchanged. This is the case for VIP according to the extensive information model (Figure 7b, top). On the other hand, if an area's weight is already too small to influence behaviour, then inactivating it may not appreciably change performance, as demonstrated by the limited information model (Figure 7b, bottom).
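The U-shaped curve, and the equal-threshold pairs it creates, can be illustrated numerically. Assuming hypothetical noise magnitudes, the sketch below locates the optimal ratio and exhibits a suboptimal scaling whose threshold exactly matches complete inactivation of one area:

```python
import numpy as np

def theta(v_b, L_bb=1.0, L_cc=2.0, L_bc=0.3):
    """Methods M3, Equation (5), with the unbiased-decoding constraint
    v_b + v_c = 1 (all noise magnitudes hypothetical)."""
    v_c = 1.0 - v_b
    return np.sqrt(v_b**2 * L_bb + v_c**2 * L_cc + 2.0 * v_b * v_c * L_bc)

v = np.linspace(0.0, 1.0, 10001)
v_best = v[np.argmin(theta(v))]   # bottom of the U-shaped performance curve

# Thresholds rise on both sides of the optimum...
assert theta(0.0) > theta(v_best) and theta(1.0) > theta(v_best)
# ...so a decoder that overweights VIP (v_b = 5/12, left of the optimum)
# has exactly the same threshold as complete VIP inactivation
# (v_b = 1, right of the optimum):
assert abs(theta(5.0 / 12.0) - theta(1.0)) < 1e-9
```

This is the geometry behind Figure 7b: two very different weightings, one on each flank of the U, are behaviourally indistinguishable under complete inactivation.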

Model predictions
According to the extensive information model, the brain loses almost all of its information by poorly weighting its available signals. Moreover, even beyond this poor overall decoding, the model brain gives VIP too much weight. As a consequence, this model makes a counterintuitive prediction: gradually inactivating VIP should improve behavioural performance. A hint of this might already be seen in Figure 1d and Figure S4b for the vestibular condition (both 0 and 12 h), although the difference was not statistically significant. Beyond a certain level of inactivation, as the weight decreases past the optimal scaling of the two areas, performance should worsen again (Figure 7c, top). According to the extensive information model, the brain just so happens to overweight VIP under normal conditions by about the same amount as it underweights VIP after inactivation. Suboptimal decoding in the limited information model has the opposite effect, giving too little weight to VIP while overweighting MSTd. However, according to this model, the information available in VIP is small: when MSTd is inactivated, behavioural thresholds are substantially worse (Figure 7c, bottom). Thus the suboptimality due to underweighting VIP is mild (efficiency around 80% in both visual and vestibular conditions, as described above), and the predicted improvement following partial MSTd inactivation is negligible, as gradual inactivation quickly shoots past the optimum. Graded inactivation of brain areas can be accomplished by varying the concentration of muscimol as well as the number of injections. In fact, we have previously reported that behavioural thresholds increase gradually depending on the extent of inactivation of area MSTd 22 . Unfortunately, those results do not distinguish the two models, as there is no qualitative difference between the model predictions for partial MSTd inactivation (Figure 7c, red).
Future experiments involving graded inactivation of VIP should be able to distinguish between the models due to the stark difference in their behavioural predictions.
The decoding strategies implied by the two models also have different consequences for how CCs should change during inactivation experiments (Methods M10). According to the extensive information model, VIP and MSTd are nearly independent and both are decoded, so inactivating either area must scale up neuronal CCs in the other area (Figure 7d, top). In the limited information model, inactivating either area produces no significant change in the other's CCs (Figure 7d, bottom). This effect has different origins for MSTd and VIP. Although inactivating MSTd confines the readout to VIP, it also eliminates the high-variance noise components that VIP shared with MSTd: these two effects approximately cancel, leaving CCs in VIP essentially unaffected. The results of VIP inactivation are simpler to understand: CCs in MSTd do not change much because VIP has little influence on behaviour to begin with.

DISCUSSION
Several recent experiments show that silencing brain areas with high decision-related activity does not necessarily affect decision-making [16][17][18][19] . To explain these puzzling results, we have developed a general, unified decoding framework to synthesize outcomes of experiments that measure decision-related activity in individual neurons and those that measure behavioural effects of inactivating entire brain areas. We know from the influential work of Haefner et al 14 how the behavioural impact (readout weights) of single neurons relates to their decision-related activity (choice correlations) in a standard feedforward network. We built on this theoretical foundation by adding three new elements that helped us relate the influence of multiple brain areas to both the magnitude of choice correlations, and the behavioural effects of inactivating those areas.
First, we have generalised their readout scheme to include multiple correlated brain areas by formulating the output of the decoder as a weighted sum of estimates derived from decoding responses of individual areas.
In this scheme, the weight scales of individual estimates can be readily identified as the scaling of neuronal weights in the corresponding areas, providing a way to quantify the relative contribution of different brain areas. Second, we postulated that readout weights are mostly confined to a low-dimensional subspace of neural response that carries the highest response covariance, in both the extensive and limited information models. This postulate was instrumental to developing a theory of decoding that focused on the relationship between the overall scales of choice-related activity and neuronal weights, in lieu of their fine structures. Besides its mathematical simplicity, the resulting coarse-grained formulation confers an important practical advantage in that we can apply it without precisely knowing the fine structure of response covariance. Third, we used a straightforward relation between behavioural threshold and the variance of the decoder to explicitly link the relative scaling of weights across areas to the behavioural effects of inactivating them.
Our theoretical result linking the behavioural influence of brain areas to their CCs and inactivation effects (Equations 3.1 and 3.2) is applicable only when neuronal weights within each area are mostly confined to the leading dimension of their response covariance. Although this requirement looks stringent, it is needed to explain the high CCs seen in experiments 15 . This claim might appear to be at odds with the fact that some earlier studies successfully predicted CCs that plateaued close to experimental levels using pooling models that did not explicitly enforce the above confinement 6,9 . However, a closer examination revealed that these studies used a scheme in which the decision was based on the average response of neuronal pools that were all uniformly correlated, a combination of model assumptions that in fact satisfies our requirement. Similar explanations apply to other simulation studies that used support-vector machines or alternative schemes that inadvertently restricted decoding weights to low-frequency modes of population response where shared variability was highest 12,30 . Thus our postulate is fully compatible with earlier work, and in fact points to a more general class of models that can be used to describe the magnitude of CCs in those data.
Recent experiments show that reversibly inactivating area VIP in macaque monkeys does not impair animals' heading perception, despite the fact that responses of VIP neurons are strongly predictive of perceptual decisions 18,21 . In contrast, inactivating MSTd does adversely affect behaviour even though MSTd neurons exhibit much weaker correlations with choice 22,23 . Assuming that both areas contribute to the decision, we used our framework to infer decoding strategies that could account for these experimental results. Surprisingly, the data were consistent with two different schemes, overweighting or underweighting of VIP, depending on whether information was extensive or limited. A major implication of the finding from the extensive information model is that if a causal test of function (e.g., inactivation) reveals no impairments, it does not disprove that a brain area contributes to a task. The limited information model, on the other hand, suggests that area VIP is indeed of very little use to heading perception. In spite of this difference, both models share a basic attribute, namely, that decoding is suboptimal (although to very different extents, as discussed in the next section). Therefore our analysis reveals that the observed discrepancy between decision-related activity and effects of inactivation is not peculiar, and is actually expected from systems that integrate information across brain areas in a suboptimal fashion. The nature of this suboptimality can be understood intuitively by drawing an analogy to cue combination. Imagine there are two cues, x and y, and you use a suboptimal strategy in which a larger weight is allocated to the less reliable cue y. If y is removed, thereby forcing you to rely completely on x, then your behavioural precision might not change very much, because the reduction in information from losing y is offset by the gain in information from x.
On the other hand, if you mostly ignored y to begin with, then once again you will be unaffected by its removal. Either "too much" or "too little" weighting of a brain area can lead to suboptimal performance, both in a way that leaves the behavioural threshold largely unaltered following complete inactivation of that area.

Decoding is suboptimal, but just how bad?
Although both models were suboptimal to some degree, the overwhelming distinction between them is the efficiency they imply for neural computation, where efficiency is the ratio of decoded information to available information. The efficiency of the limited information model is around 80%, independent of population size N. In contrast, the extensive information model encodes information that grows with N, while decoding is restricted to the least informative dimensions of neural responses. These decoders extract only a tiny fraction of the available information, resulting in an efficiency that falls inversely with N. For a modest-sized population of 1000 neurons, the efficiency is already less than 1%. Thus, the conventional model of correlated noise (with extensive information) is radically suboptimal, whereas the limited information model extracts an impressive fraction of what is possible, limited largely by noise.
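The contrast in efficiency scaling can be made explicit using the figures quoted above (η ≈ 2.5/N for the extensive information model, a constant ~80% for the limited information model):

```python
# Efficiency of the two decoders as quoted in the text (illustrative only).
eta_limited = 0.80
for N in [100, 1000, 10000]:
    eta_extensive = 2.5 / N
    print(f"N={N:>5}: extensive {eta_extensive:.2%} vs limited {eta_limited:.0%}")

# At a modest population of 1000 neurons, the extensive-model
# efficiency is already below 1%.
assert 2.5 / 1000 < 0.01
```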
It has previously been argued that the key factor that limits behavioural performance in complex tasks is suboptimal processing, not noise 38 . However, in simple tasks involving binary choices, and in areas in which most of the available information can be linearly decoded, it is unclear why the behaviour of highly trained animals should be so severely undermined by suboptimality. Moreover, radical suboptimality of the kind described here for the extensive information model implies tremendous potential for learning, as the neural circuits can continually optimize the computation by tuning the readout to more informative dimensions. This is hard to reconcile with the observation that behavioural thresholds in a variety of perceptual tasks typically saturate within a few weeks of training in both humans and monkeys 29,[39][40][41] . In the presence of information-limiting noise, however, learning can only do so much, and performance must saturate at or below the ideal performance. Therefore we regard the limited information model as a much more likely explanation of our data, for otherwise one would need to posit that cortical computations discard the vast majority of available information. Note that suboptimal cortical computation might still account for information loss in the limited information model, as opposed to neural noise 38 , but this information loss is now much more modest, probably around 20%.
A direct way to tell the two models apart would be to measure the structure of noise correlations. Unfortunately, this is not straightforward, because the differences between noise models giving extensive or limited information can be quite subtle 20 . In fact, there can be a whole spectrum of subtly different noise models with different information contents, lying between the two models that we have considered here. Therefore, a more accurate technique to determine the information content (which, after all, is a major reason why we care about noise correlations) is simply to record from hundreds of neurons simultaneously, and then decode the stimulus. This will provide a lower bound on the information available in the neural population. One can then compare the resultant population thresholds with the behavioural threshold to determine how suboptimal the decoding needs to be to account for behaviour. Eventually, we expect this strategy will be successful, but it will require advances in recording technology to be viable in the target brain areas. Meanwhile, by examining the key properties of the decoding strategy implied by the two models, we identified distinct predictions that are testable without large-scale simultaneous recordings. Specifically, they involve fairly simple experiments such as graded inactivation of VIP, and measurement of CCs in either VIP or MSTd while the other area is inactivated (Figure 7). Future experiments will test each of these predictions to provide novel evidence about the information content and decoding strategy used by the brain.
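The proposed strategy of recording many neurons simultaneously and decoding the stimulus can be sketched as a cross-validated linear decoder. Everything below (data shapes, the simulated noise model) is hypothetical; it only illustrates how a held-out estimate lower-bounds the population information:

```python
import numpy as np

def population_threshold(R, s):
    """Cross-validated linear decoding of stimulus s from simultaneously
    recorded responses R (trials x neurons). The SD of the held-out
    estimate upper-bounds the population threshold, and hence
    lower-bounds the linearly decodable information."""
    n = len(s)
    half = n // 2
    # Fit a least-squares linear readout (with intercept) on one half...
    A = np.column_stack([R[:half], np.ones(half)])
    w, *_ = np.linalg.lstsq(A, s[:half], rcond=None)
    # ...and evaluate it on the held-out half.
    s_hat = np.column_stack([R[half:], np.ones(n - half)]) @ w
    return np.std(s_hat - s[half:])
```

Comparing this population threshold with the behavioural threshold then indicates how suboptimal the brain's decoding must be to account for behaviour, as described in the text.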

Limitations of the framework and possible extensions
Similar efforts to deal with outcomes of correlational and causal studies using a coherent framework are rarely undertaken, despite their significance. To our knowledge, there is only one instance where this has been attempted before 42 . In that work, the authors used a recurrent network model with mutual inhibition between populations 43,44 to reconcile choice-related activity and the effect of silencing neurons. Although their study was similar to ours in spirit, their goal was different. They showed that inactivation just before a decision, when activity was highly correlated with the choice, had less impact on the behaviour than inactivation near the stimulus onset. This addresses a temporal, as opposed to a spatial, dissociation between correlation and causation, so a model with recurrent connectivity was essential to explain their findings. In contrast, we wanted to account for the discrepancies between measures of correlation and causation across brain areas. This latter phenomenon is entirely within the realm of standard feedforward network models in which both populations causally contribute, rather than compete to drive behaviour, and differ only in terms of the relative strength of their contributions.
Time-varying weights have been shown to better predict animals' choices in certain tasks 45 , and psychophysical kernels are sometimes skewed towards one end of the trial 46,47 , suggesting that decoding could also be suboptimal in time. Such temporal weighting of information would naturally arise from recurrent connectivity, which is beyond the scope of this work. But it can also originate in feedforward networks, possibly through a gating mechanism that blocks the integration of neural responses beyond a certain time 32 . Other studies have considered that choice-related activity might arise from decision feedback 46,48,49 . Indeed, pure decision feedback to an area would create apparent sensitivity to sensory signals, even in the absence of direct feedforward input to the target neurons 46,48,49 . In such a case, neural sensitivity to the stimulus would be precisely equal to the animal's sensitivity. In the absence of other sources of variability, response fluctuations would be perfectly correlated with fluctuations in the fed-back choice, producing choice correlations of 1. Of course, there would be additional variability in the neural responses, and this would dilute both the choice correlations and the neural tuning by equal amounts, giving rise to measured CCs that should match the optimal CCs (Equation 2.1). Even if there are other feedforward sensory components to the neural responses, direct decision feedback will pull the choice correlations toward this optimal prediction. Thus, simple decision feedback cannot account for the pattern of CCs observed in our VIP data, which are two to three times larger than predicted from optimal inference or direct decision feedback (Figure 3). Conversely, as we demonstrated through supplementary modeling, adding feedback or recurrent connections may not affect the suboptimal readout weights inferred using our scheme, even when those connections modulate responses along the decoded dimensions (Figure S15).
Nevertheless, future expansions of our work should account for more general recurrent connectivity to study how neural circuits simultaneously integrate information across space and time. In particular, recurrent networks also include decision feedback as a special case, and might help test alternative theories on the origins of choice correlations 1,46 .
Finally, while VIP inactivation did not impair heading discrimination, MSTd inactivation partially impaired the animal's ability to perform the task. The fact that MSTd inactivation did not completely abolish performance cannot be accounted for by our two-population models unless the inactivation was only partial and/or VIP is read out to some degree. Additionally, we cannot exclude the possibility that VIP is merely correlated with behaviour and that a third brain area besides MSTd contributes some task-relevant information. In fact, both of our models actually predict a somewhat bigger deficit following MSTd inactivation (Figure 5c, 6a) than is observed experimentally (Figure 1b). This highlights the importance of ultimately extending coding models to include more than two brain areas.
As neuroscience moves towards 'big data', there is a greater need for theoretical frameworks that can help discern simple rules from complex multi-neuronal activity 50 . We believe our work responds to this challenge and, despite its limitations, takes us closer to bridging the brain-behaviour gap for binary-decision tasks.

METHODS
If decoding is optimal, then the multipliers F_i ≡ 1, so the choice correlation of neuron k becomes C_k,opt = θ/θ_k (Supplementary note S0). Unbiased decoding further yields the constraint that v_x + v_y = 1 at all times.

M3. Relation between behavioural threshold and weight scaling factors.
Behavioural threshold θ is proportional to the square root of the decoder variance (with proportionality constant 1 for the 68%-correct threshold), so θ² = wᵀΣw. If decoding is confined to the subspace of leading eigenmodes of Σ spanned by neurons within areas x and y (g_x and g_y), then w_x ∝ v_x g_x and w_y ∝ v_y g_y, where the constants of proportionality are chosen to ensure unbiased decoding. In this case, the behavioural threshold can be expressed purely in terms of the weight scaling factors and the variance originating from noise within the leading noise modes as (Supplementary note S4):

θ² = v_x² L_xx + v_y² L_yy + 2 v_x v_y L_xy    (5)

where L_xx and L_yy are the magnitudes of noise within x and y, and L_xy is the magnitude of the correlated noise. Thresholds following inactivation can be determined by setting the weight scaling factor for the inactivated area to zero, yielding θ_-x² = L_yy and θ_-y² = L_xx.

M4. Subjects and Behavioural Task.
Seven adult rhesus monkeys (A, B, C, J, S, U, and X) took part in various aspects of the experiments. Three animals were employed in each of the MSTd (C, J, and S) and VIP (X, B, and J) inactivation experiments. Two animals provided the neural data from each brain area (A and C for MSTd; C and U for VIP). All surgical and experimental procedures were approved by the Institutional Animal Care and Use Committees at Washington University and Baylor College of Medicine, and were performed in accordance with institutional and NIH guidelines. All animals were trained to perform a heading discrimination task around psychophysical threshold. In each trial, the subject experienced a real or simulated forward motion with a small leftward or rightward component (angle s, Figure 1a). Subjects were required to maintain fixation within a 2˚ × 2˚ electronic window around a head-fixed visual target located at the center of the display screen. At the end of each 2-s trial, the fixation spot disappeared, two choice targets appeared, and the subject made a saccade to one of the targets to report its perceived heading relative to straight ahead. Nine logarithmically spaced heading angles were tested (0˚, ±0.5˚, ±1.3˚, ±3.5˚, and ±9˚ for monkeys A and J; 0˚, ±1˚, ±2.5˚, ±6.4˚, and ±16˚ for monkeys B, C, S, and U), including the ambiguous case of straight-ahead motion (s = 0˚). These values were chosen to obtain near-maximal psychophysical performance while allowing neuronal sensitivity to be estimated reliably for most neurons 21,23 . Subjects received a juice reward for indicating the correct choice. For trials in which the ambiguous heading was presented, rewards were delivered randomly on half of the trials. The experiment consisted of three randomly interleaved stimulus conditions (vestibular, visual, and combined). In the vestibular condition, the monkey was translated by a motion platform while fixating a head-fixed target on a blank screen.
In the visual condition, the motion platform remained stationary while optic flow simulated the same range of headings. Under the combined condition, both inertial motion and optic flow were provided. Each of the 27 unique stimulus conditions (9 heading directions × 3 cue conditions) was repeated at least 20 times, for a total of 540 discrimination trials per recording session. Identical stimuli and trial structure were employed during both neural recordings and inactivation experiments.

M5. Neural recordings.
Activity of single neurons in areas MSTd and VIP was recorded extracellularly using epoxy-coated tungsten microelectrodes (impedance of 1-2 MΩ). Area MSTd was located using a combination of magnetic resonance imaging (MRI) scans, stereotaxic coordinates (~15 mm lateral and ~3-6 mm posterior to AP-0), white/gray matter transitions, and physiological response properties. In some penetrations, electrodes were further advanced into the retinotopically organized area MT 23 . Most recordings concentrated on the posterior/medial portions of MSTd, corresponding to more eccentric, lower-hemifield receptive fields in the underlying area MT. To localize area VIP, we first identified the medial tip of the intraparietal sulcus and then moved laterally until there was no longer a directionally selective visual response in the multiunit activity, as described in detail previously 21 .

M6. Estimation of Behavioural and Neuronal thresholds.
Behavioural performance was quantified by plotting the proportion of 'rightward' choices as a function of heading (the azimuth angle of translation relative to straight ahead). Psychometric data were fit with a cumulative Gaussian function with mean μ and standard deviation σ, and this standard deviation defined the psychophysical threshold θ = σ, corresponding to 68% correct performance (d′ = 1, assuming no bias, i.e. μ = 0°).
For the analysis of neuronal responses, we used the linear Fisher information, which is simply a measure of the signal-to-noise ratio: signal power divided by noise power. The linear Fisher information captures all of the Fisher information in responses generated from the exponential family with linear sufficient statistics. Its inverse is exactly equal to the variance of an unbiased, locally optimal linear estimator (for differentiable tuning curves and nonsingular noise covariance). We defined the square root of this variance (i.e. the standard deviation of the estimator) to be the neuronal discrimination threshold, which corresponds to 68% accuracy in binary discrimination. This threshold can be obtained directly from the neuron's tuning curve and noise variance as follows:

θ_k = 1/√J_k,   J_k = f′_k(0)² / σ_k²,   (6)

where θ_k and J_k are the threshold and linear Fisher information 51 of neuron k, f′_k(0) is the derivative of the neuron's tuning curve at the reference stimulus (0°), and σ_k² is the variance of the neuronal response to that stimulus. Neuronal thresholds computed using the above definition were very similar to those computed using a traditional approach based on neurometric functions constructed from the responses of the recorded neuron and a presumed 'antineuron' with opposite tuning 52 (Supplementary Figure 3).
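As a concrete illustration, the threshold computation above can be sketched in a few lines. The sigmoidal tuning curve and the Poisson-like variance assumption (variance equal to the mean rate) below are hypothetical stand-ins, not the recorded data:

```python
import numpy as np

# Sketch of the neuronal-threshold computation (Methods M6, Equation 6).
# Hypothetical sigmoidal heading tuning; Poisson-like variance assumed.
def neuronal_threshold(f, s0=0.0, ds=1e-3):
    """theta_k = 1/sqrt(J_k), with J_k = f'(s0)^2 / var(s0)."""
    fprime = (f(s0 + ds) - f(s0 - ds)) / (2 * ds)  # tuning-curve slope at reference
    var = f(s0)                                    # variance = mean rate (assumption)
    J = fprime ** 2 / var                          # linear Fisher information
    return 1.0 / np.sqrt(J)

f = lambda s: 20 + 15 * np.tanh(s / 10.0)          # rate in spikes per trial
theta_k = neuronal_threshold(f)                    # threshold in degrees
```

With these illustrative numbers, f′(0) = 1.5 and σ² = 20, so J_k = 0.1125 and θ_k ≈ 3°.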

M7. Estimation of Choice correlation.
To quantify the relationship between neural responses and the monkey's perceptual decisions, we first computed choice probabilities (CPs) using ROC analysis 53 . For each heading, neural responses were sorted into two groups based on the choice that the animal made at the end of each trial. In previous studies, the two choice groups were typically defined by the preferred and non-preferred stimuli of a given neuron 21,23 . In this study, in order to compare different neurons in a population code, the two choice groups were simply rightward and leftward choices; hence, CPs may be greater than or less than 1/2. ROC values were calculated from these response distributions, yielding a CP for each heading, as long as the monkey made at least 3 choices in favor of each direction. To combine across different headings, we computed a grand CP for each neuron by balanced z-scoring of responses in different conditions, which combines z-scored response distributions in an unbiased manner across conditions, and then performed ROC analysis on that combined distribution 54 . The CPs were then converted to choice correlations according to C_k ≈ (π/√2)(CP_k − 1/2) (refs. 14,15), where CP_k and C_k are the choice probability and choice correlation of neuron k respectively (Supplementary note S0). Due to the convention we chose for computing CPs, the resulting choice correlation could be positive or negative, depending on whether a neuron predicted rightward choices by increasing or decreasing its response relative to the reference stimulus. For an optimal decoder, the sign of a neuron's choice correlation should match the sign of the derivative of its tuning curve, so we modified the definition of ref. 15 (Equation 2.1) to accommodate our sign convention, yielding C_k,opt = sgn(f′_k(0)) θ/θ_k, where sgn denotes the signum function.
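The CP-to-choice-correlation pipeline described above can be sketched as follows. The choice-sorted responses are illustrative; a full analysis would first pool headings by balanced z-scoring (ref. 54) before the ROC step:

```python
import numpy as np

# Sketch of the CP -> choice-correlation conversion (Methods M7).
def choice_probability(r_right, r_left):
    """ROC area: P(response on a rightward-choice trial exceeds one on a
    leftward-choice trial), with ties counted as 1/2."""
    r_right, r_left = np.asarray(r_right, float), np.asarray(r_left, float)
    gt = (r_right[:, None] > r_left[None, :]).mean()
    eq = (r_right[:, None] == r_left[None, :]).mean()
    return gt + 0.5 * eq

def choice_correlation(cp):
    """Linear approximation C_k ~ (pi/sqrt(2)) * (CP_k - 1/2); Supplementary S0."""
    return (np.pi / np.sqrt(2)) * (cp - 0.5)

cp = choice_probability([12, 15, 11, 17], [10, 9, 13, 8])  # illustrative spike counts
cc = choice_correlation(cp)  # positive: neuron fires more before rightward choices
```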
There were neurons in both MSTd and VIP whose choice-related activity during the visual condition is anticorrelated with their signal-related activity 21,23 . Further analysis showed that the heading preferences of these neurons during the visual and vestibular conditions differed. Therefore, the analysis of data collected during the visual condition presented in the Supplementary notes included only the subset of recorded neurons that had similar heading preferences as in the vestibular condition 23 .

M8. Noise covariance of extensive information model.
The mean correlation matrix was modelled as a linear function of the signal correlations:

R̄_ij = δ_ij + (1 − δ_ij) m r^signal_ij,   (7.1)

where δ_ij is the Kronecker delta function (δ_ij is 1 when i = j, and 0 otherwise) and m is the slope of the relationship between signal correlations and noise correlations. This slope was much steeper in VIP than in MSTd 21 . For the vestibular condition, slopes were found to be m_M = 0.19±0.08 and m_V = 0.70±0.16 within MSTd and VIP respectively, and for the visual condition they were m_M = 0.12±0.09 and m_V = 0.50±0.14. The above fits determined the average relationship between noise and signal correlations, but there was considerable diversity around this trend. To emulate this diversity, we used a technique similar to the one proposed in ref. 31 . Specifically, we sampled correlation matrices R from a Wishart distribution with mean matrix R̄ given by Equation 7.1 and the fitted slope m, and rescaled them to ensure R_ii = 1. The number of degrees of freedom of the Wishart distribution was adjusted so that sampled matrices had the same uncertainty in slope m as the data when subjected to the same fitting procedure. Covariance matrices were generated by scaling the correlation coefficients by the standard deviations of each neuron. Model variances were set equal to the mean responses, so the standard deviation of neuron i is f_i^1/2. Thus the covariance Σ is related to the correlation coefficients R by Σ_ij = R_ij f_i^1/2 f_j^1/2.
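A minimal sketch of this sampling procedure (Methods M8) follows. The slope m, the Wishart degrees of freedom, and the mean rates are illustrative placeholders, not the fitted values reported above:

```python
import numpy as np

# Sketch: sample a correlation matrix around a signal-correlation trend,
# rescale to unit diagonal, then scale to a covariance. Illustrative only.
rng = np.random.default_rng(0)
N, dof, m = 8, 50, 0.2

# symmetric "signal correlations" standing in for measured tuning similarity
r_signal = rng.uniform(-1, 1, (N, N))
r_signal = (r_signal + r_signal.T) / 2

# Equation 7.1: mean correlation matrix from the signal-correlation trend
R_mean = m * r_signal * (1 - np.eye(N)) + np.eye(N)
w, V = np.linalg.eigh(R_mean)                     # enforce positive definiteness
R_mean = V @ np.diag(np.clip(w, 1e-3, None)) @ V.T

# Wishart sample with mean R_mean (dof controls the spread around the trend)
L = np.linalg.cholesky(R_mean / dof)
X = L @ rng.standard_normal((N, dof))
W = X @ X.T
d = np.sqrt(np.diag(W))
R = W / np.outer(d, d)                            # rescale so R_ii = 1

rates = rng.uniform(10, 40, N)                    # mean responses f_i
Sigma = R * np.outer(np.sqrt(rates), np.sqrt(rates))  # Sigma_ij = R_ij f_i^1/2 f_j^1/2
```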
Correlations between responses of MSTd and VIP neurons were not measured experimentally, so the slope m_MV of any linear trend relating noise and signal correlations between the two areas was not known. We explored different possibilities by varying m_MV in proportion to the within-area slope, m_MV = w m_MM, where w ∈ [0,1). Each value of w produced correlation between the areas with magnitude λ_MV, which can be expressed as λ_MV = w λ_MM.

M9. Noise covariance of limited information model.
If the information reaching MSTd (M) and VIP (V) is not perfectly redundant across the populations, then the resulting covariance matrix will be of the form:

Σ_LI = Σ + [ ε_MM f′_M f′_Mᵀ   ε_MV f′_M f′_Vᵀ ; ε_MV f′_V f′_Mᵀ   ε_VV f′_V f′_Vᵀ ],

where f′_M and f′_V are the derivatives of the tuning curves of the neurons in M and V respectively, and Σ is the noise covariance used in the extensive information model. Whereas f′_M and f′_V can be estimated by measuring the tuning curves of individual neurons, precisely estimating ε_MM, ε_VV, and ε_MV is difficult even with large-scale recordings, as their magnitudes may be very small compared to the magnitude of the noise in Σ. Nevertheless, we know that for large populations, the behavioural threshold will be dominated by the magnitude of information-limiting correlations. Specifically, they are related through the relative scaling of decoding weights in equation 5, where M and V take the places of x and y. Consequently, we can determine ε_MM and ε_VV from the behavioural thresholds following inactivation using ε_MM = θ²_¬V and ε_VV = θ²_¬M. We can then use equation 5 in conjunction with equation 3.2 to determine both the ratio a_M/a_V of the weight scalings and the magnitude of correlation between populations, ε_MV = w ε_MM, where α_x and α_y are the multipliers that relate the observed and optimal patterns of neuronal choice correlations in areas x and y. This relation implies that choice correlations in the active area will increase by a factor proportional to the behavioural effect of inactivating the other area. Intuitively, this is because inactivating an area that was very important for behaviour will dramatically increase the burden on the active area, leading to an increase in the magnitude of choice-related activity.
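The bookkeeping in this section can be illustrated numerically. The post-inactivation thresholds below are hypothetical, and the combined-threshold formula is the large-population limit from Supplementary note S3:

```python
import numpy as np

# Sketch of the limited-information bookkeeping (Methods M9): inactivation
# thresholds pin down the within-area terms, and the unknown cross-area term
# is scanned. Threshold values (degrees) are hypothetical.
theta_noV, theta_noM = 1.5, 1.2          # thresholds after inactivating V or M
eps_MM, eps_VV = theta_noV**2, theta_noM**2

def combined_threshold(eps_MV):
    """Optimal threshold when both areas are read out (large-population limit)."""
    var = (eps_MM * eps_VV - eps_MV**2) / (eps_MM + eps_VV - 2 * eps_MV)
    return var ** 0.5

# scan the unknown cross-area magnitude eps_MV = w * eps_MM
thresholds = {w: combined_threshold(w * eps_MM) for w in (0.0, 0.3, 0.6)}
```

As the cross-area term grows, the two areas become more redundant and the intact threshold creeps up toward the better single-area threshold.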

Figure S1. (B)
Neurons were linearly decoded by confining the readout weights to the leading J eigenmodes of the covariance. Weights were always chosen to be optimal within the decoded subspace, and J was varied from 1 to K, where K = 512 denotes the population size. The root-mean-square choice correlation C_rms over all neurons decreases with J: for this model population, it drops by an order of magnitude already at J = 2. Inset shows C_k for each neuron in two example cases. (C) Choice correlations tend to decrease with population size when all modes are decoded optimally (gray: J = K), but remain insensitive to population size when only the leading mode is decoded (black: J = 1). Figure S2. Inactivation effects may not reflect the relative influence of brain areas on behaviour. Consider two populations x and y with relative scalings of neuronal weights a_x and a_y. These scalings depend not only on the post-inactivation thresholds (θ_¬x and θ_¬y) but also on the magnitudes of their choice correlations (c_x and c_y), according to Equation 3. The two panels illustrate the relative choice correlation magnitudes (c_x/c_y, colour) for uncorrelated populations (Equation 3.1) and correlated populations (Equation 3.2), as a function of the scaling ratio a_x/a_y and the inactivation ratio θ²_¬x/θ²_¬y. For simplicity, here we assume that c_y = 1, so c_x/c_y = 1 corresponds to optimal decoding. (A) For systems in which the two populations are uncorrelated (ε_xy = 0), the scaling ratio a_x/a_y is directly proportional to the inactivation ratio θ²_¬x/θ²_¬y.
Nonetheless, the slope of this relationship depends on the ratio of choice correlation magnitudes c_x/c_y (isochromatic contours), so a population with a larger weight could produce a smaller deficit upon inactivation, or vice versa (black asterisks). Inactivation effects exactly match the ratio of scalings (e.g. black open circle on the main diagonal) only if decoding is optimal (black dashed line). (B) When the populations are correlated, the scaling ratio is no longer proportional to the inactivation ratio. Instead, their relationship is nonlinear (black dashed line), and the two ratios may not match even if decoding happens to be optimal (e.g. black open circle). In other words, the change in behavioural threshold need not match how strongly each area is decoded.
Here the cross-population correlation ε_xy is √(ε_xx ε_yy)/2 for illustration. Figure S3. Direct and conventional methods yield similar neuronal thresholds. Each neuron's threshold was estimated in two ways: directly, as the inverse square root of its Fisher information at s = 0 (Methods M6, Equation 6), or using a traditional approach based on a neurometric function. The latter approach used ROC analysis to compute the ability of an ideal observer to discriminate between two oppositely directed headings (e.g., -6.4° vs. +6.4°) based solely on the firing rates of the recorded neuron and a presumed 'antineuron' with opposite tuning 1 . ROC values were plotted as a function of heading, resulting in neurometric functions that were fit with a cumulative Gaussian function. The neuronal threshold was then defined as the standard deviation of the fitted Gaussian, increased by a factor of √2 to adjust for the extra information contributed by the antineuron. This √2 adjustment arises because a decision based on a neuron-antineuron pair has twice the signal amplitude but also twice the noise variance compared to a single neuron judged against a fixed, noiseless 0° reference. Note that this factor of √2 differs from past

Figure S4. (A) Choice correlations of VIP neurons.
Neural recordings were carried out in a separate monkey X prior to inactivation of area VIP, while he performed a heading discrimination task whose structure was identical to that described in the Methods in all regards, except that each trial lasted only 1 s instead of 2 s. Similar to those in monkeys C and U, neuronal choice correlations in area VIP are proportional to, but greater than, those expected from optimal decoding of these neurons during both vestibular (top) and visual (bottom) heading discrimination tasks. The 95% CIs of the slopes α_V were found to be [1.9 2.9] and [1.

The relationship between signal and noise correlation was fit to a linear model (Methods M8, Equation 7.1) separately for each area, represented here using straight lines. Shaded areas correspond to 95% confidence intervals of the resulting fits. To assess the specific contribution of the leading mode from MSTd and VIP, we considered three different cases: optimal decoding of responses along all available modes, a decoder confined to the leading eigenmode in each area, and a spectrum of decoders in between these two extremes. We decoded MSTd and VIP responses separately in all cases, using the covariance Σ specified by the extensive information model (Figure 4a). To control n, we simply manipulated the coefficients v_i of the optimal decoder, first by setting the leading coefficient v_1 to n and then rescaling all the remaining coefficients together to preserve the decoder's overall normalization. Weights obtained by this procedure resemble the optimal weight pattern except for differences arising from the leading mode. We then systematically varied n from 1/K to 1, where the number of neurons K was fixed at 1024 in this simulation. Choice correlations increase slowly with n, and reach half-max at about n = 0.25 (dashed vertical line). (F) The influence of the leading mode on noise in the output increases much more rapidly with n than choice correlations do.
For each value of n, we computed the fraction q of total noise variance that comes from the leading mode as q = v_1²λ_1 / Σ_i v_i²λ_i, where λ_i denotes the eigenvalue of the i-th mode. At n = 0.25, more than 95% of the noise propagated to the output is inherited exclusively from this mode (dashed vertical line).

C_V,opt and C_M,opt correspond to optimal choice correlations in VIP and MSTd, respectively. (B) Performance (threshold) of a decoder with weights inferred from the subspace of the two leading principal components of the noise covariance. The black and green lines indicate the performance of the inferred and optimal decoders within this subspace. Inactivating VIP is correctly predicted to have no effect on behavioural performance (blue), while MSTd inactivation increases the threshold (red). Shaded region indicates ±1 SEM. Figure S8. Effect of the dimensionality of the decoded subspace on the performance of the decoder inferred from choice correlations using the extensive information model. Since decoding performance was nearly saturated at 256 neurons (Figure 5c), we fixed the size of the neural population at K = 256, and examined the behavioural threshold while varying the dimensionality of the decoded subspace. Decoding weights were inferred in the subspace spanned by a total of J eigenvectors of the covariance matrix, using J/2 eigenvectors in each of MSTd and VIP. The decoder continued to correctly predict the qualitative effects of inactivating MSTd and VIP beyond the 2-dimensional subspace considered in Figure 5, up until about J = 22 (vertical dashed line). Note that the threshold predicted by the optimal decoder within the restricted subspace (green) improves as more (informative) dimensions are included, while that of the inferred decoder worsens. Therefore the inferred readout weights extract more noise than signal from these additional dimensions.
This makes sense: if the weights were instead tuned to decrease the variance of the estimate as more dimensions are added, they would no longer explain the large measured choice correlations.
One reason why the experimental predictions of this model break down for large J is that the predictions are only reliable in the regime of small J, where the effect of measurement noise is low. This is because the reliability of the inferred decoding weights (and consequently also of their predictions) is inversely related to the eigenvalue of the decoded mode, so the reliability of the predictions worsens as J increases (Supplementary note S8). Figure S9. Effect of interareal correlations on the decoder inferred from choice correlations using the extensive information model. Left: A representative covariance matrix when neurons in MSTd and VIP are mildly correlated through the leading noise modes (λ_MV ≈ 0.2 √(λ_MM λ_VV)). Right: In contrast to the observed effects of inactivation, the decoder inferred using the covariance on the left incorrectly predicted that inactivating VIP should reduce the behavioural threshold. This was unlike the decoder shown in Figure 5c, which correctly predicted the effects of VIP inactivation when correlations between the two areas were zero on average.

Neuronal thresholds (A) and choice correlations (B) were computed for each neuron across the duration of the trial using a 250-ms moving window and averaged across neurons. Note that these readouts predict the choice based only on one time window per data point, and do not perform a weighted sum of responses in multiple windows. Neuronal thresholds in the two brain areas were comparable at all times, yet the choice correlations (CCs) differed between VIP and MSTd in a consistent manner over time. Although CCs in both areas peaked around the middle of the trial, those in VIP were proportionally larger at almost all times. (C) Consequently, the slopes α = C_k/C_k,opt that related observed and optimal choice correlations were generally greater in area VIP than in MSTd.
(D) The readout weights inferred using the two models remain largely constant throughout the trial, and are qualitatively consistent with the conclusions drawn from our analyses presented in the main text: the extensive information model implies that area MSTd is underweighted, whereas the limited information model predicts the opposite. Symbols a_M and a_V denote the scalings of the readout weights of areas MSTd and VIP respectively. Figure S12. Regression slopes are minimally affected by the length of the analysis window. Both the observed neuronal choice correlations and those implied by optimal decoding of the MSTd and VIP populations increased similarly with the length of the analysis window (not shown). This leaves the regression slopes α = C_k/C_k,opt largely invariant with window length for both VIP (red) and MSTd (blue). Error bars denote ±1 standard deviation. Figure S13. Threshold saturation effects are not influenced by the size of the dataset. In the main text, we presented thresholds predicted by decoders inferred using the extensive information (EI) (Figure 5c) and limited information (LI) (Figure 6b) models. These thresholds were generated by extrapolating a limited dataset containing 129 and 88 neurons from MSTd and VIP respectively. However, those thresholds approached saturation only at around 60-70 neurons, raising the possibility that the results might be sensitive to the exact number of neurons used for extrapolation. To test whether this was the case, we repeated all our analyses considering only a fraction of the recorded neurons for extrapolation. Thresholds were found to asymptote to nearly the same value obtained by extrapolating the full dataset (compare with Figure 5c). Right: We repeated this procedure for different percentages (10%-100%) and found that our results can be reproduced with as little as 30% of the dataset.
The asymptotic thresholds (evaluated at a population size of K = 1024 neurons) do not change much beyond this point (shaded region).
(B) Thresholds implied by the LI model obtained by extrapolating 50% of the dataset. Once again, these were similar to the results obtained using the full dataset (Figure 6b). Figure S14. Inferred readout strategy is robust to the degree of inactivation. We extended our model to include two additional parameters, φ_x and φ_y, that denote the fractions of neurons inactivated in populations x and y, and derived theoretical results that account for partial inactivation of the two populations (Supplementary Notes S8). We used those results to model partial inactivation of MSTd and VIP in our dataset, and computed parameter ranges in the (φ_M, φ_V) parameter space (shaded areas) that are consistent with 95% confidence intervals around the experimental data.
(A) Extensive information model. Since an empirical trend between neural tuning and noise covariance was used to determine the structure of noise correlations, the readout weights could be uniquely determined from the observed pattern of choice correlations (CCs) independent of the extent of inactivation. Therefore the inferred readout weights remained the same as for the model that assumed complete inactivation (inferred MSTd weight scaling a_M = 0.44; optimal MSTd weight scaling a_M = 0.74). Nonetheless, the predictions for behavioural thresholds following inactivation of MSTd or VIP (shown in Figure 5c) are quantitatively consistent with the experimental observations (Fig. 1b) only for a specific range of inactivation fractions (grey region). Specifically, the inferred readout weights predict that the threshold should increase by a factor of 1.6 if MSTd were fully removed, yet the observed increase was only 1.2±0.1. This suggests that MSTd could neither have been completely inactivated nor have remained completely intact, leading to the exclusion of the regions close to the left and right boundaries. For the EI model, therefore, partial inactivation of MSTd was a better match to the behavioural data. Similarly, inactivating about half of VIP is predicted to significantly reduce the threshold (Fig. 7c, top panel). Since this was not observed experimentally, the inactivation parameters within the central horizontal band around 0.5 are excluded from the grey region that is consistent with the data. Even with partial inactivation, therefore, the extensive information model implies that the brain underweights MSTd compared to optimal, just as reported in the main text where we assumed complete inactivation.
(B) Limited information model. Noise correlations in the limited information model, unlike the extensive information model, were not known a priori, but were instead fit to explicitly account for the behavioural effects of inactivation. Consequently, both the readout weights and the inactivation fractions are jointly constrained by the behavioural thresholds observed after inactivating these brain areas. Thus the set of inactivation fractions consistent with the data co-varied with the readout weights. Shaded regions represent the fractions of cortex inactivated in MSTd and VIP that were consistent with the observed post-inactivation behavioural thresholds (within 95% confidence intervals), assuming three different values of the scaling of MSTd readout weights (a_M = 0.95, 0.85, and 0.75, shown in red, green, and blue). The solution space consistent with our data (shaded areas) contracted as the scaling of MSTd weights decreased, with no solutions for a_M < 0.74. In contrast to the extensive information model, the limited information model attributes the experimental results to overweighting MSTd compared to optimal decoding in all cases (optimal decoding would have a_M within the intervals [0.87 0.93], [0.75 0.81], and [0.6 0.64] respectively, again to remain consistent with 95% confidence intervals of behavioural thresholds), just as we reported in the main text assuming complete inactivation. Thus the qualitative behaviour of the limited information model was robust to incomplete inactivation by muscimol. Figure S15. Recurrent neural network. We extended our model to incorporate recurrent connections and derived theoretical results relating the connectivity matrix to the behavioural and neuronal effects of inactivation at steady state (Supplementary Notes S9.1). Recall that decoding weights were inferred in the subspace of the leading eigenmodes of the response covariance.
Therefore it is clear that our main results will not be affected by recurrent weights that do not significantly alter neural responses along the principal components of the covariance in MSTd (M) and VIP (V). Instead, we constructed a specific recurrent scheme that would couple responses along the leading modes (Supplementary note S9.2), and used our theoretical results to test whether there exist connection strengths (c) that leave our main conclusions unaltered. (A) Schematic of a recurrent neural network comprising the two brain areas, MSTd (M) and VIP (V). (B) Recurrent connectivity matrices for the extensive (EI) and limited information (LI) models. (C) Unlike in the purely feedforward model, the slopes of the tuning curves of individual neurons in this recurrent network are altered when one of the two brain areas is inactivated. (D) Ratio of the threshold after inactivating one of the areas to the behavioural threshold observed in the intact brain, as a function of the overall connection strength (c) between the areas. For appropriate choices of connection strength (dotted line), the behavioural effects of inactivation are consistent with the experimentally observed outcomes, and nearly identical to the feedforward network for both the limited and extensive information models.

S0 Definitions
Choice Probability and Choice Correlation
Neuronal choice probability (CP_k) is, roughly, the probability of correctly guessing the behavioural response on a given trial based only on the response r_k of that particular neuron k. More precisely, it is the probability that a neural response drawn randomly from one choice-conditioned distribution p(r_k | sgn ŝ > 0) is greater than another neural response from the same neuron but drawn from the other choice-conditioned distribution p(r_k | sgn ŝ < 0), where the choice is taken to be the sign of the estimated stimulus ŝ. Choice correlation (C_k) is simply the trial-by-trial correlation coefficient between neuronal responses and the animal's estimate ŝ of the stimulus. For a task with only two possible behavioural responses, like heading discrimination, these quantities are related according to 3 :

CP_k = 1/2 + (2/π) arcsin(C_k/√2).

The following equation provides an excellent approximation 3 and was used throughout the paper instead of the above equation:

C_k ≈ (π/√2)(CP_k − 1/2).
For convenience, we will express all relations in terms of choice correlations. Corresponding expressions for choice probabilities will follow from equations above.
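For jointly Gaussian responses and estimates, the relation between CP and choice correlation can be checked by simulation: draw response-estimate pairs with a known correlation C, compute the ROC-based choice probability, and compare it with the arcsin prediction CP = 1/2 + (2/π) arcsin(C/√2). This is a self-contained sketch under the Gaussian assumption, not the paper's analysis code:

```python
import numpy as np

# Monte-Carlo check of the CP-CC relation for jointly Gaussian (r, s_hat).
rng = np.random.default_rng(4)
C, n = 0.4, 200_000
z = rng.standard_normal((2, n))
r = z[0]
s_hat = C * z[0] + np.sqrt(1 - C**2) * z[1]       # corr(r, s_hat) = C

r_right, r_left = r[s_hat > 0], r[s_hat < 0]      # choice = sgn(s_hat)
n1, n2 = r_right.size, r_left.size

# ROC area via the rank-sum (Mann-Whitney) identity
allr = np.concatenate([r_right, r_left])
ranks = np.empty(allr.size)
ranks[allr.argsort()] = np.arange(allr.size)
cp_mc = (ranks[:n1].sum() - n1 * (n1 - 1) / 2) / (n1 * n2)

cp_theory = 0.5 + (2 / np.pi) * np.arcsin(C / np.sqrt(2))
```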

Noise Covariance
For a population of N neurons, the noise covariance matrix Σ = ⟨r rᵀ⟩ − ⟨r⟩⟨r⟩ᵀ is an N×N square matrix whose entries correspond to the correlated trial-by-trial variability of all N(N + 1)/2 possible pairs of neurons in response to repeated presentations of a particular stimulus. Its eigendecomposition is Σ = UΛUᵀ, where U is a square matrix whose columns u_i correspond to the eigenvectors of Σ, and Λ is a diagonal matrix whose diagonal entries λ_i correspond to the respective eigenvalues.
When considering two brain areas, we will sometimes describe the noise covariance between populations of neurons using a block matrix. Matrices Σ_xx and Σ_yy denote the covariances within areas x and y respectively, and Σ_xy is the covariance between areas. Thus the covariance Σ of the combined population can be written as:

Σ = [ Σ_xx  Σ_xy ; Σ_xyᵀ  Σ_yy ].

Neuronal weights
The output of a linear decoder is given by ŝ = wᵀr, where the readout weights w satisfy wᵀf′ = 1 so that the estimate is locally unbiased.
We can express these weights w in the eigenbasis of Σ as w = Σ_i v_i u_i, where v_i represents the strength of the readout along the direction specified by the i-th eigenvector u_i (also called the i-th principal component of Σ), and v = (v_1, …, v_N).

Population threshold
The performance of a linear decoder can be characterized by the variance σ̂² of its estimate:

σ̂² = wᵀΣw.

Another common measure of performance is the sensitivity index d′, which is the ratio of the difference in mean response to the standard deviation of the noise. We define the discrimination threshold to coincide with d′ = 1. Then the threshold stimulus change θ of the decoder is given by the standard deviation of the estimate, θ = σ̂. When the stimulus affects the neural response mean but not other statistics (i.e. no stimulus-dependent noise correlations), the Fisher information is exactly equal to the inverse variance of an unbiased, locally optimal linear estimator: J = 1/σ̂² (also assuming differentiable tuning curves and nonsingular noise covariance).
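These definitions can be verified numerically on a toy population; the tuning slopes and covariance below are random, purely illustrative choices:

```python
import numpy as np

# Numerical check: the optimal unbiased decoder attains threshold 1/sqrt(J),
# with J = f' Sigma^-1 f' the linear Fisher information.
rng = np.random.default_rng(1)
N = 6
fprime = rng.normal(0, 1, N)                 # tuning-curve slopes f'(0)
A = rng.normal(0, 1, (N, N))
Sigma = A @ A.T + np.eye(N)                  # nonsingular noise covariance

J = fprime @ np.linalg.solve(Sigma, fprime)  # linear Fisher information
w_opt = np.linalg.solve(Sigma, fprime) / J   # optimal weights, w'f' = 1
theta_opt = np.sqrt(w_opt @ Sigma @ w_opt)   # decoder threshold (d' = 1)
```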

S1 Choice correlations implied by optimal decoding
The analytical relationship between choice correlations, population response, and readout weights has been derived in Ref. 3:

C = D⁻¹ Σ w / θ,

where D is an N×N diagonal matrix whose entries D_kk = √Σ_kk correspond to the standard deviations of neuronal responses across trials, and θ is the behavioural threshold. For optimal decoding, w_opt = Σ⁻¹f′ / (f′ᵀΣ⁻¹f′), which yields C_k,opt = sgn(f′_k) θ/θ_k, where J_k, σ_k², and θ_k correspond to the linear Fisher information, variance, and threshold of neuron k respectively. This proves equation 2.1.
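Equation 2.1 can likewise be checked numerically; the population below is a random toy example, not data:

```python
import numpy as np

# Check of Equation 2.1: under optimal decoding, choice correlations equal
# sgn(f'_k) * theta / theta_k.
rng = np.random.default_rng(2)
N = 5
fprime = rng.normal(0, 1, N)
A = rng.normal(0, 1, (N, N))
Sigma = A @ A.T + np.eye(N)

J = fprime @ np.linalg.solve(Sigma, fprime)
w = np.linalg.solve(Sigma, fprime) / J           # optimal unbiased weights
theta = np.sqrt(w @ Sigma @ w)                   # behavioural threshold
sig = np.sqrt(np.diag(Sigma))                    # per-neuron standard deviations

C = (Sigma @ w) / (sig * theta)                  # C_k = cov(r_k, s_hat)/(sigma_k theta)
theta_k = sig / np.abs(fprime)                   # single-neuron thresholds
C_pred = np.sign(fprime) * theta / theta_k
```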

S2 Choice correlations generated by any generic decoder
Consider a population of N neurons. In this section, we will prove that the choice correlations generated by any arbitrary suboptimal decoder of these neurons can be expressed as a sum of components arising from the individual noise modes of the N×N covariance matrix Σ, according to equation 2.2.
Let us first re-express the choice correlations obtained by optimal decoding (Equation 2.1) as a sum of components arising from the individual noise modes of Σ. The choice correlations of an arbitrary decoder can then be written as C = C_opt A, where C_opt is the matrix whose columns are these per-mode components and A is an N-dimensional vector of scalar multipliers determined by the readout strengths v_i. This proves Equation 2.2 in the main text. Note that the elements of A can be estimated by regressing measured choice correlations against the individual columns of the matrix C_opt of choice correlations predicted by optimal decoding.
If decoding is restricted to the J leading modes of Σ, then we can similarly prove that the choice correlations C̃ generated by suboptimal decoding in this J-dimensional subspace can be expressed as a linear combination of components arising from optimal decoding within this subspace. In other words, C̃ = C_opt^(J) A, where C_opt^(J) is an N×J matrix whose columns correspond to the individual components generated by optimal decoding, and A is the J-dimensional vector of multipliers whose elements depend on the readout strengths v_j, the eigenvalues λ_j of the decoded modes, and the threshold θ^(J) of the optimal decoder within the J-dimensional subspace.
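The regression described above can be sketched as follows: build one choice-correlation component per noise mode, then recover the multipliers of an arbitrary decoder by least squares. The toy population is illustrative:

```python
import numpy as np

# Sketch of the mode-regression idea: the choice-correlation pattern of any
# linear decoder lies in the span of per-mode components, so the multipliers
# can be recovered by least squares.
rng = np.random.default_rng(3)
N = 6
fprime = rng.normal(0, 1, N)
A = rng.normal(0, 1, (N, N))
Sigma = A @ A.T + np.eye(N)
lam, U = np.linalg.eigh(Sigma)
sig = np.sqrt(np.diag(Sigma))

def cc(w):
    """Choice correlations of decoder w (Supplementary note S1)."""
    return (Sigma @ w) / (sig * np.sqrt(w @ Sigma @ w))

# one component per noise mode: read out a single eigenvector at a time,
# signed so that each component decodes the stimulus with positive gain
components = np.stack(
    [cc(U[:, i] * (1.0 if U[:, i] @ fprime >= 0 else -1.0)) for i in range(N)],
    axis=1)

w_arbitrary = rng.normal(0, 1, N)                # some suboptimal decoder
alpha, *_ = np.linalg.lstsq(components, cc(w_arbitrary), rcond=None)
```

Because the N per-mode components are linearly independent, the regression reproduces the suboptimal decoder's choice-correlation pattern exactly.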

S3.1 Limited information model
Consider two populations x and y that receive limited information from their inputs, producing noise fluctuations that, within each local population, look exactly like the global signal f′ = (f′_x, f′_y). When x and y receive both distinct and shared sources of information, the resulting covariance Σ_ε can be written as:

Σ_ε = Σ + [ ε_xx f′_x f′_xᵀ   ε_xy f′_x f′_yᵀ ; ε_xy f′_y f′_xᵀ   ε_yy f′_y f′_yᵀ ],   (S3.1)

where E = [ ε_xx  ε_xy ; ε_xy  ε_yy ] is the covariance of the information-limiting noise components within and between populations, and Σ denotes noise covariance that is not information-limiting. We will now show how the elements of E are related to the variance ⟨δŝ²⟩ of the estimate ŝ obtained by optimally decoding responses in x and y (Equation 1).
The variance of an unbiased, locally optimal linear estimator is equal to the inverse of the linear Fisher information 5 , so in the limit of large populations:

⟨δŝ²⟩ = (ε_xx ε_yy − ε_xy²)/(ε_xx + ε_yy − 2ε_xy).   (S3.2)

Similarly, the variances of the estimates ŝ_x and ŝ_y obtained from optimally decoding x and y separately are given by:

⟨δŝ_x²⟩ = ε_xx,   (S3.3)
⟨δŝ_y²⟩ = ε_yy.   (S3.4)

Equations S3.2-S3.4 explicitly relate the parameters ε_xx, ε_yy, and ε_xy to the variances of the optimal estimates ŝ, ŝ_x, and ŝ_y. Note that these variances are simply the squares of the optimal behavioural thresholds before and after inactivation: ⟨δŝ²⟩ = θ², ⟨δŝ_x²⟩ = θ²_¬y and ⟨δŝ_y²⟩ = θ²_¬x.

S3.2 Extensive information model
We can similarly define $\varepsilon_{xx}$, $\varepsilon_{yy}$, and $\varepsilon_{xy}$ for a rank-two approximation of the noise covariance $\Sigma$ in the extensive information model. To see this, consider two populations x and y with covariances $\Sigma_{xx}$ and $\Sigma_{yy}$. Let $u_x$ and $u_y$ denote the leading eigenvectors of $\Sigma_{xx}$ and $\Sigma_{yy}$, with corresponding eigenvalues $\lambda_x$ and $\lambda_y$. Note that these are not eigenvectors of the full covariance matrix, just of the covariances of each population separately. If, in the full covariance, the leading modes interact to produce correlated noise with strength $\lambda_{xy}$, we can construct a rank-two approximation of the covariance

$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{xy}^\top & \Sigma_{yy} \end{pmatrix} \approx U \Lambda U^\top, \qquad U = \begin{pmatrix} u_x & 0 \\ 0 & u_y \end{pmatrix}, \qquad \Lambda = \begin{pmatrix} \lambda_x & \lambda_{xy} \\ \lambda_{xy} & \lambda_y \end{pmatrix}.$$

Unlike the elements of $\varepsilon$ in Equation S3.1, the elements of $\Lambda$ cannot be directly related to the variance of the output, because the latter depends not only on the magnitude of the noise ($\lambda_x$ and $\lambda_y$) but also on the signal ($u_x^\top f'_x$ and $u_y^\top f'_y$). But we can transform $\Lambda$ to obtain $\varepsilon$, and express the rank-two approximation of the covariance $\Sigma$ in terms of $\varepsilon$ as:

$$\Sigma \approx (UD)\, \varepsilon\, (UD)^\top \quad \text{(S3.5)}$$

where $D = \mathrm{diag}(u_x^\top f'_x,\; u_y^\top f'_y)$ and $\varepsilon = D^{-1} \Lambda D^{-\top}$, so the elements of $\varepsilon$ are related to $\Lambda$ as:

$$\varepsilon_{ij} = \frac{\lambda_{ij}}{(u_i^\top f'_i)(u_j^\top f'_j)}, \qquad i, j \in \{x, y\}.$$

Just like the case of information-limiting noise (Equation S3.1), the elements $\varepsilon_{xx}$, $\varepsilon_{yy}$, and $\varepsilon_{xy}$ determine optimal thresholds according to Equations S3.2-S3.4, with one key distinction: whereas those thresholds correspond to the output of optimal decoding in the case of information-limiting noise, they correspond to the output of optimal decoding only within the subspace of the two leading modes in the case of the extensive information model. Note that we can use the formulation in Equation S3.5 to derive information-limiting noise (Equation S3.1) as a special case by setting $u_x = f'_x/\lVert f'_x \rVert$ and $u_y = f'_y/\lVert f'_y \rVert$, which gives $\Sigma = F' \varepsilon F'^\top$.
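The relationship between the eigenmode parameters and $\varepsilon$ can be checked numerically. The sketch below (writing u_x, u_y for the per-population leading eigenvectors and f_x, f_y for the tuning-curve slopes; population sizes, eigenvalues, and slopes are all arbitrary illustrative choices) builds a rank-two covariance and confirms that decoding population x along its leading mode alone yields variance $\varepsilon_{xx} = \lambda_x/(u_x^\top f'_x)^2$:

```python
import numpy as np

rng = np.random.default_rng(0)
Kx, Ky = 5, 4  # hypothetical population sizes

# Leading eigenmodes (unit norm) and (co)variances of the dominant noise modes
u_x = rng.standard_normal(Kx); u_x /= np.linalg.norm(u_x)
u_y = rng.standard_normal(Ky); u_y /= np.linalg.norm(u_y)
lam = np.array([[3.0, 0.8],
                [0.8, 2.0]])   # [[lam_x, lam_xy], [lam_xy, lam_y]]

# Rank-two covariance Sigma = U Lam U^T with block-diagonal U
U = np.zeros((Kx + Ky, 2))
U[:Kx, 0], U[Kx:, 1] = u_x, u_y
Sigma = U @ lam @ U.T

# Hypothetical tuning-curve slopes
f_x = rng.standard_normal(Kx)
f_y = rng.standard_normal(Ky)

# eps_ij = lam_ij / ((u_i . f_i)(u_j . f_j))
D = np.diag([u_x @ f_x, u_y @ f_y])
eps = np.linalg.inv(D) @ lam @ np.linalg.inv(D).T

# Unbiased readout of x restricted to u_x has variance eps_xx
w_x = u_x / (u_x @ f_x)
var_x = w_x @ Sigma[:Kx, :Kx] @ w_x
print(var_x, eps[0, 0])  # both equal lam_x / (u_x . f_x)^2
```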

S4 Effects of suboptimal decoding on behavioural threshold
In Supplementary note S3, we showed how the optimal thresholds depend on the covariances $\varepsilon_{xx}$, $\varepsilon_{yy}$, and $\varepsilon_{xy}$. We will now investigate how behavioural thresholds are affected by suboptimal weighting of the two populations x and y.

S4.1 Limited information model
Let the combined readout weights of areas x and y be $w = (v_x w_x, v_y w_y)$, where $w_x$ and $w_y$ correspond to the patterns of weights within x and y respectively that each yield individual unbiased estimates, and $v_x$ and $v_y$ are the overall scalings of these weights. We define $w = Wv$, where

$$W = \begin{pmatrix} w_x & 0 \\ 0 & w_y \end{pmatrix}, \qquad v = (v_x, v_y).$$

For unbiased decoding of each population separately, as well as together, we require $W^\top f' = (1,1)^\top$ and $v^\top (1,1)^\top = 1$. In this formulation, selective inactivation of a neural population simply redefines the population readout vector $v$. Specifically, inactivating x and y corresponds to $v = (0,1)$ and $v = (1,0)$ respectively. The behavioural threshold $\theta$ is the square root of the decoder variance, so:

$$\theta^2 = w^\top \Sigma w = v^\top \varepsilon v = v_x^2 \varepsilon_{xx} + 2 v_x v_y \varepsilon_{xy} + v_y^2 \varepsilon_{yy} \quad \text{(S4.1)}$$

This proves Equation 5 (Methods M3). When the population readout vector $v$ is suboptimal, the threshold implied by Equation S4.1 will be larger than the optimal threshold (Equation S3.2). The quadratic form of this equation underlies the U-shaped performance curve shown in Figure 7b.
Similarly, the behavioural thresholds following inactivation of x or y are given by:

$$\theta_{-x}^2 = \varepsilon_{yy} \quad \text{(S4.2)}$$
$$\theta_{-y}^2 = \varepsilon_{xx} \quad \text{(S4.3)}$$

Therefore, when both populations are active, the quality of the decoding is determined by the relative weighting $v$ of the responses of the two populations. However, when one of them is inactivated, the thresholds are near-optimal, limited only by the noise correlations within the active population.
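The quadratic dependence of the threshold on the readout vector, and the post-inactivation thresholds, can be illustrated with a short sketch (arbitrary illustrative values for $\varepsilon$; $v_x$ is swept with $v_x + v_y = 1$ to trace the U-shaped curve):

```python
import numpy as np

# Hypothetical [[eps_xx, eps_xy], [eps_xy, eps_yy]]
eps = np.array([[4.0, 1.5],
                [1.5, 6.0]])

def threshold(v):
    """Behavioural threshold for readout vector v = (v_x, v_y)."""
    return np.sqrt(v @ eps @ v)

# Sweep v_x over [0, 1]: a U-shaped performance curve (cf. Figure 7b)
vx = np.linspace(0.0, 1.0, 101)
thetas = np.array([threshold(np.array([v, 1.0 - v])) for v in vx])

# Inactivation redefines v: v = (0,1) silences x, v = (1,0) silences y
theta_minus_x = threshold(np.array([0.0, 1.0]))   # sqrt(eps_yy)
theta_minus_y = threshold(np.array([1.0, 0.0]))   # sqrt(eps_xx)

print(thetas.min(), theta_minus_x, theta_minus_y)
```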

S4.2 Extensive information model when decoding only dominant noise modes
If decoding is restricted to the single leading eigenmode within each population x and y, then this mode becomes information-limiting in the restricted space. We can express the decoding weights as $w = Uv$, where

$$U = \begin{pmatrix} u_x/(u_x^\top f'_x) & 0 \\ 0 & u_y/(u_y^\top f'_y) \end{pmatrix}$$

is a block-diagonal K×2 matrix containing the leading eigenmode of each area separately, normalized so that $U^\top f' = (1,1)^\top$, which ensures that the estimators from each population in isolation are unbiased. In this case, the behavioural threshold $\theta$ is once again related to the population readout vector $v$ according to $\theta^2 = v^\top \varepsilon v$. Likewise for the thresholds following inactivation, which are identical to Equations S4.1-S4.3.

S5 Effect of suboptimal decoding on choice correlations
We now show that, for the limited information model, the pattern of choice correlations is a scalar multiple of the optimal pattern within each population x and y. More generally, if decoding is restricted to the leading eigenmodes within each population x and y, then we can express unbiased decoding weights as $w = Uv$, with $U$ defined in Supplementary note S4.2 above. From Equations S1.1 and S3.1, the choice correlation of neuron k is $C_k = \mathrm{Cov}(r_k, \hat{s})/(\sigma_k \theta) = (\Sigma w)_k/(\sigma_k \theta)$. If $C_k^z$ denotes the choice correlation of neuron k in population z (which could be x or y), then:

$$C_k^z = \alpha_z\, \frac{\theta\, u_{z,k}\, (u_z^\top f'_z)}{\sigma_k} \quad \text{(S5.1)}$$

where the magnitude of the choice correlations is given by the multiplier

$$\alpha_z = \frac{v_z \varepsilon_{zz} + v_{\bar z}\, \varepsilon_{z\bar z}}{\theta^2} \quad \text{(S5.2)}$$

with $\bar z$ denoting the other population. For the case of information-limiting correlations, we substitute $u_z = f'_z/\lVert f'_z \rVert$ in Equation S5.1 and get:

$$C_k^z = \alpha_z\, \frac{\theta f'_k}{\sigma_k}$$

Therefore, in the presence of information-limiting correlations, the choice correlations of all neurons from a particular population z are a scalar multiple of those resulting from an equivalent optimal decoder with the same behavioural threshold. The two populations x and y can have different multipliers $\alpha_x$ and $\alpha_y$. This proves Equation 2.3.
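The scalar-multiple property can be verified directly by simulating a purely information-limiting covariance. In the sketch below (population sizes, tuning slopes, $\varepsilon$, and the readout scaling $v$ are all arbitrary illustrative choices), the ratio of each neuron's choice correlation to that of an optimal decoder with the same threshold is constant within each population:

```python
import numpy as np

rng = np.random.default_rng(1)
Kx, Ky = 6, 5
f_x = rng.standard_normal(Kx)   # hypothetical tuning slopes, area x
f_y = rng.standard_normal(Ky)   # hypothetical tuning slopes, area y

# Purely information-limiting covariance: Sigma = F eps F^T
F = np.zeros((Kx + Ky, 2))
F[:Kx, 0], F[Kx:, 1] = f_x, f_y
eps = np.array([[4.0, 1.0],
                [1.0, 6.0]])
Sigma = F @ eps @ F.T

# Unbiased per-population weights, combined with scalings v = (v_x, v_y)
W = np.zeros((Kx + Ky, 2))
W[:Kx, 0] = f_x / (f_x @ f_x)
W[Kx:, 1] = f_y / (f_y @ f_y)
v = np.array([0.7, 0.3])
w = W @ v

theta = np.sqrt(w @ Sigma @ w)        # behavioural threshold
sigma = np.sqrt(np.diag(Sigma))       # single-neuron standard deviations
C = (Sigma @ w) / (sigma * theta)     # choice correlations of this decoder
C_opt = theta * np.concatenate([f_x, f_y]) / sigma  # optimal decoder, same theta

alpha = C / C_opt
print(alpha[:Kx])  # constant within population x
print(alpha[Kx:])  # constant within population y
```

The constants match the predicted multipliers $\alpha_z = (v_z \varepsilon_{zz} + v_{\bar z}\varepsilon_{z\bar z})/\theta^2$.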

S6 Combining choice correlations and inactivation effects
In Supplementary notes S4 and S5, we showed how the behavioural thresholds ($\theta$, $\theta_{-x}$, and $\theta_{-y}$) and the multipliers on choice correlations ($\alpha_x$ and $\alpha_y$) depend on the relative scaling of the weights ($v_x$ and $v_y$). Now we will combine and invert those results to provide a way to infer the scaling of the weights from measurements of thresholds and choice correlations.
The ratio of the multipliers $\alpha_x/\alpha_y$ can be written explicitly in terms of the elements of $\varepsilon$ using Equation S5.2 as:

$$\frac{\alpha_x}{\alpha_y} = \frac{v_x \varepsilon_{xx} + v_y \varepsilon_{xy}}{v_y \varepsilon_{yy} + v_x \varepsilon_{xy}} \quad \text{(S6.1)}$$

S6.1 Uncorrelated populations
If populations x and y are uncorrelated, then $\varepsilon_{xy} = 0$. Substituting in Equation S6.1 gives:

$$\frac{\alpha_x}{\alpha_y} = \frac{v_x \varepsilon_{xx}}{v_y \varepsilon_{yy}}$$

If behaviour is indeed largely driven by responses along the leading modes of variance in x and y, then from Equations S4.2 and S4.3, the post-inactivation thresholds are $\theta_{-x}^2 \approx \varepsilon_{yy}$ and $\theta_{-y}^2 \approx \varepsilon_{xx}$. This allows us to express the relative scaling of the weights purely in terms of the relative magnitudes of choice correlations and inactivation effects:

$$\frac{v_x}{v_y} = \frac{\alpha_x}{\alpha_y} \cdot \frac{\theta_{-x}^2}{\theta_{-y}^2}$$
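This inversion can be checked with a few lines of arithmetic: starting from hypothetical ground-truth scalings and uncorrelated $\varepsilon$, the forward model gives $\alpha$'s and post-inactivation thresholds, and combining them recovers the scaling ratio exactly:

```python
# Hypothetical ground truth (illustrative values; eps_xy = 0)
eps_xx, eps_yy = 4.0, 9.0
v_x, v_y = 0.6, 0.4

# Forward model: threshold and choice-correlation multipliers (S4.1, S5.2)
theta2 = v_x**2 * eps_xx + v_y**2 * eps_yy
alpha_x = v_x * eps_xx / theta2
alpha_y = v_y * eps_yy / theta2

# Post-inactivation thresholds (S4.2, S4.3)
theta_mx2, theta_my2 = eps_yy, eps_xx

# Inverse: recover relative scaling from observables only
ratio = (alpha_x / alpha_y) * (theta_mx2 / theta_my2)
print(ratio, v_x / v_y)  # both 1.5
```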

S6.2 Correlated populations
Let populations x and y be correlated according to $\varepsilon_{xy} = \beta\, \varepsilon_{xx}$, where $\beta$ denotes the strength of correlations between neurons across the populations relative to those within population x.

S7 Effect of measurement uncertainty
Neuronal weights $w$ are related to choice correlations $C$ and covariance $\Sigma$ as 3,4:

$$w \propto \Sigma^{-1} S\, C$$

where $S = \mathrm{diag}(\sigma_1, \ldots, \sigma_K)$ is the diagonal matrix of single-neuron response standard deviations. Without loss of generality, we assume $(\Sigma^{-1} S C)^\top f' = 1$, so that decoding is unbiased. Any uncertainty in estimating $\Sigma$, $S$, or $C$ will manifest as uncertainty about the decoding weights inferred from the above equation. Even under the assumption of a particular noise model (i.e. $\Sigma$ and $S$ known exactly), uncertainty in measuring $C$ alone can still give rise to uncertainty in $w$. To show this, we denote the estimated choice correlations by $\hat{C} = C + \delta C$, where $C$ is the true choice correlation and $\delta C$ is the measurement error. The estimated weights are then given by:

$$\hat{w} \propto \Sigma^{-1} S (C + \delta C) = w + \Sigma^{-1} S\, \delta C$$

Writing $\Sigma = \sum_i \rho_i\, e_i e_i^\top$ in terms of its eigenvectors $e_i$ and eigenvalues $\rho_i$, the error in the inferred weights along $e_i$ is $e_i^\top S\, \delta C / \rho_i$. Though these errors are relatively small along directions with large noise variance (large eigenvalues $\rho_i$), they can be amplified enormously along directions with small noise variance (small $\rho_i$). Due to these amplified measurement errors, one can realistically infer only those components of the neuronal weights that lie along the first few leading eigenvectors of $\Sigma$ (Figure S8). If the true readout weights lie largely within the subspace spanned by these components, then the inferred readout will be nearly accurate, and the resultant choice correlations will have magnitudes comparable to the measured ones (see Supplementary modeling section S7 of ref. 3 for proof).
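The eigenvalue-dependent amplification can be demonstrated with a small sketch (covariance spectrum, choice correlations, and measurement errors are all arbitrary illustrative choices): the weight error projected onto each eigenmode of $\Sigma$ equals the projected measurement error divided by that mode's eigenvalue, so low-variance modes dominate the error:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 8

# Hypothetical covariance with a few dominant modes and many weak ones
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))   # orthonormal eigenvectors
rho = np.array([10.0, 5.0, 1.0, 0.5, 0.1, 0.05, 0.01, 0.005])  # eigenvalues
Sigma = Q @ np.diag(rho) @ Q.T
S = np.diag(np.sqrt(np.diag(Sigma)))               # single-neuron s.d.

C = 0.1 * rng.standard_normal(K)    # "true" choice correlations (hypothetical)
dC = 0.01 * rng.standard_normal(K)  # small measurement error

w_true = np.linalg.solve(Sigma, S @ C)
w_est = np.linalg.solve(Sigma, S @ (C + dC))
dw = w_est - w_true                 # = Sigma^{-1} S dC

# Weight error along each eigenmode = (projected error) / eigenvalue
err_along_modes = Q.T @ dw
print(np.abs(err_along_modes))      # grows as rho shrinks
```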

S8 Modeling partial inactivation
The results derived in Supplementary notes S4 and S6 assumed that inactivation experiments silence all neurons in the target area. In this section, we re-derive the expressions for decoding weights by relaxing this assumption. To accomplish this, we introduce two additional parameters, $\gamma_x$ and $\gamma_y$, to model the fractions of neurons in areas x and y, respectively, that remain active following inactivation of those areas.
Note that Equations S8.1 and S8.2 are uncoupled if $\varepsilon_{xy} = 0$. This is the case for the extensive information model, and therefore $\gamma_x$ and $\gamma_y$ are independent for that model (Figure S14A). For the limited information model, on the other hand, the above equations provide a joint constraint on $v_x$, $\gamma_x$, and $\gamma_y$, and therefore their solutions are correlated (Figure S14B).

S9 Recurrent network model
Although all theoretical results on choice correlations are agnostic to the choice of network architecture, the specific behavioural predictions for inactivating either brain area derived in Supplementary note S4 are not. There, we incorporated the assumption of a purely feedforward model by asserting that the slopes of the tuning curves of neurons in either area remain unchanged following inactivation of the other area. However, in recurrent networks, activity in one area can influence responses in other areas. If there were recurrent connections between areas x and y, the lack of lateral inputs following inactivation could alter the responses of neurons in the non-inactivated area, possibly rendering the conclusions drawn from the feedforward model invalid. Here, we show that the main conclusions may nonetheless remain true for at least some recurrent networks. We first derive general results that show how neural responses and information content are modified following inactivation in the presence of linear recurrent connections. (Note that this general architecture includes decision feedback as a special case, when the readout weight vector of a population is in the row space of the recurrent weight matrix.) We then focus our analyses on a particular structure of recurrent connections and examine the performance of the network by varying only the connection strength between the two areas to demonstrate our point.

S9.1 Effect of inactivation in recurrent networks
Consider the network shown in Figure S15A, where the responses of neurons in areas x and y are modulated by a constant stimulus s with gains $b'_x$ and $b'_y$ respectively, in addition to receiving inputs from other neurons as determined by the recurrent connectivity matrix A. The responses $\mathbf{r}$ are modeled by the following stochastic linear dynamical system:

$$\mathbf{r}_{t+1} = A \mathbf{r}_t + b'\, s + \eta_t \quad \text{(S9.1)}$$

where the connectivity matrix A is a block matrix given by

$$A = \begin{pmatrix} A_{xx} & A_{xy} \\ A_{yx} & A_{yy} \end{pmatrix}, \qquad b' = (b'_x, b'_y),$$

$\eta_t \sim \mathcal{N}(0, H)$ is zero-mean noise with covariance H, and the subscript t denotes discrete time.
The steady-state covariance $\Sigma$ of the neural responses is given by the following discrete-time Lyapunov equation:

$$\Sigma = A \Sigma A^\top + H \quad \text{(S9.2)}$$

and the steady-state mean of the neural response f(s) is given by:

$$f(s) = (I - A)^{-1} b'\, s \quad \text{(S9.3)}$$

Note that in the absence of recurrent connections, the response covariance equals the covariance of the input noise, i.e. $\Sigma = H$ if $A = 0$. For a given connectivity structure A, knowledge of $\Sigma$ can be used to solve Equation S9.2 for H. The covariance in area x (or y) following inactivation of area y (or x) can then be obtained by solving:

$$\tilde\Sigma_{xx} = A_{xx} \tilde\Sigma_{xx} A_{xx}^\top + H_{xx}, \qquad \tilde\Sigma_{yy} = A_{yy} \tilde\Sigma_{yy} A_{yy}^\top + H_{yy} \quad \text{(S9.4)}$$

Similarly, the slope of the tuning curve, $f'$, equals the input sensitivity $b'$ if $A = 0$. Otherwise, for a given A, the sensitivity can be uniquely solved from the slope of the tuning curve as $b' = (I - A) f'$. The tuning-curve slopes of the intact area following inactivation can be determined by solving:

$$\tilde f'_x = (I - A_{xx})^{-1} b'_x, \qquad \tilde f'_y = (I - A_{yy})^{-1} b'_y \quad \text{(S9.5)}$$

The four equations S9.2-S9.5 together allow us to determine the signals $\tilde f'_x$ and $\tilde f'_y$ and the covariances $\tilde\Sigma_{xx}$ and $\tilde\Sigma_{yy}$ following inactivation, which in turn provide upper bounds on the behavioural thresholds following inactivation:

$$\theta_{-x}^2 = 1 \big/ \big(\tilde f'^{\,\top}_y \tilde\Sigma_{yy}^{-1} \tilde f'_y\big), \qquad \theta_{-y}^2 = 1 \big/ \big(\tilde f'^{\,\top}_x \tilde\Sigma_{xx}^{-1} \tilde f'_x\big).$$

We then constructed a specific recurrent connectivity whose eigenvectors are the sum and difference of the two areas' leading modes, with corresponding eigenvalues set by the connection strength between the areas. In this scheme, the sum and difference modes are amplified and attenuated respectively when the connection strength is positive, and vice versa when it is negative. The resulting connectivity structure for the extensive and limited information models for J = 4 is shown in Figure S15B. Using this structure, we used Equations S9.2-S9.5 to evaluate the effect of inactivation over a range of connection strengths for both models. The ratio of behavioural thresholds after inactivation to thresholds before inactivation is shown in Figure S15D.
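The steady-state computations above can be sketched numerically. The example below (a hypothetical stable connectivity and input-noise covariance, chosen purely for illustration) solves the discrete-time Lyapunov equation with scipy.linalg.solve_discrete_lyapunov and recovers the steady-state tuning slope from the input sensitivity:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(3)
K = 6

# Hypothetical stable recurrent connectivity (spectral radius 0.6 < 1)
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))
A = 0.6 * Q
H = 0.5 * np.eye(K)                    # input-noise covariance

# Steady-state covariance: Sigma = A Sigma A^T + H
Sigma = solve_discrete_lyapunov(A, H)

# Steady-state tuning slope: f' = (I - A)^{-1} b'
b = rng.standard_normal(K)             # hypothetical input sensitivity b'
f_prime = np.linalg.solve(np.eye(K) - A, b)

# Residual of the Lyapunov equation should be ~0
print(np.max(np.abs(A @ Sigma @ A.T + H - Sigma)))
```

Post-inactivation quantities follow the same recipe applied to the diagonal blocks of A and H.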
We found that inactivation of either area affected behaviour differently depending on the strength of the connection between the areas. Behaviour is predicted to get worse for both models when the connection is inhibitory, whereas behaviour following inactivation improves when the connections are excitatory and strong. This dependence of inactivation effects on connection strength allowed us to identify a range of intermediate-strength connections whose inactivation effects were similar to those of the purely feedforward model, and hence also consistent with our experimental results. For these connection strengths, inactivation of either area amplified the tuning-curve slopes in both models (Figure S15C). Note that, regardless of the choice of connection strength, the recurrent network yields the same covariance $\Sigma$ of the neural responses by construction. Consequently, the choice correlations and readout weights of neurons in the recurrent network are identical to those implied by the feedforward model.

S10 Effect of selective inactivation on choice correlations in the noninactivated area
Since a neuron's choice correlation depends both on its own readout weight and on the weights of the other neurons with which it is correlated 3, silencing those other neurons is bound to affect its choice correlation. For this reason, when information is distributed across correlated populations, selectively inactivating one of them will naturally affect choice correlations in the non-inactivated area. Here we consider two populations x and y and show how choice correlations in each should change following inactivation of the other area.