Redundant representations are required to disambiguate simultaneously presented complex stimuli

A pedestrian crossing a street during rush hour often looks and listens for potential danger. When they hear several different horns, they localize the cars that are honking and decide whether or not they need to modify their motor plan. How does the pedestrian use this auditory information to pick out the corresponding cars in visual space? The integration of distributed representations like these is called the assignment problem, and it must be solved to integrate distinct representations both across and within sensory modalities. Here, we identify and analyze a solution to the assignment problem: the representation of one or more common stimulus features in pairs of relevant brain regions—for example, estimates of the spatial position of cars are represented in both the visual and auditory systems. We characterize how the reliability of this solution depends on different features of the stimulus set (e.g., the size of the set and the complexity of the stimuli) and the details of the split representations (e.g., the precision of each stimulus representation and the amount of overlapping information). Next, we implement this solution in a biologically plausible receptive field code and show how constraints on the number of neurons and spikes used by the code force the brain to navigate a tradeoff between local and catastrophic errors. We show that, when many spikes and neurons are available, representing stimuli from a single sensory modality can be done more reliably across multiple brain regions, despite the risk of assignment errors. Finally, we show that a feedforward neural network can learn the optimal solution to the assignment problem, even when it receives inputs in two distinct representational formats. We also discuss relevant results on assignment errors from the human working memory literature and show that several key predictions of our theory already have support.

Below, we have provided a detailed response to each reviewer comment as well as a reference to the line in the paper where the comment is addressed. We have reproduced all text from the reviews in blue; the text from our revised manuscript is given in green. The line numbers refer to lines in the tracked changes version of the manuscript. For ease of reference, the citations in quoted parts of our revised manuscript are reproduced at the bottom of the document here.
2 Detailed response to reviewers

2.1 Reviewer 1

In this study, Johnston and Freedman proposed a theoretical solution for how distributed representations are integrated (the assignment problem): that one or more common stimulus features are represented in pairs of relevant brain regions. They further implemented the solution in a biologically plausible receptive field code, and showed the tradeoff between assignment errors and redundancy. Lastly, they trained a feedforward neural network to show that the optimal solution can be learned when an integration layer was included.
Overall, I think this is a technically sound study that addresses an important question in the field, and has potential implications for experimental work in different domains. Meanwhile, I have a few clarification questions:

We thank the reviewer for their thoughtful review.

2.1.1
1. This work focuses on theoretical solutions of the assignment problem, but I think it would be greatly helpful if the authors could provide more concrete examples, or more concrete predictions, of the solution. For instance, it is reasonable to expect such integrated representations to be found in multi-sensory associative regions, in the example of visual-auditory integration; but how exactly would common features develop and integrate in the case of multi-region representations within a single sensory system? From how I read it, the authors were referring to regions in different sensory pathways (e.g., the dorsal and ventral visual pathways, as in ref. 30); however, the authors also talked about regions within a sensory pathway (e.g., along the ventral visual stream, as in ref. 49). So, making it more explicit, and possibly discussing where in the brain to expect such integrated representations, could help.
We have included additional detail about and motivation for our predictions in an expanded section of the discussion: (line 454) Second, our work shows that explicit bound representations of the integrated and common features may be a necessary step in solving the assignment problem. Following previous work [1][2][3] and our neural network results, we believe that these bound representations will be instantiated through a conjunctive population geometry for the bound features -similar to the multi-dimensional receptive fields used above (though the representation need not have exactly this form). Thus, for a multi-sensory classification task, we expect to find these bound representations in multi-sensory association areas (such as posterior parietal cortex [4,5], ventrolateral prefrontal cortex [6], and the superior temporal sulcus [7]; also see [8]). For complex visual decision-making tasks where integration across the dorsal and ventral visual streams is required, we expect to find bound representations in brain regions receiving input from both streams (such as prefrontal or posterior parietal cortex [9,10]) as well as regions within each stream that receive information from the other stream. For instance, recent work has shown combined representations of visual form and motion in the inferotemporal cortex [11], though their geometry was not characterized (i.e., it is unknown whether the representations are bound). Similar integrated representations of visual form and motion have also been found in the superior temporal polysensory area [12] (and in the middle [13] and body-selective patches within [14] the superior temporal sulcus).
Third, our framework also makes predictions that can be tested with population recordings from uni-sensory brain regions, during the performance of a multi-sensory task (or, at least, a task that is believed to involve assignment). Our work predicts that behavioral assignment errors -i.e., a swap error [2,15,16], described above -will be correlated with errors in the representation of the common features specifically in the direction of other stimuli. For instance, in a visual-auditory integration task with multiple stimuli, we expect that trials in which a position decoder for one of the stimuli makes errors toward another stimulus will also be associated with a greater probability of behavioral assignment errors. We do not expect errors in the representation of non-common features to have this same correlation with assignment errors -instead, we expect that they will be associated with local variance in the animal's report and would be detectable in continuous report tasks, such as [17].
We have also included additional discussion of the dorsal-ventral case of assignment (within the quote above), beginning on line 465.

2.1.2
2. In Section 2.4, the authors discussed two errors that contribute to combined MSE: local and threshold errors. The tradeoff between the two errors is critical in determining the optimization of MSE. However, in the following Section 2.5, all analyses were restricted to RF codes with negligible threshold error rates, making the assignment problem a lot simplified. I did not find the rationale for doing this, and I would like to see more clarification and discussion on this issue.
We thank the reviewer for their attention to this point. We now discuss how threshold errors interact with the assignment process in the paper, (line 286) Next, we link these RF codes to our theory of assignment errors. The local variance of the RF code is equivalent to the local variance in our framework. However, the threshold errors have no clear analogue in our current theory. Since the inferred stimulus produced by a threshold error is uniformly distributed across the whole stimulus space, threshold errors are extremely disruptive to assignment: The optimal strategy is not to integrate a representation corrupted by a threshold error during the assignment process at all. This unbalanced version of the assignment problem has been studied in combinatorics [18], but is beyond the scope of the current work. However, threshold errors are typically unlikely once the optimal RF width has been chosen. In particular, we find that codes with non-negligible threshold error rates also tend to have high total error. To proceed, we exclude codes with high total error, both due to the issues described for threshold errors and because our analytic calculation for the assignment error rate assumes local variance that is small relative to the size of the stimulus space.
We had initially excluded codes with non-negligible threshold error rates due to this complication. When we returned to include them, however, we found that the only codes that have high threshold error rates also have high local variance (MSE) and therefore high assignment error rates. All of the same codes are excluded when we exclude codes with high total error. We believe that this is necessary since our calculation of the assignment error rate requires local variance (MSE) that is small relative to the size of the stimulus space. Thus, we have excluded codes in section 2.5 with total error greater than half the size of the stimulus space, and noted this condition in the text, (line 332) Codes with total error that exceeds half the size of the stimulus space are excluded from this analysis for the reasons discussed above.

2.1.3
3. Variants of the assignment problem have been studied in previous literature, such as the feature binding problem (e.g., Schneegans & Bays, 2017), and the causal inference problem (e.g., Kording et al., 2007). Although these alternative models were briefly introduced in Introduction, a more thorough discussion on the similarities and differences between the current and previous models would still be necessary. For example, the paper discussed successful integration and incorrect integration (i.e., swap error as in feature binding), but how would segregation (e.g., separate sources of visual and auditory information as in causal inference) occur and quantified in the current solution?
The reviewer is correct that we have focused on what we call "balanced" assignment in this manuscript, where there is a single, correct one-to-one mapping between sets of representations of equal size. However, we agree that the unbalanced case is also interesting. We have added an expanded discussion of this case to the Discussion, with some notes about how our framework can be generalized to that case, (line 534) We have also discussed the case of balanced assignment between two brain regions, where each region represents the same number of stimuli and there is a correct one-to-one mapping between those stimuli. However, this is a simplification: in many cases, there may be more representations in one region than the other (e.g., speakers who are not visible). Or, even when there are the same number of representations, a one-to-one mapping may not be correct (e.g., a speaker behind the subject and another person in view who is not speaking). This latter situation has been studied for a single auditory and single visual stimulus [19][20][21]. This work found evidence that human observers infer a single cause for the two percepts only when they are nearby in space; otherwise, they infer independent causes -and further shows that this process is well-modeled by Bayesian inference, depending on the local variance of auditory and visual position estimates as well as the prior rate of common relative to distinct causes [21]. Future work can combine our approach to the assignment problem with this inference of the number of distinct causes in the environment. This could be achieved by considering more potential mappings between the sets of representations, along with a prior giving the strength of the expectation that each representation will be integrated. This generalization of our framework would lead to additional predictions for experimental data with unbalanced numbers of stimuli across distinct sensory modalities.
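As an illustrative aside (not drawn from the manuscript), the balanced assignment described above can be framed as maximum-likelihood matching of the common-feature estimates across the two regions. The sketch below is our own minimal illustration under simplifying assumptions: the function name ml_assignment and all parameter values are hypothetical, and we assume equal-variance Gaussian noise on every estimate, in which case minimum squared-distance matching is equivalent to maximum likelihood.

```python
import itertools
import math
import random

def ml_assignment(est_a, est_b, noise_sd):
    """Brute-force maximum-likelihood matching of two sets of noisy
    common-feature estimates (equal-variance Gaussian noise assumed).
    Returns the permutation of est_b that best matches est_a."""
    n = len(est_a)
    best_perm, best_ll = None, -math.inf
    for perm in itertools.permutations(range(n)):
        # Gaussian log-likelihood of pairing est_a[i] with est_b[perm[i]]
        ll = -sum((est_a[i] - est_b[perm[i]]) ** 2 for i in range(n))
        ll /= 2 * noise_sd ** 2
        if ll > best_ll:
            best_ll, best_perm = ll, perm
    return best_perm

random.seed(0)
true_pos = [0.1, 0.5, 0.9]  # true common-feature (e.g., spatial position) values
sd = 0.02                   # local standard deviation of each estimate
est_a = [p + random.gauss(0, sd) for p in true_pos]  # e.g., visual estimates
est_b = [p + random.gauss(0, sd) for p in true_pos]  # e.g., auditory estimates
print(ml_assignment(est_a, est_b, sd))
```

An assignment (swap) error corresponds to ml_assignment returning a non-identity permutation; with well-separated stimuli and small local variance, as here, the identity mapping is recovered. The unbalanced generalization discussed above would additionally score mappings that leave some representations unpaired.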
We have also expanded our discussion of the alternative models, (line 437) This work extends previous work on the assignment problem in several ways. First, previous studies have considered either a single stimulus in both modalities (one auditory and one visual stimulus in [21]) or a single stimulus that must be assigned to one stimulus from a larger array (using a cue to select a single stimulus from a remembered array in [1,2]). Here, we have shown how the rate of assignment errors scales as more stimuli need to be integrated. Second, previous work used a single common feature for binding [1,2,21], while we have shown how assignment error rates scale for additional common features. Third, the formulation of the neural code that we use is similar to that used in [2]; however, we have derived novel closed form expressions for the total, local, and threshold error rates of these codes -using these expressions we investigate a wider variety of different neural architectures (e.g., splitting representations across different regions) than have been considered in previous work. Finally, we believe that our neural network approach to producing assigned representations is novel, as previous work only characterized the error rates that would be expected from an ideal observer [1,2,21]. Thus, we believe that this work contributes to our understanding of the assignment problem, building on top of this foundational work.

2.2 Reviewer 2
This manuscript describes a computational solution to the problem of assigning multiple properties (represented by distinct regions/modalities) to multiple objects, a problem related to causal inference in multisensory integration but with a slightly different twist. The approach seems elegant and creative, the results are interesting and the paper very well written. I think it should be published and will be of interest to many.
We thank the reviewer for their kind comments about our manuscript.

2.2.1
However, although logically and mathematically sound, as a theory of brain function I found it somewhat lacking. The premise as explained out of the gate seems a little farfetched and possibly circular. It assumes a kind of object-level organization that may not exist in most sensory cortices, at least not early/primary areas. Labeling the relevant visual area V1 in Fig. 1A (instead of, say, IT) for high-level representations like dog/cat is of course incorrect (not sure about A1), which doesn't make a good first impression. That's a minor thing, but I think the issue is a deeper one, and relates to two poorly defined concepts: 'object' and 'region' (see below). I also have a third general comment about the theory's (in my view insufficient) contact with neurophysiology and anatomy.
We have changed the example given in the introduction as follows: (line 39) Coherent behavior in complex natural environments requires extensive and reliable integration of different forms of information about the world. For instance, a pedestrian crossing a crosswalk during rush hour attends to the flow of traffic around the intersection -if they hear honking, it can be important to quickly localize the honking cars and decide whether they need to change their motor plan. While navigating cluttered multisensory environments like this one often appears effortless for humans and other animals, it requires two highly non-trivial computations: Object segmentation and representation assignment.
We have also amended the labels to be ITC and A1. We agree that this is more accurate for representations of animals. We agree also that the concepts of object and region are difficult and lead to some confusion and apparent circularity. We have also clarified what we mean by brain region: (line 160) Finally, throughout this work we organize our discussion using the concept of distinct brain regions, by which we mean distinct populations of neurons that may share information with each other through anatomical connections. We primarily use this concept for simplicity of exposition. In general, all of the same principles apply when different features (or combinations of features) are represented in distinct subspaces of the same neural population activity. That is, the assignment problem will also need to be solved to integrate across these distinct subspaces, even though the subspaces do not correspond to distinct populations of neurons.
In addition, we now emphasize that these assigned representations are necessary for decision-making in the presence of multiple options: (line 363) In general, we believe that the assignment problem is solved to make decisions that require the linking of two or more stimulus features (e.g., asking what the value of feature X is for stimuli with a particular value of feature Y, as in the human working memory studies discussed above).
This contextualization of the assignment problem in a decision-making process is now mentioned in other parts of the paper as well.
We do believe that assigned representations of certain (particularly decision-relevant) features are necessary to avoid the ambiguity arising from multiple representations that we describe in the paper, but we agree that platonic object representations may not exist. We have attempted to clarify that throughout without compromising clarity of exposition.

2.2.2
1. First, the conception of the assignment problem assumes that the segmentation and binding problems (within a modality) have already been solved, and all that remains is to assign features to the appropriate objects. I wrote a long diatribe about this before realizing that the paper covers it concisely in the Discussion -shame on me for not reading all the way through first, but it means that many readers will have this impression too, and you should confront this limitation earlier, i.e. in the Introduction. I still think it might weaken the premise due to circularity, since one could argue segmentation and binding already require a solution to the assignment problem. But I suppose that's no reason to hold up publication; the community can be the judge.
We agree that this is important to clarify. We have added a note about representations of bound features to the Introduction, (line 114) In the feedforward network, we also show how the integrated stimuli can be reliably represented: through nonlinearly mixed representations of the integrated features, which follows from previous work on representations of multiple stimuli [1][2][3].
We believe that assignment, segmentation, and binding are all interleaved and now discuss this more in the manuscript, (line 522) The simultaneous representation of multiple stimuli in neural activity is not fully understood. In particular, we have provided a solution to the assignment problem, which must be solved when integrating representations of multiple stimuli that are distributed across different brain regions, distinct neural population subspaces within a single brain region, or even time points. However, the assignment problem is only one of several difficulties that arise when multiple stimuli are represented simultaneously. One additional problem is the segmentation, or clustering, of sensory information into a discrete set of causes (i.e., stimuli or objects). While there is extensive work on this process in psychology [22][23][24] and in machine learning [25,26], the neural mechanisms are not fully understood (but see [27,28]). A second additional problem that we have already mentioned is the representation of bound stimulus features in a single population of neurons; in this work, we have assumed that this representational problem has already been solved, perhaps by the conjunctive feature representations posited by other work [1][2][3] (but see [29][30][31][32] for alternatives). In the real brain, each of these problems is likely solved at many different stages of sensory processing (or even simultaneously), where lower-level components are clustered into higher-level features (e.g., combining points into an edge), a correct binding of those features is inferred (e.g., that edge is combined with a representation of its motion), then that bound set of features is represented (and can become a new, higher-level feature itself -e.g., perhaps the edge is part of a running dog).
While we have focused on one part of this process, future work can integrate these parts and more fully explore how they interact -both with each other, and with other concerns (such as how multiple representations in a single neural population interact with each other, see [33]). This unifying work will be necessary to develop a full understanding of the rich ways that simultaneously representing multiple stimuli constrains the neural code.

2.2.3
2. Second, the notion of a 'brain region' (especially critical for section 2.5), and its relationship to perception, is underdeveloped. The authors write: "The assignment problem must be solved when stimulus features are represented in distinct brain regions" -Why? Because perception requires convergence of information onto a single region (or single neurons)? That's a big assumption. What region might that be? Isn't our modern understanding that perception and behavior emerge from the collective activity of many interconnected populations/regions? If so, it becomes harder to motivate the problem, I think.
And even if 'region' is the right level of anatomical organization to think about here, how do you define the term? Neural populations are separated spatially but connected synaptically-do two highly interconnected but spatially disparate populations count as one region? What about two very nearby but only weakly connected populations? Or is a region defined by a complete representation of some stimulus dimension, like how boundaries between visual and somatosensory areas are defined by where the retinotopic or somatotopic map mirror-flips and begins to repeat? This works for most early sensory cortices but probably not later ones (e.g. where 'dog' and 'cat' live).
We agree that brain regions are a slippery concept, and have included a working definition in the text (also quoted above), (line 160) Finally, throughout this work we organize our discussion using the concept of distinct brain regions, by which we mean distinct populations of neurons that may share information with each other through anatomical connections. We primarily use this concept for simplicity of exposition. In general, all of the same principles apply when different features (or combinations of features) are represented in distinct subspaces of the same neural population activity. That is, the assignment problem will also need to be solved to integrate across these distinct subspaces, even though the subspaces do not correspond to distinct populations of neurons.
We have also discussed how assignment can unfold over time instead of across distinct brain regions, (line 486) In general, we expect that these intermediate representations will emerge prior to a pure representation of the decision variable -both in brain regions that are earlier in an established processing hierarchy as well as temporally earlier within brain regions that ultimately express representations of the decision variable. While the results from our neural network study underline the importance of this bound representation for computing the decision variable ( Figure 5), a single neural population could simultaneously represent these distinct forms of information in separate subspaces of activity. These bound representations have also been identified as important for representation recoding [34]. Future work could identify whether these recoding and assignment operations can be performed simultaneously in the same integrated representation.
In general, we agree that the collective responses of populations of neurons (which need not be in a particular anatomical region) are the important unit of computation -and we believe this is consistent with transitioning from non-assigned representations (which have a certain geometry) to assigned representations (which have a different geometry). We discuss these geometries more in the discussion as well, (line 472) Following previous work [1][2][3] and our neural network results, we believe that these bound representations will be instantiated through a conjunctive population geometry for the bound features -similar to the multi-dimensional receptive fields used above (though the representation need not have exactly this form).
This same section is also quoted above.
The point is not that this paper needs to solve these quasi-philosophical issues (although there are some philosophers out there who might be worth consulting). But it could be made clearer what is assumed about the 'read-out' of neural activity that necessitates the solution you propose. In other words, after the assignment problem is solved, what sort of neural representation are you left with, and what is the next step in the causal chain to behavior?
In the new sections quoted above, we have emphasized the geometry of the assigned neural representation as well as the fact that assigned representations can then be transformed into the representation of a decision variable, which can then be used to guide a decision.

2.2.4
3. Lastly, the theory could be better linked to the neurobiology, assuming that is a goal the authors seek. Aspects of this issue have been touched on above: defining regions better and being more up-front about the requirement for object-level representations (implicating later stages of processing). But when it comes to specific predictions for experiments, the relevant paragraph in Discussion (387-396), although a good start, comes up short.
For instance, "Brain regions that commonly represent multiple stimuli will all have some redundancy with other brain regions" -that's a bit vague and probably trivially true (or already known to be true), therefore not very useful as a prediction.
And, "Our framework can be used to predict assignment error rates based on the level of redundancy between two brain regions" -how exactly would that play out? Can you be more specific?
We agree that the "level of redundancy" statement is a little too vague -and relies in part on an approach that we removed from the manuscript prior to submission for coherence, so we have removed it.
We have worked to make our other predictions more precise, as well as expanded on some more specific analysis approaches. For example, (line 459) In some cases, the redundant information between two regions may be clear due to experimental design. However, in cases where redundant features are not readily identifiable, population analyses can be used to discover redundant information directly by building models that explain the activity in one region in terms of the activity in the other. Then the activity explained by this approach can be modeled in terms of known experimental variables. For instance, in this way, common features across the parallel ventral and dorsal visual streams in the primate visual system can be discovered (though some features, such as spatial position, have already been shown to be redundant across the two streams [35]). And (line 496) Third, our framework also makes predictions that can be tested with population recordings from uni-sensory brain regions, during the performance of a multi-sensory task (or at least a task that is believed to involve assignment). Our work predicts that behavioral assignment errors -i.e., a swap error [2,15,16], described above -will be correlated with errors in the representation of the common features specifically in the direction of other stimuli. For instance, in a visual-auditory integration task with multiple stimuli, we expect that trials in which a position decoder for one of the stimuli makes errors toward another stimulus will also be associated with a greater probability of behavioral assignment errors. We do not expect errors in the representation of non-common features to have this same correlation with assignment errors -instead, we expect that they will be associated with local variance in the animal's report and would be detectable in continuous report tasks, such as [17].
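To make the decoder-error prediction quoted above concrete, here is a minimal simulation sketch. This is our own illustration, not an analysis from the manuscript; the function name simulate_prediction and all parameter values are hypothetical. It checks that trials on which a position decoder's error for one stimulus points toward the other stimulus also show a higher rate of assignment (swap) errors under minimum-distance matching:

```python
import random

def simulate_prediction(n_trials=5000, gap=0.1, sd=0.05, seed=2):
    """Two stimuli at positions 0 and `gap`; regions A and B each carry
    a noisy position estimate for both stimuli.  A swap (assignment
    error) occurs when minimum squared-distance matching pairs the
    estimates across regions incorrectly."""
    rng = random.Random(seed)
    toward_other, swaps = [], []
    for _ in range(n_trials):
        a = [rng.gauss(0, sd), rng.gauss(gap, sd)]
        b = [rng.gauss(0, sd), rng.gauss(gap, sd)]
        # did region A's estimate of stimulus 1 err past the midpoint,
        # i.e., in the direction of stimulus 2?
        toward_other.append(a[0] > gap / 2)
        # matching swaps when the crossed pairing has smaller total distance
        straight = (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
        crossed = (a[0] - b[1]) ** 2 + (a[1] - b[0]) ** 2
        swaps.append(crossed < straight)
    p_swap_given_toward = (sum(s for s, t in zip(swaps, toward_other) if t)
                           / max(1, sum(toward_other)))
    p_swap = sum(swaps) / n_trials
    return p_swap_given_toward, p_swap

p_cond, p_uncond = simulate_prediction()
print(round(p_cond, 2), round(p_uncond, 2))
```

Under these assumptions, the swap rate conditioned on a decoder error toward the other stimulus exceeds the unconditional swap rate, which is exactly the correlation between decoding errors and behavioral assignment errors that the quoted prediction describes.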
We have also been more specific about regions expected to have the assigned representations we describe, in both the multisensory and visual contexts, (line 475) Thus, for a multi-sensory classification task, we expect to find these bound representations in multi-sensory association areas (such as posterior parietal cortex [4,5], ventrolateral prefrontal cortex [6], and the superior temporal sulcus [7]; also see [8]). For complex visual decision-making tasks where integration across the dorsal and ventral visual streams is required, we expect to find bound representations in brain regions receiving input from both streams (such as prefrontal and posterior parietal cortex [9,10]) as well as regions within each stream that receive information from the other stream. For instance, recent work has shown combined representations of visual form and motion in the inferotemporal cortex [11], though their geometry was not characterized (i.e., it is unknown whether the representations are bound). Similar integrated representations of visual form and motion have also been found in the superior temporal polysensory area [12] (and in the middle [13] and body-selective patches within [14] the superior temporal sulcus).
What about RF codes? Can you say anything about (for example) how the width of RFs in real data compare to what is 'optimal' or predicted by the model? This would probably require defining a behavioral task and feature space in more concrete terms, but some intuitive/qualitative attempt at this could be valuable and increase the interest level in the field.
While we do not feel comfortable making a specific prediction related to RF width (due to its dependence on many factors outside of assignment), we have emphasized our prediction about the conjunctive nature of the potentially RF-like codes underlying assigned representations, (line 472) Following previous work [1][2][3] and our neural network results, we believe that these bound representations will be instantiated through a conjunctive population geometry for the bound features -similar to the multi-dimensional receptive fields used above (though the representation need not have exactly this form).
I gather that the manuscript overall is written intentionally to be as general as possible, but in the name of generality it may have missed an opportunity for more concrete biologically informed linkages/predictions that will excite experimentalists and theorists alike.
We hope that our changes to this section of the manuscript (many of which are reproduced above) have better navigated this tradeoff between generality and specificity.

Specific/Minor Comments
line 90, and throughout -Why use MSE instead of variance (more common in the field?), given that the estimates are assumed to be unbiased?
We have changed over to local variance instead of local MSE.

This was due to not having enough samples for that error rate. We have re-generated the panel with more samples.
line 244 -what does Fisher Info have to do with metabolic cost?
We have clarified this line, (line 260) To see this, we begin by deriving the average Fisher information of a random RF code with a particular number of neurons and a particular average level of activity across the whole population (i.e., a budget of metabolic energy to be used for spiking)...

line 592 - "there is evidence" - can you cite some here?
We have added citations to this statement -specifically to some of the work on the two primate ventral and dorsal visual streams.
Several figures are described as comparing theory (analytical calculations?) with simulations, but I don't recall seeing where it is explained how the simulated data are generated. Do they come from a generative process described by the equations, and if so, isn't it guaranteed that they would match the 'theory' in the limit of large Nsims?
We have added sections to the methods explaining how the assignment error and RF code error simulations are done. If our analytical calculations are correct and our simplifications are appropriate, then there should be perfect agreement between the lines (and indeed this is more or less what we find). Since we do make several simplifications in the course of the analytical calculations, though, we thought it was helpful to include the simulations to reassure the reader (and ourselves).
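For readers who want the flavor of these Monte Carlo simulations, here is a stripped-down sketch. It is our own illustration under simplified assumptions (one-dimensional stimuli drawn uniformly, equal Gaussian local noise in each region, brute-force matching) and is not the manuscript's actual simulation code; the function name assignment_error_rate and parameter values are hypothetical:

```python
import itertools
import random

def assignment_error_rate(n_stims, noise_sd, n_sims=2000, seed=1):
    """Monte Carlo estimate of the probability that minimum
    squared-distance matching of two noisy copies of the same stimulus
    set recovers the wrong one-to-one mapping."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_sims):
        true = [rng.random() for _ in range(n_stims)]  # uniform stimulus space
        a = [t + rng.gauss(0, noise_sd) for t in true]
        b = [t + rng.gauss(0, noise_sd) for t in true]
        best = min(itertools.permutations(range(n_stims)),
                   key=lambda p: sum((a[i] - b[p[i]]) ** 2
                                     for i in range(n_stims)))
        if best != tuple(range(n_stims)):
            errors += 1
    return errors / n_sims

# the estimated error rate should grow with the local noise level
low = assignment_error_rate(3, 0.01)
high = assignment_error_rate(3, 0.2)
print(low, high)
```

As the theory predicts, the estimated assignment error rate increases with the local noise level (and, if swept, with the number of stimuli); comparing such estimates against the analytical expressions is the kind of agreement check the new methods sections describe.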