Nonlinear visuoauditory integration in the mouse superior colliculus

Sensory information from different modalities is processed in parallel and then integrated in associative brain areas to improve object identification and the interpretation of sensory experiences. The superior colliculus (SC) is a midbrain structure that plays a critical role in integrating visual, auditory, and somatosensory input to assess saliency and promote action. Although the responses of individual SC neurons to visuoauditory stimuli have been characterized, little is known about the spatial and temporal dynamics of this integration at the population level. Here we recorded the responses of SC neurons to spatially restricted visual and auditory stimuli using large-scale electrophysiology. We then created a general, population-level model that explains the spatial, temporal, and intensity requirements that stimuli must meet for sensory integration to occur. We found that the mouse SC contains topographically organized visual and auditory neurons that exhibit nonlinear multisensory integration. We show that this nonlinear integration depends on the properties of the auditory, but not the visual, stimuli. We also find that a heuristically derived nonlinear modulation function reveals conditions required for sensory integration that are consistent with previously proposed principles of sensory integration, such as spatial matching and inverse effectiveness.


Introduction
Integrating stimuli from different modalities that occur at the same location and time improves our ability to identify and respond to external events [1]. The superior colliculus (SC, homologous to the optic tectum in lower vertebrates) contains a topographic representation of visual space in its superficial layers (sSC) that is aligned with maps of auditory space in its deeper layers (dSC), and it has been used as a model to elucidate general principles of how inputs from visual and auditory sources are integrated [1][2][3]. For example, the barn owl optic tectum has served as a model for elucidating the mechanisms used during development to align the visual and auditory spatial maps [4,5], as well as for determining the circuitry used to modulate visuoauditory responses [6,7]. Studies primarily using anesthetized cats have shown that the mammalian dSC also contains neurons that respond to both visual and auditory stimuli (visuoauditory neurons) [8]. Some of these neurons exhibit superlinear multisensory enhancement, a type of multisensory integration (MSI) whereby a neuron's response to a combination of stimuli is greater than the linear sum of the individual unimodal responses [8][9][10][11][12][13][14]. These studies, performed at the single-neuron level, have identified three features of the stimuli that are required to induce superlinear multisensory enhancement: 1) 'spatial matching', whereby both stimuli must come from the same direction in the sensory field; 2) 'temporal matching', whereby the visual and auditory stimuli must overlap in time; and 3) 'inverse effectiveness', whereby multisensory enhancement weakens as the strength of the stimuli increases [1,8].
While studies based on the owl tectum and cat SC have many advantages, these models are currently limited in their ability to manipulate genes and neuronal activity, both of which can be used to determine the underlying circuitry that promotes sensory integration and to elucidate the mechanisms by which it develops. The mouse is well suited for such analysis, but it was not known whether the mouse SC contains visuoauditory neurons that exhibit MSI. Here we used high-density silicon probe recordings in awake mice presented with simultaneous visual and auditory stimuli to characterize the MSI properties of mouse SC neurons. We found that the receptive fields (RFs) of visual/auditory bimodal neurons in the dSC overlap and are organized into a topographic map of azimuthal space, with some bimodal neurons exhibiting MSI. We also found that the auditory responses of these neurons have early and late temporal components, with only the late component being non-monotonic with respect to stimulus intensity. To determine the stimuli needed to invoke MSI and the properties of integration, we created a population-level model of the visuoauditory responses of simultaneously recorded individual neurons in which nonlinearity is introduced through a single modulation function shared by all the neurons in a dataset. The resulting modulation function quantifies the conditions needed for MSI and recapitulates both the spatial matching requirements of the stimuli needed to invoke MSI and the inverse effectiveness principle.

The receptive fields of visual/auditory neurons in the deep SC are aligned and form a topographic map of space
To determine how visual and auditory information is organized in the mouse SC, we used silicon probe electrophysiology to record the spiking activity of SC neurons in awake, behaving CBA/CaJ mice in response to visual and auditory spot stimuli, and then determined their spatial RFs for each stimulus (Fig 1A and 1B). We recorded a total of 1281 neurons from 8 mice. The spatial distribution and the fraction of cells that are responsive to visual stimuli, auditory stimuli, or both are shown in Fig 1C and 1D. Consistent with what is known about the location of visual and auditory neurons in the mouse SC [15][16][17][18][19], we find that visually responsive neurons are located in both the superficial and deep SC, while auditory responsive neurons are found only in the deep (≳ 400 μm) SC. Many bimodal neurons have overlapping spatial RFs. Fig 1E-1G shows spatially matching visual and auditory RFs of an example neuron. When we plot RF azimuth vs. SC anteroposterior (A-P) location, we find that both visually and auditorily responsive neurons exhibit a strong correlation between their physical location and their RF location; the visual map (Fig 2A) has a slope of (77 ± 3)˚/mm and an offset of (-1.6 ± 1.9)˚, while the auditory map (Fig 2B) has a slope of (73 ± 8)˚/mm and an offset of (10 ± 5)˚. This demonstrates that mice, like other mammals, have aligned visual and auditory maps of azimuthal space in the SC [4,18,19].
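The slope and offset of such a topographic map can be recovered with an ordinary least-squares fit of RF azimuth against A-P position. The sketch below uses entirely synthetic neurons with illustrative noise (not the recorded data) to show the idea:

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + offset."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    offset = my - slope * mx
    return slope, offset

# Synthetic "neurons": A-P position in mm, with RF azimuths following the
# reported visual map (77 deg/mm slope, -1.6 deg offset) plus jitter.
random.seed(0)
ap = [i * 0.05 for i in range(40)]          # 0 to ~2 mm along the SC
azim = [77.0 * x - 1.6 + random.gauss(0, 3) for x in ap]
slope, offset = fit_line(ap, azim)
```

With noise of a few degrees, the recovered slope lands close to the generating 77˚/mm, mirroring how the map parameters in Fig 2A and 2B are estimated from scattered (position, azimuth) pairs.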
When we compared the visual and auditory RFs of the 62 bimodal neurons identified in our recordings, we found that (95 ± 3)% (59) of them have overlapping visual and auditory RFs along the azimuth if the RF boundary is defined as 1σ (Fig 2D). For (76 ± 5)% (47) of these neurons, the auditory RF is larger than the visual RF. The correlation coefficient between the visual RF centers and auditory RF centers of all visuoauditory neurons is 0.77 (Fig 2C). Bimodal neurons appear only in the dSC, where most neurons have relatively large visual RFs (≳ 10˚), and their visual/auditory RF azimuthal extent does not correlate with the depth of the cells (Fig 2E and 2F). We did not find properties of bimodal neurons that depend on their location along the depth of the SC. These results demonstrate that the mouse dSC contains neurons that receive both visual and auditory information from the same azimuthal direction, and that these neurons are organized as a topographic map along the A-P axis of the dSC.

Unimodal visual and early auditory SC responses are monotonic; late auditory SC responses can be non-monotonic
As a first step toward understanding how visual and auditory responses interact, we presented stimuli of different contrasts or sound levels and analyzed the temporal structure and intensity tuning of the unimodal visual and auditory responses (Fig 3A and 3B). Similar to what has been reported in the ferret [20], we found that auditory SC responses can be temporally segregated into early (< 20 ms) and late (> 20 ms) components (Table 1). When we measured the intensity tuning curves of the early and late responses, we found that all of the visual responses (100% (178 / 178)) and nearly all ((99.7 ± 0.3)% (357 / 358)) of the early auditory responses are monotonic, meaning the firing rate of the neuron increases with increasing stimulus intensity (Fig 3D, 3G, 3E and 3H). However, the late auditory responses of some neurons ((11.4 ± 1.9)% (33 / 290)) were non-monotonic (Fig 3F and 3I). Here, we call a response 'non-monotonic' when any of the non-maximum-strength stimuli elicits a significantly (p < 0.001, ANOVA) higher firing rate than the maximum-strength stimulus (non-monotonic cells are marked in red in Fig 3G-3I), and 'monotonic' otherwise (the latter may not satisfy a strict mathematical condition of monotonicity; our focus here is to highlight the clearly non-monotonic responses of a small set of neurons). The temporal profiles of the visual and auditory responses demonstrate that auditory signals arrive at the SC earlier than visual signals; therefore, if the visual and auditory inputs are detected by the retina and ears at the same time, the SC circuit may already be conditioned (modulated) by the early auditory processing (either in the SC or elsewhere) before the visual input arrives (S1 Fig). This modulation could also explain the non-monotonicity of the late auditory responses.
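The monotonic/non-monotonic classification can be sketched as follows. For simplicity, this toy substitutes a two-sample z-test on each non-maximum intensity versus the maximum intensity for the ANOVA actually used in the study, and the trial data are made up:

```python
import math

def is_non_monotonic(trials_by_intensity, alpha=0.001):
    """Flag a tuning curve as non-monotonic when some non-maximum-strength
    stimulus drives a significantly higher rate than the strongest stimulus.
    Simplified two-sample z-test (the study used an ANOVA at p < 0.001)."""
    def mean_sem(xs):
        n = len(xs)
        m = sum(xs) / n
        var = sum((x - m) ** 2 for x in xs) / (n - 1)
        return m, math.sqrt(var / n)

    stats = [mean_sem(t) for t in trials_by_intensity]
    m_max, s_max = stats[-1]              # last entry = max-intensity stimulus
    for m, s in stats[:-1]:
        z = (m - m_max) / math.sqrt(s ** 2 + s_max ** 2)
        p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided normal tail
        if p < alpha:
            return True
    return False

# Monotonic cell: trial firing rates keep rising with intensity.
mono = [[2, 3, 2, 3], [5, 6, 5, 6], [9, 10, 9, 10], [14, 15, 14, 15]]
# Non-monotonic cell: a mid intensity drives the strongest response.
nonmono = [[2, 3, 2, 3], [20, 21, 20, 21, 20, 21],
           [9, 10, 9, 10], [4, 5, 4, 5, 4, 5]]
```

Note that, as in the text, a cell passing neither test is called 'monotonic' even if its tuning curve is not strictly increasing.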

Modeling the multisensory responses reveals an audition-dependent modulation structure
The above results are consistent with the hypothesis that auditory signals processed in the first 20 ms influence nonlinear processing that occurs later (> 20 ms). If this hypothesis is correct, nonlinear multisensory processing should depend on the properties of the auditory, but not the visual, stimuli. Testing this hypothesis requires detailed analysis of the response in a high-dimensional stimulus space; simple approaches, such as single post-stimulus time histogram analysis or significance tests that rely on responses to single patterns, are not effective. Therefore, to determine how the early and late responses interact, we used a maximum-likelihood fit analysis of models that include a simple sigmoidal response plus an auditory or visual stimulation-dependent modulation term, and asked which model better fits the data. Briefly, the model illustrated in Fig 4 (see Materials and methods) consists of a baseline firing rate (b) plus sigmoidal visual and auditory responses (S), multiplied by a nonlinearity term (modulation coefficient: α^δ) that depends on the properties of the auditory stimulus. With this formulation, the nonlinear multisensory interaction is summarized in α^δ, and its effect on the response (R) can be parameterized from multiple stimulus patterns. The global modulation function δ is defined for each mouse and applied to all the neurons in that mouse; in this way, it captures the spatial extent of the modulation in the SC circuitry (see Materials and methods for details). Fig 5A-5E summarizes the responses of a representative neuron to different levels of visual and auditory stimuli coming from a specific azimuth. The firing rates in the late integration window (Fig 5C) are compared with the linear sum of the unisensory responses (Fig 5D).
For this neuron, the multisensory firing rates are higher than the sum of the unisensory responses when the auditory stimulus is weak (intensities 1 and 2, corresponding to 20 and 30 dB), exhibiting nonlinear multisensory enhancement (MSE), and lower when the auditory stimulus is strong (intensities 3 and 4, corresponding to 40 and 50 dB), exhibiting nonlinear multisensory suppression (MSS). The model recapitulates both MSE and MSS, including their dependence on auditory intensity (Fig 5G-5I).
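The arithmetic behind MSE and MSS in this formulation can be illustrated with a toy calculation: a single multiplicative term α^δ pushes the response above the linear sum when the modulation exponent δ is positive and below it when δ is negative. All numbers here are hypothetical:

```python
def multisensory_rate(b, s_v, s_a, alpha, delta):
    """Modeled rate: baseline plus unisensory sigmoidal responses,
    scaled by the nonlinearity term alpha**delta (with alpha > 1)."""
    return b + (alpha ** delta) * (s_v + s_a)

b, s_v, alpha = 2.0, 6.0, 3.0
linear_weak = b + s_v + 4.0         # linear prediction, weak sound (s_a = 4)
linear_strong = b + s_v + 10.0      # linear prediction, strong sound (s_a = 10)
# Weak sound: modulation exponent positive -> superlinear enhancement (MSE).
mse = multisensory_rate(b, s_v, 4.0, alpha, delta=+0.5)
# Strong sound: modulation exponent negative -> sublinear suppression (MSS).
mss = multisensory_rate(b, s_v, 10.0, alpha, delta=-0.5)
```

Because δ itself depends on the auditory stimulus (intensity and azimuth), the same neuron can show enhancement for weak sounds and suppression for strong ones, as in Fig 5.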
To test the hypothesis that nonlinear multisensory processing depends on the properties of the auditory stimuli, we modeled the nonlinearity term as a function of the auditory stimulus properties (audition-dependent modulation; AM-model), the visual stimulus properties (vision-dependent modulation; VM-model), or fixed it to 1 (linearly added monotonic sigmoids; L-model), and tested how well each model describes the data. To quantify the goodness-of-fit of these models, we used the deviance per degree of freedom (D/DF; this value becomes asymptotically close to the reduced χ² in the limit of a large data sample (Wilks' theorem); a D/DF = 1 is a perfect fit; the outlier deviance, D/DF − 1, quantifies the data that the model does not describe; see Materials and methods). We calculated the D/DF of each model and summarize the results in Fig 6A. Taking the L-model as the baseline, the AM-model exhibits a dramatic improvement of the outlier deviance (0.72 to 0.173), while the VM-model does not (0.72 to 0.66). This confirms the hypothesis that the nonlinear processing of multisensory responses is better described by modulation that depends on auditory, rather than visual, stimulus properties.
The modulation function of the model provides a quantitative description of the multisensory integration response properties for all neurons in the dataset under a wide variety of auditory stimuli. Fig 6B and 6C shows the average shape of the modulation function from the 8 datasets (those for each dataset are shown in S3 Fig), which has two important features. First, it has a bounded spatial distribution with a positive center and a negative surround. Second, more intense auditory stimuli cause stronger inhibition than less intense auditory stimuli. These two properties result in the modulation function being positive only near the location of the auditory stimulation, and only if the auditory stimulus is weak. The first property is known as spatial alignment of different sources (or surround suppression in other contexts [21,22]); the second is known as the principle of inverse effectiveness [1]. These two previously described phenomena are now quantified with a single modulation function, and our model-based analysis finds them heuristically, without prior knowledge of the system.
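A minimal sketch of such a modulation function, assuming a narrow excitatory Gaussian minus a broader inhibitory Gaussian whose amplitude grows with sound level (the amplitudes and widths below are illustrative, not the fitted values):

```python
import math

def gaussian(x, mu, sigma):
    """Unit-height Gaussian."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

def modulation(phi_a, intensity, phi_eff=0.0):
    """Toy center-surround modulation: a narrow excitatory Gaussian minus a
    wider inhibitory one whose amplitude grows with auditory intensity."""
    a_exc = 1.0
    a_inh = 0.3 * intensity          # stronger sound -> stronger inhibition
    return (a_exc * gaussian(phi_a, phi_eff, 15.0)
            - a_inh * gaussian(phi_a, phi_eff, 45.0))

weak_center = modulation(0.0, 1)     # weak sound at the RF center
strong_center = modulation(0.0, 4)   # strong sound at the RF center
weak_surround = modulation(60.0, 1)  # weak sound far from the RF center
```

With these illustrative parameters, the function is positive only for a weak sound near the RF center and negative otherwise, reproducing the spatial-alignment and inverse-effectiveness signatures described above.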

The AM-model improves the identification of neurons with multisensory processing
The utility of modeling is that it allows us to compress data from a high-dimensional stimulus space into a low-dimensional model parameter space. This overcomes the difficulty of analyzing, and testing the significance of, the neurons' response properties across the large number (168) of visual and auditory stimulus patterns used to identify those that exhibit multisensory interactions (see Materials and methods). For example, we used four variables (the locations and intensities of the visual and auditory stimuli) to construct our stimuli (Fig 4), creating a four-dimensional space and 168 different stimulus patterns. The resulting data are not only difficult to visualize but also suffer from the statistical issues of multiple comparisons. By using the AM-model to identify neurons that exhibit multisensory interactions, we find that 47% (190 / 403) of the modeled neurons exhibit a significant modulation weight (α > 1 with p < 0.01), including 17% (70 / 403) of the neurons with superlinear addition (α^δ > 1 with p < 0.01; example: Figs 4, 7C and 7D) and 13% (53 / 403) with sublinear addition (α^δ < 1 with p < 0.01; example: Fig 7E and 7F) when the modulation coefficient (α^δ) is maximized. The distribution of the maximum α^δ for significantly modulated neurons is shown in Fig 7B. Finding neurons that exhibit superlinear multisensory enhancement is difficult without a well-fit model; naively searching for superlinear multisensory enhancement would require a significance test for all 144 combinations of the visuoauditory stimuli, and would thus suffer from a multiple comparison correction. Indeed, we found that only 4 neurons passed this significance test (p < 0.01 / 144). These results demonstrate the difficulty of detecting multisensory enhancement using generic stimuli, and show how the model-based analysis overcomes this difficulty by connecting the population-level description of the modulation to an individual neuron's sensory integration.

Discussion
In this study, we present the first evidence of nonlinear multisensory integration (both enhancement and suppression) in the mouse SC, and demonstrate that the nonlinearity of the response depends on early auditory processing that modulates later visual/auditory responses. To achieve this, we used large-scale recording and a model-based analysis to quantify and visualize two principles of sensory integration that have been previously proposed based on studies of individual neurons: the spatial alignment of RFs and the principle of inverse effectiveness [1].

Topographic organization of visual, auditory, and bimodal neurons in the SC
The organization of sensory inputs into topographic maps, and the subsequent integration of these maps in associative areas of the brain, is a fundamental feature of neural processing. We confirm results from previous studies showing that the visual map of azimuthal space in the sSC is aligned with the auditory map of space in the dSC [17][18][19], and we extend this finding by showing that most visuoauditory bimodal neurons in the mouse SC have overlapping visual and auditory spatial RFs that themselves form a topographic map of space. This supports the model that topographic organization is used to align spatial visual and spatial auditory information, perhaps to facilitate the integration of these two modalities.

Multisensory enhancement and inhibition in the mouse SC
We find that the mouse SC contains a significant percentage of stimulus-responsive neurons that exhibit superlinear multisensory enhancement (~17%) and sublinear addition (~13%; Fig 7B). A similar distribution of nonlinearity has been reported in the barn owl tectum [13]. In some previous studies, the responses of single neurons were characterized with tailored stimuli (meaning that the time delay between the visual and auditory stimuli was adjusted for each neuron) to increase the yield of multisensory integration. Instead, we performed large-scale recordings that presented the same stimuli to all neurons, as happens in the real world. This approach requires many different stimulus patterns, and to prevent the data from being sacrificed to a multiple comparison correction, we used a model-based approach. As large-scale recording methods become more common [23][24][25], we believe that similar model-based approaches should become the norm, rather than using an ad-hoc index (such as the multisensory index [26] or the orientation selectivity index [16,27]) for each application.

Early auditory processing modulates the visuoauditory response
We present multiple results suggesting that early auditory processing induces visuoauditory integration in the SC: (1) only auditory responses arise in the early (< 20 ms) time window; (2) in the late (> 20 ms) time window, some auditory responses are non-monotonic, while visual responses are monotonic; and (3) the nonlinearity of the multisensory integration depends on the properties of the auditory stimuli, but not the visual stimuli. This result is consistent with a study of goldfish Mauthner cells in which the level of multisensory integration depends more strongly on the auditory stimulus than on the visual stimulus [28]. The Mauthner cells receive inputs from the optic tectum and are associated with the escape reflex, in which the SC is also involved. However, we did not identify the source of this modulation in this study. Potential sources include local circuitry within the SC [21,29], a pathway with a longer latency such as a cortical pathway [30], and/or reciprocal connections with other brain areas such as the parabigeminal nucleus [31][32][33].

Conclusion
The auditory and visual RFs of the neurons in the mouse SC are overlapping and organized as a spatial topographic map and visuoauditory neurons exhibit nonlinear MSI. A model that extends a simple sigmoidal response with an audition-dependent modulation function can explain 76% of the outlier deviance of the simple sigmoidal model. The resulting modulation function is consistent with two important principles of multisensory integration: spatial matching and inverse effectiveness. These results open the door to further studies designed to determine the circuitry underlying multisensory integration, and the mechanisms used for its development.

Ethics statement
The care of animals in this study was carried out in accordance with the University of California, Santa Cruz Institutional Animal Care and Use Committee protocol number Feld2011.

Animal preparation for electrophysiology
Animal preparation, head plate attachment, and craniotomy were performed as previously described [17]. We used 2- to 9-month-old CBA/CaJ (The Jackson Laboratory, 000654) mice of both sexes. One to four days before the recording, a custom-made titanium head plate was attached to the mouse's skull. On the day of the recording, the mouse was anesthetized with isoflurane (3% induction, 1.5-2% maintenance; in 100% oxygen) and a craniotomy (~1.5-2 mm diameter) was made in the left hemisphere above the SC (0.6 mm lateral from the midline, on the lambdoid suture). A 256-channel silicon probe (provided by Prof. Masmanidis [25,34]) was inserted through the cortex into the SC with its four shanks aligned along the A-P axis (Fig 1A). The probe was lowered until robust multi-unit visual responses from the superficial SC reached the top of the active region of the most posterior shank. The probe was then retracted by 120-200 μm to reduce drift during the recording. The mice were euthanized after the recording session.
During the recording, the mouse was allowed to run freely on a cylindrical treadmill made of polystyrene foam. The movement of the cylinder was recorded by a shaft encoder (US Digital, H5-100-NE-S).

Stimulation
Visual stimulation. To measure visual receptive fields (RFs), a 10˚-diameter flashing circular spot on a 16 × 7 grid with 10˚ spacing [16] was presented on a computer monitor (Samsung S29E790C; 67.3 cm × 23.4 cm active display size; refresh rate: 60 Hz; gamma corrected) placed 25 cm from the animal. The stimuli were presented using a custom Matlab program that uses the Psychophysics Toolbox extensions [35][36][37][38]. The 500-ms flashes were either ON (white) or OFF (black) on a gray background at mean luminance (32 cd/m²), and a 500-ms gray screen was inserted after each stimulus presentation. The stimuli were presented in a random order of contrast and location. Every pattern was repeated 12 times. To estimate the RF parameters, a 2-dimensional Gaussian was fit to the collected neural responses (unbinned maximum-likelihood estimation [16]).
To collect the local field potentials (LFPs) used to estimate the surface location of the SC [16], a contrast-alternating checkerboard stimulus (0.04 cpd square wave alternating at 0.5 Hz) was presented.
Auditory stimulation. Virtual auditory space stimulation for mice was done in an anechoic chamber as previously described [17]. Briefly, sound is produced through earphones that are placed near the mouse's pinnae. The incident direction of the sound was controlled using filters derived from the head-related transfer functions of the mouse.
To measure the auditory receptive fields of the neurons, a full-field white-noise burst was presented from each virtual location in a grid of 5 elevations (0˚ to 80˚ in 20˚ steps) and 17 azimuths (-144˚ to 144˚ in 18˚ steps), totaling 85 points in two-dimensional directional space. A maximum-likelihood fit of the Kent distribution was used to determine the spatial properties of the auditory receptive fields [17]. Because the speakers for auditory stimulation blocked the part of the visual field where the azimuth is larger than 90˚, we excluded neurons with an RF azimuth larger than 90˚ (either visual or auditory) from the analysis.
Multisensory stimulation. To create a multisensory stimulation system, the above visual and auditory stimulation setups were combined, with the visual stimulation computer acting as the master computer for synchronization. To synchronize the timing of the stimuli, the visual stimulation computer sends a transistor-transistor logic (TTL) pulse to the auditory stimulation computer, and the auditory stimulation computer sends a TTL pulse to the data acquisition system (RHD2000, Intan Technologies). Because the entire system did not fit in our anechoic chamber, this experiment was done without an anechoic chamber after verifying that the auditory RFs of the SC neurons are not significantly altered by removing the anechoic chamber.
Visuoauditory spot stimulation was used to measure the properties of multisensory processing. The duration of both the visual and auditory stimuli was 100 ms. The auditory stimulus was delivered ~13 ms after the visual stimulus, mimicking the time delay due to the slow propagation speed of sound in air (343 m/s, compared to the speed of light). This delay is consistent with a simultaneous visual and auditory event occurring 4-5 meters from the animal. Temporal responses of neurons are calculated relative to the onset of the auditory stimulation. The intensities of the black spots were (0, -0.29, -0.50, -0.65, -0.75) in two of the experiments and (0, -0.1, -0.2, -0.4, -0.8) in the other 6 experiments, measured as a fractional change from the gray background (for example, -0.1 is a 10% reduction of the brightness level, or 28.8 cd/m² compared to the 32 cd/m² background). The intensities of the auditory stimuli (white-noise bursts) were 20, 30, 40, and 50 dB SPL (within 5-80 kHz), or no sound. The stimuli were presented at three locations, 30 degrees apart, that match the visual RF positions of 3 of the silicon probe shanks. We presented black spots because black spots excite more SC neurons than white spots [39]. The total number of patterns presented to the animal is 5 (visual intensities) × 5 (auditory intensities) × 3 (visual locations) × 3 (auditory locations) = 225. These patterns were presented in random order in a 2-s period and repeated 30 times. When either no visual or no auditory stimulus is given, the stimulus becomes unimodal and is used for the intensity tuning analysis described in the section "Unimodal visual and early auditory SC responses are monotonic; late auditory SC responses can be non-monotonic".
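The stimulus bookkeeping can be checked with a short script: collapsing the location variable whenever the corresponding intensity is zero reduces the 225 presented combinations to 168 distinct non-blank patterns, and the ~13 ms delay matches a sound source roughly 4.5 m away:

```python
# Full stimulus grid: 5 visual x 5 auditory intensity levels (0 = off)
# at 3 visual x 3 auditory locations.
patterns = {(iv, ia, lv if iv else None, la if ia else None)
            for iv in range(5) for ia in range(5)
            for lv in range(3) for la in range(3)}
# Location is irrelevant when the corresponding intensity is zero, so the
# 225 presented combinations collapse to 169 distinct patterns, of which
# 168 contain at least one stimulus (the remaining one is fully blank).
n_presented = 5 * 5 * 3 * 3
n_distinct = len(patterns)
n_nonblank = len([p for p in patterns if p[0] or p[1]])

# Sound delay matching a source ~4.5 m away (speed of sound ~343 m/s).
delay_ms = 4.5 / 343 * 1000
```

The 4.5 m figure is chosen to sit in the 4-5 m range quoted in the text; the pattern count of 168 is the same one used in the degrees-of-freedom calculation in Materials and methods.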

Data analysis
Blind analysis. We used blind analysis, an effective method for reducing false-positive reporting of results [40,41]. Of the 8 recording datasets, 4 were used for exploratory purposes and the other 4 were blinded. Once the features and parameters to be analyzed were decided, the same analysis was performed on the blinded datasets to confirm that the results agreed with those of the exploratory datasets. All of the results reported in the present study passed a significance test in both the exploratory and the blinded datasets. The combined dataset was used to derive the p-values. Of the individual datasets shown in S3 Fig, data A-D are exploratory datasets and data E-H are blinded datasets.
Spike sorting. Spikes were sorted using custom-designed software [16,17,42]. Briefly, raw analog signals were high-pass filtered (cutoff ~313 Hz, wavelet filter [43]) and thresholded for spike detection. Detected spikes were clustered by applying principal component analysis to their waveforms on the seed electrode and its surrounding electrodes. A mixture-of-Gaussians model was used for cluster identification. Clusters with many spikes in the refractory period, as well as duplicated clusters for the same neuron (determined by isolation distance, L-ratio, cross-correlation analysis, and similarity of the waveform shapes), were excluded from the analysis. Neurons with axonal waveforms were identified by principal component analysis and excluded from the analysis [17].
Estimation of the relative position of the neurons across datasets. To make a map of visual/auditory space across datasets from multiple mice, the relative physical locations of the SC neurons were estimated using the visual RFs. The visual RFs on multiple shanks were used to estimate the A-P position where the visual RF azimuth is 0˚, and this point was defined as the zero point of the A-P SC position. The height of the SC surface was estimated from the LFPs in response to a checkerboard visual stimulus [16]. Assuming that the insertion angle of the probe into the SC was 25˚ [17], the A-P positions and depths of the neurons across multiple datasets were calculated and superimposed. Only neurons with a positive A-P position were analyzed, in order to exclude neurons outside the SC.
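Under the stated 25˚ insertion angle, a displacement along the probe shank projects onto an A-P offset and a depth by simple trigonometry. The helper below is a simplified illustration of this geometry; the function name and the assumption that the tilt lies entirely in the sagittal plane are ours, not the authors':

```python
import math

def probe_to_sc_coords(distance_um, insertion_angle_deg=25.0):
    """Project a displacement along the probe shank (in micrometers) into an
    A-P offset and a depth, assuming the probe is tilted
    insertion_angle_deg from vertical in the sagittal plane."""
    theta = math.radians(insertion_angle_deg)
    ap_um = distance_um * math.sin(theta)     # horizontal (A-P) component
    depth_um = distance_um * math.cos(theta)  # vertical (depth) component
    return ap_um, depth_um

# 400 um along the shank -> ~169 um of A-P offset and ~363 um of depth.
ap, depth = probe_to_sc_coords(400.0)
```
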
Model of multisensory responses. Our model comprises commonly used sigmoidal intensity tuning curves and an auditory-dependent modulation term derived from the analysis of the multisensory responses (AM-model, see Results). We formulated our multisensory model as follows.
For each neuron, the response R to a visuoauditory stimulus is determined by the following equations:

R(ϕ_V, I_V, ϕ_A, I_A) = b + α^δ [S_V(ϕ_V, I_V) + S_A(ϕ_A, I_A)]

δ = A_Exc(I_A) G(ϕ_A; ϕ_eff, σ_Exc) − A_Inh(I_A) G(ϕ_A; ϕ_eff, σ_Inh)

Here R is the resulting firing rate of a neuron when a visual stimulus is presented at azimuth ϕ_V with intensity I_V and an auditory stimulus is presented at azimuth ϕ_A with intensity I_A; S_V and S_A are the sigmoidal unisensory visual and auditory responses; G(x; μ, σ) is a Gaussian with a unit height, mean μ, and standard deviation σ; T ∈ {Exc (excitatory), Inh (inhibitory)} indexes the sign of the modulation, with amplitude A_T; ϕ_eff is the neuron's effective RF azimuth (we used this instead of the measured RF azimuth because some neurons do not have a measured RF azimuth); and σ_T is the width of the modulation effect along the azimuthal axis. The baseline (spontaneous) firing rate b is a fixed value measured from all the intervals between the stimuli and is not a model parameter. The descriptions and ranges of the model parameters are summarized in Table 2. Parameters labelled "Individual" apply independently to each neuron; "Global" parameters apply to all the neurons in a given dataset.
To test the model without modulation (L-model), we set α = 1. With this setting, the number of global parameters drops from 8 to 0, and the number of individual parameters drops from 12 to 10 (α and ϕ_eff become irrelevant). To test vision-dependent modulation instead of audition-dependent modulation (VM-model), we changed the exponent of the modulation term from δ(ϕ_A, I_A) to δ(ϕ_V, I_V), i.e., the same functional form evaluated on the visual stimulus location and intensity.
Although the total number of stimulus patterns is 225, as noted above, some of the patterns are redundant: when either the visual or the auditory stimulus intensity is zero, its location becomes an irrelevant variable. Thus, the number of non-redundant stimulus patterns is 144 (4 × 4 × 3 × 3; multisensory stimuli) + 12 (4 × 3; no visual stimulus) + 12 (4 × 3; no auditory stimulus) = 168. Therefore, the number of degrees of freedom is 168N − (12N + 8) = 156N − 8, where N is the number of neurons included in the analysis. As the modulation function is not relevant for the L-model, its number of degrees of freedom is 168N − 10N = 158N.
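These degree-of-freedom counts are easy to tabulate; a small helper (hypothetical, simply mirroring the formulas above):

```python
def degrees_of_freedom(n_neurons, model="AM"):
    """Fit degrees of freedom: data points minus free parameters.
    168 non-redundant patterns per neuron; the AM-model has 12 individual
    parameters per neuron plus 8 global ones, the L-model has 10 per neuron
    and no global parameters."""
    n_points = 168 * n_neurons
    if model == "AM":
        return n_points - (12 * n_neurons + 8)   # 156 * N - 8
    if model == "L":
        return n_points - 10 * n_neurons         # 158 * N
    raise ValueError(model)
```
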
Fitting procedure of the multisensory model. A maximum-likelihood method was used to fit our model to the data. The likelihood function-deviance (D: twice the negative log-likelihood (NLL) ratio)-for one neuron is defined based on quasi-Poisson statistics [17,44,45] as follows.
D = 2 Σ_V (1/θ(V)) [R_obs(V) ln(R_obs(V)/R_model(V)) − (R_obs(V) − R_model(V))]

where V represents the four stimulus variables {ϕ_V, ϕ_A, I_V, I_A}; θ is a dispersion factor defined as θ(V) = σ_obs²(V) / R_obs(V); R_obs is the observed mean firing rate over trials; σ_obs is the SEM of the observed firing rate; and R_model is the firing rate estimated by the model described above. The sum runs over all 168 non-redundant combinations of the stimulus patterns (see previous section). The deviance for the entire dataset is the sum of the individual deviances of all the modeled neurons.
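One standard quasi-Poisson form consistent with these definitions (a sketch, not the authors' exact code) scales each pattern's Poisson deviance by its dispersion factor:

```python
import math

def quasi_poisson_deviance(r_obs, r_model, theta):
    """Deviance summed over stimulus patterns: each pattern's Poisson
    deviance term, scaled by its dispersion factor theta = sem**2 / mean."""
    d = 0.0
    for ro, rm, th in zip(r_obs, r_model, theta):
        d += 2.0 * (ro * math.log(ro / rm) - (ro - rm)) / th
    return d

# Toy firing rates (Hz) for three stimulus patterns of one neuron.
obs = [4.0, 9.0, 16.0]
model = [5.0, 9.0, 14.0]
disp = [1.2, 1.0, 1.5]
d = quasi_poisson_deviance(obs, model, disp)
```

A perfect fit gives a deviance of exactly zero, and larger mismatches between observed and modeled rates inflate it, which is what makes D/DF a usable goodness-of-fit measure.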
Because the global parameters affect all the neurons, the fitting must be done simultaneously for all the neurons. To achieve the best accuracy of the modulation function, we included all neurons that had either visual or auditory responses in the late time window. Since there are tens of responsive neurons per dataset, hundreds of parameters must be fit simultaneously to thousands of data points. We made this challenging fit tractable by breaking it into smaller problems with fewer parameters and repeating simple function fits: when the individual parameters of a neuron are updated, the global parameters are held fixed, reducing the problem to 12 parameters for that neuron; when the 8 global parameters are updated, the individual parameters of the neurons are held fixed. These fits were alternated until the overall likelihood value converged (the exact condition is given below).
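This alternating scheme is a block-coordinate descent. The schematic below applies it to a toy quadratic objective in which each "neuron" has one individual parameter and there is one shared global offset; `fit_individual` and `fit_global` stand in for the per-block fits and are our own illustrative constructs:

```python
def alternating_fit(neurons, global_p, fit_individual, fit_global, nll,
                    tol=0.1, max_iter=100):
    """Block-coordinate descent: refit each neuron's individual parameters
    with the global parameters fixed, then refit the global parameters with
    the individual parameters fixed, until the NLL gain drops below tol."""
    prev = nll(neurons, global_p)
    for _ in range(max_iter):
        neurons = [fit_individual(n, global_p) for n in neurons]
        global_p = fit_global(neurons, global_p)
        cur = nll(neurons, global_p)
        if prev - cur < tol:
            prev = cur
            break
        prev = cur
    return neurons, global_p, prev

# Toy problem: neuron i is (x_i, target_i); the objective is minimized when
# x_i + g == target_i for every neuron.
neurons0 = [(0.0, 3.0), (0.0, -1.0), (0.0, 5.0)]
nll_q = lambda ns, g: sum((x + g - t) ** 2 for x, t in ns)
fit_ind = lambda n, g: (n[1] - g, n[1])              # exact minimizer, g fixed
fit_glob = lambda ns, g: sum(t - x for x, t in ns) / len(ns)
fitted, g, final_nll = alternating_fit(neurons0, 0.0, fit_ind, fit_glob, nll_q)
```

In the real fit, each block update is itself a numerical minimization over 12 (individual) or 8 (global) parameters rather than a closed-form solution, but the alternation logic is the same.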
The fit was performed in two phases because doing the entire fit at once was unstable. For example, if both the neurons' effective RF azimuths (ϕ_eff) and the modulation shapes are changed simultaneously, too many local minima are created and the fit cannot escape from them. By fixing the azimuths of the RF locations in the first phase, we could stabilize the resulting properties of the modulation function (and its likelihood value was consistently better). In the second phase, some of the parameter boundaries were also relaxed (see Table 2).
The detailed fitting procedure is the following.
Phase 1 (narrow parameter boundaries):
1. Decide initial visual and auditory sigmoid parameters for individual neurons using unisensory response data.
2. Randomly set the modulation parameters within their allowed ranges (except for A_T, which is set to a small value (0.1) to stabilize fitting).
3. Refit individual neurons using the full data and modulation.
4. Refit the modulation parameters using the updated individual cell parameters.
5. Repeat steps 3-4 until the change in NLL in one iteration is below the tolerance value (0.1).
6. Once the tolerance is reached, perform a detailed error analysis (MINOS [46]) to further improve the fit, and repeat steps 3-4 again.
7. Stop if the likelihood gain after the MINOS analysis is less than the tolerance value.
Phase 2 (relaxed parameter boundaries):
1. Perform the same procedure as steps 3-7 of Phase 1 with the relaxed parameter boundaries.

Table 2. Summary of the model parameters.

We repeated this fitting procedure 30 times with random initial modulation parameters and used the fit that minimized the NLL value. Once the fitting was completed, the error of each parameter was estimated from the Hessian matrix of this likelihood function. To calculate the goodness-of-fit of individual neurons, the asymptotic equivalence between χ² and two times the NLL (Wilks' theorem [47]) was assumed.

(Figure legend excerpt.) These neurons' peak early auditory response is not significantly larger than their max-intensity response (p = 0.25 for neuron 1; p = 0.13 for neuron 2); therefore, these are not considered 'non-monotonic' responses. In contrast, the peak late auditory response is significantly larger than the max-intensity response (p = 4 × 10⁻¹⁰ for neuron 1; p = 0.004 for neuron 2); therefore, these are considered 'non-monotonic' responses.