Abstract
Animals continuously combine information across sensory modalities and time, and use these combined signals to guide their behaviour. Picture a predator watching their prey sprint and screech through a field. To date, a range of multisensory algorithms have been proposed to model this process, including linear and nonlinear fusion, which combine the inputs from multiple sensory channels via either a sum or a nonlinear function. However, many multisensory algorithms treat successive observations independently, and so cannot leverage the temporal structure inherent to naturalistic stimuli. To investigate this, we introduce a novel multisensory task in which we provide the same number of task-relevant signals per trial but vary how this information is presented: from many short bursts to a few long sequences. We demonstrate that multisensory algorithms that treat different time steps as independent perform sub-optimally on this task. However, simply augmenting these algorithms to integrate across sensory channels and short temporal windows allows them to perform surprisingly well, and comparably to fully recurrent neural networks. Overall, our work highlights the benefits of fusing multisensory information across channels and time, shows that small increases in circuit/model complexity can lead to significant gains in performance, and provides a novel multisensory task for testing the relevance of this in biological systems.
Author summary
We constantly detect sensory inputs, like sights and sounds, and use combinations of these signals to guide our actions. For example, by reading someone’s lips we can better converse with them in a noisy environment. Several mathematical models have been proposed to describe this process. However, these models are “blind” to time. That is, following the example above, if we took the audio and visual signals from our friend and mixed them up over time, current models would not notice any difference, but we would find the result incomprehensible. Motivated by this, we introduce a new set of models which describe how animals could fuse sensory signals across time. Surprisingly, we find that combining signals across senses and short periods of time works as well as a more complex model.
Citation: Anil S, Goodman DFM, Ghosh M (2025) Fusing multisensory signals across channels and time. PLoS Comput Biol 21(6): e1013125. https://doi.org/10.1371/journal.pcbi.1013125
Editor: Tianming Yang, Chinese Academy of Sciences, CHINA
Received: December 20, 2024; Accepted: May 8, 2025; Published: June 6, 2025
Copyright: © 2025 Anil et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Example code for generating all tasks can be found in S1 Appendix. Full code for all tasks and models can be found at: https://github.com/swathianil/Temporal_Nonlinear_fusion.
Funding: SA is supported by the Landesgraduiertenförderung Abschlussstipendium, issued by the Graduate Funding of the Land of Baden-Württemberg (LGFG). MG is supported by the Eric and Wendy Schmidt AI in Science Postdoctoral Fellowship, a Schmidt Sciences program. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Key points
- We introduce a novel multisensory task in which we provide task-relevant evidence via bursts of varying duration, amidst a noisy background.
- Prior multisensory algorithms perform sub-optimally on this task, as they cannot leverage temporal structure.
- However, they can perform better by integrating across sensory channels and short temporal windows.
- Surprisingly, this allows for comparable performance to fully recurrent neural networks, while using less than one tenth the number of parameters.
Introduction
Picture a predator trying to track prey through a dense field. How should they approach this challenge? One solution would be to rely solely on either visual or auditory cues, such as sightings of, or screeches from, their prey. However, these unisensory strategies will be sub-optimal in many situations, like poor lighting conditions or noisy environments. Consequently, many animals combine information across their senses and base their decisions on these merged signals: a fundamental process termed multisensory integration [1, 2].
To date, numerous algorithms have been proposed to describe how animals implement this process [3]. For example, n-look algorithms suggest that observers examine the inputs from n channels, but form their “multisensory” output using only one - which could be the channel with the strongest or fastest signal [4, 5]. In contrast, fusion algorithms form their outputs by combining their inputs across sensory channels either linearly [6–9] or nonlinearly [3, 10, 11]. These algorithms can all be interpreted as instantaneous input-output mappings, or coupled to drift-diffusion models and used to explore how observers integrate multisensory evidence over time. For example, how animals determine their heading direction from visual and vestibular cues [6, 8]. However, in general, these algorithms treat successive observations independently, meaning they are unable to leverage the temporal structure inherent to naturalistic signals.
In contrast, experiments using visual [12], auditory [13] and multisensory stimuli [14–17] have demonstrated that our perception at any given moment is strongly influenced by our recent observations - a phenomenon termed serial dependence. For example, when observers are presented with a sequence of oriented Gabors and asked to report the orientation of each, their responses will be accurate - on average across the experiment - but systematically biased by recent trials - on a trial-by-trial basis [18]. This phenomenon has been framed as being both advantageous - as integrating information over time will improve signal-to-noise - and disadvantageous - as recent stimuli could render an instantaneous response sub-optimal [19]. However, much of this research has focused on trial-to-trial dependence, rather than moment-to-moment changes in dependence within a trial.
Here, we explore moment-to-moment multisensory integration in three steps. First, we introduce a novel multisensory task in which we provide the same number of task-relevant stimuli per trial (in a background of noise), but vary how this information is presented: from many short bursts to a few long sequences. Next, we demonstrate that prior multisensory algorithms perform sub-optimally on this task, though perform better by simply considering short temporal windows. Finally, we explore the more naturalistic case, in which information is structured at multiple timescales. Specifically, we sample burst lengths from a Lévy distribution. This distribution describes animal behaviours such as foraging [20], which are composed of many short bursts (local exploitation) interspersed by occasional long flights (exploration).
Results
Multisensory algorithms are blind to temporal structure
We previously introduced a family of multisensory tasks in which observers must track prey using sequences of multisensory signals over time (Fig 1A) [11]. Conceptually, in these tasks, prey either hide or emit signals, which provide clues about their direction of motion, at every time step, and observers must estimate their direction of motion (e.g. left or right). Practically, in each trial, prey are assigned a direction of motion. Then, at each time point, a hidden variable (e) takes values 0 or 1. When e = 0 (prey hiding), the predator observes noise (signals with no bias in either direction) in each sensory channel. When e = 1 (prey observable), the predator receives signals which indicate the prey’s direction of motion. As such, while each time step provides little information on its own, observers can improve their accuracy by accumulating evidence over time. However, as in many prior multisensory tasks, each time step is independent, meaning that consecutive sequences of signals are no more informative than those same signals spaced over time.
In the multisensory detection task from [11] (A), observers must estimate their prey’s motion (left/right) from sequences of time-independent signals. Here, we introduce a variant of this task (B) in which prey emit either short bursts or long sequences of signals, controlled by a burst length parameter k. Notably, as k increases we decrease the number of bursts, such that the total number of time steps where the signal is present is constant across trials (on average). For example, here when k = 2 we provide 4 bursts, while when k = 4 we provide only 2. In classical multisensory algorithms, the information from independent sensory channels (Ch0, Ch1) is combined via a linear sum (C) or nonlinear function (D), then summed over time. To capture temporal structure, we adapted the nonlinear fusion algorithm to process sliding input windows of length w (E). We compare these models to fully recurrent neural networks (F). G An example trial showing whether or not the target was emitting a signal (top row, variable E), the sensory signals in the two channels (middle two rows), and the corresponding estimated probability that the target is moving right (M = R), as estimated by a model, in the bottom row. As evidence accumulates, the model gains confidence in its prediction.
Here, we introduce a new multisensory task with two key properties:
- Prey emit bursts of cues which are informative of their direction of motion. The length of these bursts is set by a parameter we term k. Low k generates short bursts, while high k generates long sequences (Fig 1B). When k = 1 there is no serial dependence and each time step is independent.
- As k increases we decrease the number of bursts, such that the total number of time steps where the signal is present, or signal sparsity, is constant across trials (on average). For example, for a trial of a given length and signal sparsity, we would provide n bursts for a burst length of 4, 2n bursts for a burst length of 2 and 4n bursts for a burst length of 1.
Together, these properties generate trials in which the total duration of the signal is constant, but how this information is presented varies: from many short bursts to a few long sequences. Notably, as the time steps in this task are not independent, calculating optimal performance, in the Bayesian sense, is computationally intractable [21] and so we cannot easily compute an upper bound on performance.
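As a concrete illustration of this normalisation, consider the expected number of bursts for a fixed signal budget (a minimal sketch; the trial length here is an assumption, and only the signal fraction is taken from the Materials and methods):

```python
# Hypothetical illustration of the constant-sparsity constraint: with
# trial length T and target signal fraction p_e, the expected number of
# signal time steps is p_e * T, so bursts of length k occur on average
# p_e * T / k times per trial.
T, p_e = 400, 0.04      # T is an assumed trial length
for k in [1, 2, 4, 8]:
    print(f"k={k}: ~{p_e * T / k:4.1f} bursts, {p_e * T:.0f} signal steps")
```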
We began by training and testing two algorithms (linear and nonlinear fusion; LF and NLF) on this task as we varied the burst length k from 1 to 8 (Fig 1C and 1D). When we allow these algorithms access to only one channel (unisensory), both perform well (Fig 2A). However, they perform significantly better with access to both channels (multisensory), demonstrating the benefit of fusing information across channels in this task. Consistent with prior work [11], NLF outperformed LF in the multisensory case. At lower levels of signal sparsity this difference would increase substantially [11], though here our focus is on temporal structure, not linear vs nonlinear fusion. Notably, both algorithms’ accuracies decrease slightly as a function of k, despite the additional temporal information available and consistent signal sparsity. The reason for this is that, while the average number of signal events remains constant across k, the distribution of the number of signal events in a given trial has some variability. This is explained in more detail in S1 Appendix.
A The accuracy of each model (y-axis) when trained and tested on the same value of the signal burst length k (x-axis); i.e. when tested in distribution. We trained and tested 5 RNN models, and plot their mean accuracy ± standard deviation.
B, C, D The test accuracy of each model when trained on a specific value of $k$ (colours) and tested on another (x-axis), i.e. when tested out of distribution (subpanels 1); and the accuracy of the algorithm trained on $k$ and tested on $k - \Delta$ (where $\Delta$ is the x-axis), relative to the best achievable accuracy by that model on that test set (subpanels 2). More precisely, if $a(k_{\mathrm{train}}, k_{\mathrm{test}})$ is the accuracy when trained on $k_{\mathrm{train}}$ and tested on $k_{\mathrm{test}}$, then the plot shows $a(k, k - \Delta) - \max_{k'} a(k', k - \Delta)$. When $\Delta$ is negative, each model is being tested on longer sequences than it was trained on. When positive, models are being tested on sequences shorter than they were trained on.
Overall, these results illustrate the intuitive point that, as often implemented, linear and nonlinear fusion across channels cannot leverage temporal structure.
Incorporating time into multisensory algorithms
To capture temporal information, we adapted the nonlinear fusion algorithm to process sliding input windows of length w; a family of models we term NLFw (Fig 1E). As such, at each time step (t), these models combine incoming (t) and prior signals (from t−w). As the number of parameters in these models scales unfavourably (as $O^{N_c w}$, for $O$ possible observations per channel per time step, $N_c$ channels and window length $w$), we focus on NLF2 and NLF3. To capture dependencies over longer timescales, we also trained recurrent neural network models (RNNs). Our RNN models used nonlinear activations (ReLU) and were trained with backpropagation through time. Overall, our models have the following numbers of learnable parameters: LF - 6, NLF - 9, NLF2 - 81, NLF3 - 729, RNN - 10,903 (Table 1). Notably, a linear algorithm with a sliding window is equivalent to one without (as the order of operations does not matter), so there is no need to test these models.
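As a quick check, these parameter counts follow directly from the table-lookup scaling (a minimal sketch, assuming $O = 3$ possible observations per channel per time step and $N_c = 2$ channels, as in our tasks):

```python
# Parameter counts for the NLF_w family: one learnable weight per
# possible window of observations, i.e. O ** (Nc * w) entries.
O, Nc = 3, 2
for w in [1, 2, 3]:                              # NLF, NLF2, NLF3
    print(f"NLF{w}: {O ** (Nc * w)} parameters")  # -> 9, 81, 729
```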
A table noting: each model’s parameter scaling, each model’s number of learnable parameters here, and how each model combines information across sensory channels (linear or nonlinear) and time. Linear and nonlinear fusion (LF and NLF) treat each timestep independently. NLF2/3 fuse information across short temporal windows. RNNs combine incoming signals with their prior hidden states. For the parameter scaling functions: $O$ - the number of possible observations, $N_c$ - the number of sensory channels, $w$ - the temporal integration window. For our RNN models the number of parameters scales as $N_h (N_i + N_h + N_o + N_b)$, where $N_i$, $N_h$ and $N_o$ denote the number of input, hidden and output units and $N_b$ is the number of bias parameters per hidden unit.
In distribution.
We first considered how well these models performed in distribution. That is, when each model is trained and tested on bursts of the same length (k): train on k = 2, test on k = 2, etc. (Fig 2A).
When k = 1, NLF2/3 and the RNN models all perform equivalently to NLF, as each time step is independent and there is no additional temporal structure to detect. However, as k increases these models outperform LF and NLF, as they are able to leverage the additional temporal information available. Though, how each model’s accuracy varies with k differs: NLF2/3 increase their performance then plateau, while RNN performance is flat then rises. As such, while the RNNs excel at detecting longer sequences, the simpler NLFw models are better at detecting shorter bursts and surprisingly good at longer sequences, even when those sequences are longer than their window length (or memory). This is likely because, even when k is high, short bursts are sufficiently informative of the prey’s direction of motion, as they are unlikely to occur in the incorrect direction. Moreover, when we trained smaller RNNs, with an equivalent number of trainable parameters to NLF3, we found that they performed far worse, suggesting that NLFw may be a better model per parameter (S2 Fig). In S3 Fig, we visualise the learned parameters for models as a function of k.
Notably, unisensory versions of these models performed worse than their multisensory equivalents (Fig 2A) and showed less difference in their performance as a function of k (S4 Fig).
Together, these results demonstrate the benefits of fusing information over channels and time. However, in naturalistic settings, predators must perform well not only in response to motion patterns they have experienced, but also, in novel, unseen situations.
Generalisation.
We next considered how well these models generalise. That is, how well they perform when they are trained on one burst length (k) and tested on another (Fig 2B, 2C and 2D).
When fit on k = 1 and tested on k>1, all three models’ accuracies decrease slightly as a function of k (dark blue lines, Fig 2B1, 2C1 and 2D1). This reflects the fact that, while these models have the capacity to detect sequences, there is no benefit to learning to do so when the training data has no temporal structure (k = 1). As such, they perform equivalently to NLF (Fig 2A). Similarly, when NLF3 is trained on k = 2 it only performs as well as NLF2, and perhaps even slightly worse (Fig 2C1), as there is no benefit to learning to detect longer sequences.
In contrast, once these models learn from longer sequences (past w in the case of NLF2/3), all three models generalise reasonably well; the maximum difference in accuracy we observe is small (Fig 2B2, 2C2 and 2D2). Though, again, we observe a notable difference between the NLFw and RNN models. Specifically, both NLF models generalise better when tested on longer sequences than they were trained on, and less well to shorter sequences. In contrast, the RNN generalises better to shorter rather than longer sequences (Fig 2B2, 2C2 and 2D2).
Overall, these results demonstrate that all three models generalise well when tested out of distribution. However, these scenarios are still unrealistic, in the sense that the prey always emit bursts of similar length.
Capturing multi-timescale structure
To add further realism to our task, we next considered a variant in which, within each trial, prey emit bursts of varied length, drawn from either a uniform or Lévy distribution. The latter is a heavy-tailed distribution, which we chose to explore as it describes animal behaviours like foraging [20], which are composed of many short bursts broken by occasional long flights (which constitute the heavy tail).
We found that all three models (NLF2, NLF3 and RNNs) performed well when trained and tested on burst lengths drawn from either uniform or Lévy distributions. Furthermore, models trained on our fixed length tasks generalised well to these mixed length tasks (Fig 3A, 3B and 3C).
Model performance (test accuracy, y-axis) on mixed (uniform and Lévy) and fixed burst length tasks (x-axis). A Nonlinear fusion with window length 2 (NLF2), B Nonlinear fusion with window length 3 (NLF3), C Recurrent neural networks (RNN). In each sub-panel, the solid/dashed gray lines show how well the model performs when it is trained on mixed burst lengths (Lévy/uniform), and then tested on mixed or fixed burst length tasks (across the x-axis). Each shaded region shows the min-max span across all models trained on fixed burst lengths (from k = 1 − 8). We calculated this by: training models for each value of k, testing them on all bursts (mixed and fixed), then selecting the best and worst model per scenario.
However, NLFw models trained on mixed distributions tended to generalise better than those trained on fixed distributions (Fig 3A and 3B), though the same was not true for the RNNs (Fig 3C). For example, at k = 8, the NLFw models trained on mixed distributions outperformed those trained on fixed distributions (Fig 3A and 3B), while the RNNs trained on fixed distributions outperformed their counterparts trained on mixed distributions (Fig 3C). This suggests that learning to detect a mixture of short bursts, in the case of NLF2/3, yields a parsimonious strategy that generalises well to longer sequences, while, with more resources (i.e. parameters), the RNNs can learn more specialised strategies (for each value of k).
Together these results demonstrate that in a more complex situation, where prey emit signals in multiple channels and at multiple time-scales, all three models perform reasonably well. However, across settings (testing in and out of distribution, on both fixed and mixed length signals), NLF3 often performs best or close to best (Fig 4), underscoring the benefit of fusing information, not only over channels, but also over short temporal windows.
Performance of NLF2, NLF3 and RNN across various training and testing conditions; in distribution (on the diagonal) and out of distribution (off diagonal). We train and test on 10 different distributions: Lévy flights, uniformly distributed burst lengths, and fixed burst lengths ($k = 1$ to $8$).
Discussion
To date, many psychophysical tasks have been used to explore how animals combine information across their senses. In classical multisensory tasks, each sensory modality (or channel) provides evidence about an underlying target independently [6–9, 22]. As such, the optimal solution to these tasks is to linearly fuse (or integrate) evidence across channels. Though, in the case when channels are co-dependent, the optimal solution is to nonlinearly fuse evidence across channels [11]. This case seems likely to arise in natural conditions; consider the relation between lip movements and sounds, for example. However, in both of these cases evidence should be fused linearly across time. This is because within these tasks, each time step is independent, meaning the time points within each trial could be shuffled without changing the results.
Here, we explored another scenario that seems likely to arise in natural conditions, where the evidence at any given moment depends on recent moments: like the bursts of sensory signals prey darting from cover to cover would emit. To do so, we adapted a task from [11] to make the target (prey) emit bursts of varied length; thereby introducing a sequential time dependence. From the point of view of estimating the target value (the predator’s perspective), this means that if you have recently seen evidence suggesting that the target is currently emitting a signal, you may wish to increase the relative weighting of the current evidence.
This intuition suggested two simple models which could perform this computation, and plausibly be implemented by neural circuits. The first combines information across channels and short temporal windows of a fixed length; a family of models we term nonlinear fusion over w time steps, or NLFw. The second are recurrent neural networks (RNNs) which combine the current evidence from each channel with their prior hidden states, and so can capture structure at multiple timescales. Notably, the Bayes optimal solution to this problem involves a combinatorial computation over all time steps [21], which is computationally intractable and unlikely to be implemented by neural circuits.
As expected, both models (NLFw and RNNs) were able to leverage temporal structure, and outperform time-independent algorithms (linear and nonlinear fusion). Though, across settings (in and out of distribution, and on both fixed and mixed length signals) NLF2/3 performed well, and were surprisingly comparable to the RNNs despite having less than one tenth the number of parameters. This suggests that in these tasks, short bursts are sufficiently informative of the target’s motion and there is little benefit to detecting longer sequences. In practice, this is akin to change detection, and would also allow for faster reaction times than waiting to observe longer sequences.
So, which is a more plausible model of how neural circuits integrate information over channels and time? NLFw could be implemented by a range of simple mechanisms. For example, a multisensory neuron receiving w inputs per channel, each offset with a different temporal delay would fuse information across channels and time. In contrast, the RNN model requires a population of densely connected units, and hence a higher energetic cost. For the tasks we explore here, this cost does not seem merited by the increase in performance. However, in tasks with even more detailed temporal structure, for example, complex multistep, multisensory sequences, it seems likely that a recurrent network would outperform the simpler model.
In conclusion, our results demonstrate the benefits of combining multisensory information across short temporal windows (NLFw) or prior states (RNNs), either of which could easily be implemented by neural circuits. More broadly, our results underscore the importance of exploring more complex multisensory tasks, and highlight the fact that, despite their apparent complexity, these tasks are often solvable with simple, biologically plausible extensions to existing models.
Materials and methods
In short, we build on the detection task from [11] - in which observers must estimate the heading direction of a target from sequences of information in n channels; which represent different sensory modalities or independent sources of information from the same modality. In this task, at some (unknown) time points, the target emits signals and the sensory observations provide information about the target direction, while at other times, the sensor receives noise. In [11] we showed that this task structure requires nonlinear fusion across channels to solve optimally, but since each time step is independent, only linear fusion across time steps was needed. Here we introduce limited temporal dependency in the signals by having signals switch on at unknown times but then remain on for a number of steps (where this number is either a constant or chosen from a random distribution). It is not feasible to compute the optimal classifier in this case, and so we investigated two classifiers with different types of short-term memory, one based on a linear sum of nonlinear functions operating on a window that slides across all time steps, and one based on a recurrent neural network. Below, we detail these tasks and inference models.
Tasks
We start by sampling a random direction $M$, with each of the two possible directions equally likely. The task is to estimate $M$.
We define a sequence of binary valued random variables $E_t$ to indicate whether the target is emitting a signal at time $t$ (where $1 \le t \le n$). When $E_t = 1$, the observations $X_{it}$ in channel $i$ follow a signal distribution (giving the correct value of $M$ with probability $p_c$, the incorrect value with probability $p_i$, and a neutral value (0) with probability $1 - p_c - p_i$). When $E_t = 0$, $X_{it}$ follows a noise distribution taking the correct/incorrect values of $M$ with equal probability $(1 - p_n)/2$ and the neutral value with probability $p_n$.
To generate the sequence $E_t$, we sample a generating sequence $G_t$ of signal start times, so that whenever $G_t = 1$ the signal $E_t = 1$ and will stay 1 for some period $L_t$ (which can either be a constant or be drawn from a distribution). Finally, to ensure that the task is solvable, we filter out all cases where $E_t = 0$ for all $t$. We set the probability that $G_t = 1$ so that the fraction of the time that $E_t = 1$ is equal to a value $p_e$ that we choose.
General task structure.
In more detail, the task variables are related by the following graphical model:
- $M$ is the target direction, with each of the two values being equally likely.
- $G_t$ are the start times for emission periods, taking value 1 with probability $p_g$ to indicate the start of an emission period.
- $L_t$ is the length of the emission period starting at $t$, and follows a different distribution for different tasks.
- $E_t$ is whether or not the target is emitting a signal at time $t$, and has deterministic dependence on $L_t$ and $G_t$.
- $X_{it}$ is the signal received in channel $i$ at time $t$. Its distribution depends on $M$ and $E_t$ as described below.
- The $G_t$ are independent variables, except that if $E_t = 0$ for all $t$ then values are resampled, introducing a small time dependence that we ignore in the analysis.
The distribution of $X_{it}$ depends on whether or not a signal is being emitted. If not, it follows a noise distribution that is independent of $M$:
$$P(X_{it} = x \mid E_t = 0) = \begin{cases} (1 - p_n)/2 & \text{if } x \text{ is the correct value of } M \\ (1 - p_n)/2 & \text{if } x \text{ is the incorrect value of } M \\ p_n & \text{if } x = 0 \text{ (neutral)} \end{cases}$$
If a signal is being emitted, then it depends on $M$ as follows:
$$P(X_{it} = x \mid E_t = 1, M) = \begin{cases} p_c & \text{if } x \text{ is the correct value of } M \\ p_i & \text{if } x \text{ is the incorrect value of } M \\ 1 - p_c - p_i & \text{if } x = 0 \text{ (neutral)} \end{cases}$$
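A minimal sketch of this generative process (our own variable names; the trial length is an assumption, the remaining parameter values follow the Task parameters section below, and we code $M$ as $\pm 1$ and observations as $-1/0/+1$):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_trial(n=400, k=2, p_g=0.02, p_c=0.45, p_i=0.01, p_n=0.33):
    # Target direction M, with each of the two values equally likely.
    M = rng.choice([-1, 1])
    while True:
        # Start times G_t, sampled from t = 1-k+1 onwards so that E_1
        # is well defined.
        G = rng.random(n + k - 1) < p_g
        # E_t = 1 if any start within the previous k steps covers t.
        E = np.array([G[t:t + k].any() for t in range(n)])
        if E.any():            # filter out trials with no signal at all
            break
    X = np.empty((2, n), dtype=int)        # Nc = 2 channels
    for i in range(2):
        for t in range(n):
            if E[t]:           # signal distribution
                X[i, t] = rng.choice([M, -M, 0],
                                     p=[p_c, p_i, 1 - p_c - p_i])
            else:              # noise distribution, independent of M
                X[i, t] = rng.choice([M, -M, 0],
                                     p=[(1 - p_n) / 2, (1 - p_n) / 2, p_n])
    return M, E, X
```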
Normalising for signal sparsity
In order to make performance comparable between the different task variants, we normalise certain parameters so that the average amount of useful information (ignoring time) is the same across tasks. Specifically, we choose $p_g$ so that the expected fraction of the time that a signal is being emitted is equal to a value $p_e$ that we choose. We outline the calculation of this normalisation for the $\mathrm{Det}_k$ and Lévy flight tasks below.
Detection task with k timesteps (Detk).
In this task, we sample values $G_t \in \{0, 1\}$ with $P(G_t = 1) = p_g$ independently. We compute $E_t = \max(G_{t-k+1}, \ldots, G_t)$ and then resample if all $E_t = 0$. Note that to compute $E_1$ we need to have values for $G_{1-k+1}$ (and onwards to $G_n$). We start by computing $P(E_t = 1)$ without the resampling procedure, and then compute the effect of resampling.
Without resampling, $E_t = 0$ exactly when $G_{t-k+1} = \cdots = G_t = 0$. Each of these values $G_{t-k+1}$ to $G_t$ is independent, and therefore we can write
$$P(E_t = 0) = (1 - p_g)^k.$$
We can then compute
$$P(E_t = 1) = 1 - (1 - p_g)^k.$$
Now write $F$ for the event that we filter out (i.e. $E_t = 0$ for all $t$). We want to compute the fraction of the time $E_t = 1$ when this event does not take place, $P(E_t = 1 \mid \neg F)$. To calculate this, we note that either $F$ does or does not take place, so computing the total probability over both these two possibilities we get
$$P(E_t = 1) = P(E_t = 1 \mid F)\, P(F) + P(E_t = 1 \mid \neg F)\, (1 - P(F)).$$
We computed $P(E_t = 1)$ above, and by the definition of $F$ we know that $P(E_t = 1 \mid F) = 0$, and therefore we get that
$$P(E_t = 1 \mid \neg F) = \frac{P(E_t = 1)}{1 - P(F)}.$$
It remains to calculate $P(F)$. The event $F$ occurs exactly when every one of the $n + k - 1$ independent values $G_{1-k+1}, \ldots, G_n$ is 0, so
$$P(F) = (1 - p_g)^{n + k - 1}.$$
This gives us
$$p_e = P(E_t = 1 \mid \neg F) = \frac{1 - (1 - p_g)^k}{1 - (1 - p_g)^{n + k - 1}}.$$
We can then numerically invert this to get the correct value for $p_g$ given a desired value $p_e$ (see the sketch below). Note that by taking limits as $p_g \to 0$, the smallest $p_e$ achievable is $k / (n + k - 1)$.
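A minimal sketch of this numerical inversion (the trial length $n = 400$ is an assumption of this example; `brentq` is SciPy's standard bracketing root-finder):

```python
from scipy.optimize import brentq

def p_e_of_p_g(p_g, k, n):
    # p_e = P(E_t = 1 | not F), from the expression derived above.
    return (1 - (1 - p_g) ** k) / (1 - (1 - p_g) ** (n + k - 1))

def invert(p_e_target, k, n=400):
    # Solve p_e_of_p_g(p_g) = p_e_target for p_g on (0, 1).
    return brentq(lambda p: p_e_of_p_g(p, k, n) - p_e_target,
                  1e-9, 1 - 1e-9)

print(round(invert(0.04, k=2), 3))  # ~0.02, cf. Task parameters below
```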
Lévy flight.
The calculation for the Lévy flight task is similar to the $\mathrm{Det}_k$ task but a little more involved.
To simplify notation later, write $\ell_j = P(L_t = j)$ for the burst length distribution and $q_s = P(L_t > s)$ for its tail, with $\ell_{\max}$ the maximum possible burst length.
To compute $P(E_t = 0)$ we need to consider various possibilities depending on the on-period lengths $L_s$ for $s \le t$. For $E_t = 0$, the first thing that needs to be true is that $G_t = 0$ (because if $G_t = 1$ then automatically $E_t = 1$). The next condition we need to be true is that either $G_{t-1} = 0$, or $G_{t-1} = 1$ but $L_{t-1} = 1$, meaning that the on period generated by $G_{t-1} = 1$ was only of length 1 and therefore did not cause $E_t = 1$. And so on, until we have gone back to the maximum number of previous steps $\ell_{\max} - 1$ that could cause $E_t = 1$. Each of these events depends on $G_s$ and $L_s$ for a different value of $s$, and they are therefore independent, so
$$P(E_t = 0) = \prod_{s=0}^{\ell_{\max} - 1} \left( 1 - p_g\, q_s \right).$$
This gives us $P(E_t = 1)$ without filtering, and to compute $P(E_t = 1 \mid \neg F)$ we need to compute $P(F)$. We break this event down into independent events $F_j$, defined as $G_t = 0$ for $1 - j + 1 \le t \le n$ whenever $L_t = j$. In other words, there can be no length-1 on periods that cause an $E_t = 1$ (event $F_1$), no length-2 on periods (event $F_2$) and so on. These are each independent events and therefore
$$P(F) = \prod_j P(F_j) = \prod_j \left( 1 - p_g\, \ell_j \right)^{n + j - 1}.$$
Putting it together,
$$p_e = P(E_t = 1 \mid \neg F) = \frac{1 - \prod_{s=0}^{\ell_{\max} - 1} (1 - p_g q_s)}{1 - \prod_j (1 - p_g \ell_j)^{n + j - 1}}.$$
As before, this can be inverted numerically to compute the $p_g$ value that gives a desired $p_e$ value.
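The inversion can also be sanity-checked by brute force, estimating the realised signal fraction for a candidate $p_g$ by simulation (a Monte-Carlo sketch; the heavy-tailed parameterisation, length cap and trial length here are our own assumptions, not the paper's exact choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_p_e(p_g, n=400, l_max=50, alpha=1.5, trials=2000):
    fractions = []
    for _ in range(trials):
        E = np.zeros(n, dtype=bool)
        for t in range(n):
            if rng.random() < p_g:
                # Heavy-tailed burst length, capped at l_max (assumed).
                L = min(int(rng.pareto(alpha)) + 1, l_max)
                E[t:t + L] = True
        if E.any():            # filtering: discard all-zero trials
            fractions.append(E.mean())
    return float(np.mean(fractions))

# Compare the realised fraction against the target p_e (here 0.04).
print(estimate_p_e(0.022))
```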
Task parameters
For all tasks we used the following parameters: pc = 0.45, pi = 0.01, pn = 0.33 and pe = 0.04 which results in the following values of pg for each value of k (from 1-8): 0.039, 0.02, 0.013, 0.01, 0.008, 0.006, 0.005, 0.004, and values of 0.022 / 0.011 for our mixed k tasks (Lévy/uniform).
Inference
The task is to estimate $M$ from $X_{it}$ with unknown hidden variables $G_t$, $E_t$ and $L_t$. It is straightforward to write down the optimal maximum a posteriori (MAP) estimator using Bayes’ theorem. We simply want to choose $m$ that maximises $P(M = m \mid X)$, and given that $P(M = m \mid X) = P(X \mid M = m) P(M = m) / P(X)$, and that $P(M = m) = 1/2$ and $P(X)$ don’t depend on $m$, we just need to maximise the likelihood $P(X \mid M = m)$. We can marginalise over the hidden variables to get
$$P(X \mid M = m) = \sum_{G, L} P(X \mid M = m, G, L)\, P(G)\, P(L).$$
However, given that the time steps are not independent, and the sum is over exponentially many terms, it is hard to simplify this any further in a way that could be easily computable by the brain.
Instead, we consider two types of estimators with different types of short term memory.
NLF$_w$.
The first type, $w$-step nonlinear fusion (or NLF$_w$), computes, for each time step, a likelihood based on the previous $w$ time steps (which can be implemented as a table look-up over all the possible observations $X_{it}$ over $w$ time steps), and then sums this over all time steps. In more detail, we enumerate every possible observation $X_{it}$ for a window of $w$ time steps across all $N_c$ channels (of which there are $O^{N_c w}$ possibilities). For example, for $N_c = 2$, $w = 2$, there are $3^4 = 81$ possible observations. For every trial, we then count how often each of these observations occurs for every window of $w$ consecutive time steps, and make these counts into a feature vector $F$. For example (coding the incorrect, neutral and correct directions as $-1$, $0$ and $+1$, and writing each window as the tuple of per-channel values at each of its time steps), if channel 0 observes $(1, 1, 1, 0)$ and channel 1 observes $(0, 0, 0, 0)$ over four time steps, then the count for the observation $((1, 0), (1, 0))$ is 2, the count for the observation $((1, 0), (0, 0))$ is 1, and all other observations have a count of 0. We then train a logistic regression classifier to estimate $M$ from $F$ using scikit-learn [23]. The parameters of the linear part of the model can be interpreted as the log-likelihoods of each possible observation in a window, and using counts is equivalent to summing these over all windows.
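A minimal sketch of this feature construction and classifier (assuming trials shaped as in the earlier generator sketch, with observations coded $-1/0/+1$; all helper names are ours):

```python
import numpy as np
from itertools import product
from sklearn.linear_model import LogisticRegression

def window_counts(X, w=2):
    # Count occurrences of every possible length-w window in one trial;
    # X has shape (Nc, n) with entries in {-1, 0, +1}.
    Nc, n = X.shape
    alphabet = list(product([-1, 0, 1], repeat=Nc * w))  # O**(Nc*w) windows
    index = {obs: j for j, obs in enumerate(alphabet)}
    F = np.zeros(len(alphabet))
    for t in range(n - w + 1):
        F[index[tuple(X[:, t:t + w].ravel())]] += 1
    return F

# Hypothetical usage, given a list of trials X_trials and labels M_labels:
# F_train = np.stack([window_counts(X) for X in X_trials])
# clf = LogisticRegression(max_iter=1000).fit(F_train, M_labels)
```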
RNN.
The second type of estimator, RNN, uses a recurrent neural network with one hidden layer and 100 hidden units, fed at each time step with inputs $X_{it}$ and its previous state $h_{t-1}$, and the output of this network is treated as a likelihood. In more detail, write $h_t$ for the activity of the hidden layer at time step $t$, and $o_t$ for the activity of the output layer. Then
$$h_t = f(W_{ih} x_t + W_{hh} h_{t-1} + b_h), \qquad o_t = g(W_{ho} h_t + b_o)$$
for nonlinear activation functions $f$, $g$, weight matrices $W_{ih}$, $W_{hh}$ and $W_{ho}$, and biases $b_h$ and $b_o$. The rectified linear unit function is defined as
$$\mathrm{ReLU}(x) = \max(0, x).$$
The hidden state is initialized to zero, and outputs are summed across time steps to produce the final prediction $\hat{M} = \arg\max_m \sum_t o_{t,m}$. The network is trained with a standard cross-entropy loss on the output layer using PyTorch [24] and the Adam optimiser [25] with learning rate of $10^{-6}$.
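A minimal PyTorch sketch of this estimator (the input encoding and exact sizes are assumptions; the ReLU recurrence, 100 hidden units, summed outputs, cross-entropy loss and Adam learning rate follow the text):

```python
import torch
import torch.nn as nn

class FusionRNN(nn.Module):
    def __init__(self, n_in=6, n_hidden=100, n_out=2):
        super().__init__()
        # One recurrent layer with ReLU nonlinearity, as in the text.
        self.rnn = nn.RNN(n_in, n_hidden, nonlinearity="relu",
                          batch_first=True)
        self.readout = nn.Linear(n_hidden, n_out)

    def forward(self, x):                   # x: (batch, time, n_in)
        h, _ = self.rnn(x)                  # hidden state starts at zero
        return self.readout(h).sum(dim=1)   # sum outputs across time steps

model = FusionRNN()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-6)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical training step on a batch (x, m):
# optimiser.zero_grad()
# loss = loss_fn(model(x), m)
# loss.backward()                           # backpropagation through time
# optimiser.step()
```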
Code
Example code for generating all tasks can be found in S1 Appendix. Full code for all tasks and models can be found at: https://github.com/swathianil/Temporal_Nonlinear_fusion.
Supporting information
S1 Appendix. Why does task performance decrease with burst length? And task code.
https://doi.org/10.1371/journal.pcbi.1013125.s001
(PDF)
S1 Fig. Why does task performance decrease with burst length?
Explanation of why accuracy decreases for methods that do not use temporal information (LF, NLF) as k increases. See S1 Appendix.
https://doi.org/10.1371/journal.pcbi.1013125.s002
(TIFF)
S2 Fig. Performance of algorithms compared to a smaller RNN.
The accuracy of each model (y-axis) when trained and tested on the same value of the signal burst length k (x-axis); i.e. when tested in distribution. For the RNN models error bars show standard deviation over 5 separate train/test runs. For comparison, we include the performance of a scaled down RNN (RNNsmall), with 22 hidden units and 663 trainable parameters—a comparable number of parameters to NLF3 (729 parameters).
https://doi.org/10.1371/journal.pcbi.1013125.s003
(TIFF)
S3 Fig. Parameters of trained algorithms.
Learned parameter weights. Each subplot shows the weights learned by a single model (LF, NLF, NLF2 and NLF3). Each x-axis shows the value of k used to train the model. Each y-axis shows possible observations/features, sorted by their total corresponding parameter value (summed across values of k). Intuitively, ambiguous signals (e.g. the two channels indicating opposite directions) tend to be assigned low/zero weights, while unambiguous signals (e.g. both channels indicating the same direction) tend to be assigned larger weights. And NLFw seems to learn different patterns when $k < w$ vs when $k \ge w$. Though we did not quantify these observations.
https://doi.org/10.1371/journal.pcbi.1013125.s004
(TIFF)
S4 Fig. Comparison of multi- and unisensory task performances.
Scatter plot showing the performance of algorithms in unisensory (x axis) and multisensory (y axis) tasks. Accuracies are shown relative to the accuracy of the corresponding performance of the nonlinear fusion (NLF) algorithm. Lighter colours correspond to lower values of the burst length k, and darker colours to higher values.
https://doi.org/10.1371/journal.pcbi.1013125.s005
(TIFF)
References
- 1. Trommershauser J, Kording K, Landy MS. Sensory cue integration. Oxford University Press; 2011.
- 2. Fetsch CR, DeAngelis GC, Angelaki DE. Bridging the gap between theories of sensory cue integration and the physiology of multisensory neurons. Nat Rev Neurosci. 2013;14(6):429–42. pmid:23686172
- 3. Jones PR. A tutorial on cue combination and signal detection theory: using changes in sensitivity to evaluate how observers integrate sensory information. J Math Psychol. 2016;73:117–39.
- 4. Townsend JT, Wenger MJ. A theory of interactive parallel processing: new capacity measures and predictions for a response time inequality series. Psychol Rev. 2004;111(4):1003–35. pmid:15482071
- 5. Otto TU, Mamassian P. Noise and correlations in parallel perceptual decision making. Curr Biol. 2012;22(15):1391–6. pmid:22771043
- 6. Fetsch CR, Pouget A, DeAngelis GC, Angelaki DE. Neural correlates of reliability-based cue weighting during multisensory integration. Nat Neurosci. 2011;15(1):146–54. pmid:22101645
- 7. Drugowitsch J, DeAngelis GC, Klier EM, Angelaki DE, Pouget A. Optimal multisensory decision-making in a reaction-time task. Elife. 2014;3:e03005. pmid:24929965
- 8. Hou H, Zheng Q, Zhao Y, Pouget A, Gu Y. Neural correlates of optimal multisensory decision making under time-varying reliabilities with an invariant linear probabilistic population code. Neuron. 2019;104(5):1010.e10–1021.e10. pmid:31607423
- 9. Coen P, Sit TPH, Wells MJ, Carandini M, Harris KD. Mouse frontal cortex mediates additive multisensory decisions. Neuron. 2023;111(15):2432.e13–2447.e13. pmid:37295419
- 10. Parise CV, Ernst MO. Correlation detection as a general mechanism for multisensory integration. Nat Commun. 2016;7:11543. pmid:27265526
- 11. Ghosh M, Béna G, Bormuth V, Goodman DFM. Nonlinear fusion is optimal for a wide class of multisensory tasks. PLoS Comput Biol. 2024;20(7):e1012246. pmid:38968324
- 12. Pascucci D, Tanrikulu ÖD, Ozkirli A, Houborg C, Ceylan G, Zerr P, et al. Serial dependence in visual perception: a review. J Vis. 2023;23(1):9. pmid:36648418
- 13. Dyson BJ. Serial dependence in audition: Free, fast, and featureless? Trends Cogn Sci. 2017;21(11):819–20.
- 14. Wozny DR, Shams L. Recalibration of auditory space following milliseconds of cross-modal discrepancy. J Neurosci. 2011;31(12):4607–12. pmid:21430160
- 15. Kayser SJ, Kayser C. Trial by trial dependencies in multisensory perception and their correlates in dynamic brain activity. Sci Rep. 2018;8(1):3742. pmid:29487374
- 16. Lau WK, Maus GW. Visual serial dependence in an audiovisual stimulus. J Vis. 2019;19(13):20. pmid:31770771
- 17. Shaw LH, Freedman EG, Crosse MJ, Nicholas E, Chen AM, Braiman MS, et al. Operating in a multisensory context: assessing the interplay between multisensory reaction time facilitation and inter-sensory task-switching effects. Neuroscience. 2020;436:122–35. pmid:32325100
- 18. Fischer J, Whitney D. Serial dependence in visual perception. Nat Neurosci. 2014;17(5):738–43. pmid:24686785
- 19. Kiyonaga A, Scimeca JM, Bliss DP, Whitney D. Serial dependence across perception, attention, and memory. Trends Cogn Sci. 2017;21(7):493–7. pmid:28549826
- 20. Viswanathan GM, Buldyrev SV, Havlin S, da Luz MG, Raposo EP, Stanley HE. Optimizing the success of random searches. Nature. 1999;401(6756):911–4. pmid:10553906
- 21. Ma WJ, Kording KP, Goldreich D. Bayesian models of perception and action: an introduction. MIT Press; 2023.
- 22. Farahmandi A, Abedi Khoozani P, Blohm G. Multisensory integration across reference frames with additive feed-forward networks. bioRxiv, preprint; 2024.
- 23. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12(Oct):2825–30.
- 24. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G. PyTorch: an imperative style, high-performance deep learning library. Red Hook, NY, USA: Curran Associates Inc.; 2019.
- 25. Kingma D, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations. 2014.