Bayesian inference is facilitated by modular neural networks with different time scales

Various animals, including humans, have been suggested to perform Bayesian inferences to handle noisy, time-varying external information. In performing Bayesian inference by the brain, the prior distribution must be acquired and represented by sampling noisy external inputs. However, the mechanism by which neural activities represent such distributions has not yet been elucidated. Our findings reveal that networks with modular structures, composed of fast and slow modules, are adept at representing this prior distribution, enabling more accurate Bayesian inferences. Specifically, the modular network that consists of a main module connected with input and output layers and a sub-module with slower neural activity connected only with the main module outperformed networks with uniform time scales. Prior information was represented specifically by the slow sub-module, which could integrate observed signals over an appropriate period and represent input means and variances. Accordingly, the neural network could effectively predict the time-varying inputs. Furthermore, by training the time scales of neurons starting from networks with uniform time scales and without modular structure, the above slow-fast modular network structure and the division of roles in which prior knowledge is selectively represented in the slow sub-modules spontaneously emerged. These results explain how the prior distribution for Bayesian inference is represented in the brain, provide insight into the relevance of modular structure with time scale hierarchy to information processing, and elucidate the significance of brain areas with slower time scales.


Introduction
In the brains of humans and other animals, information processing involves inference based on inputs from the external world through the sensory systems, and these inputs are uncertain because of noise. Previous studies suggested that animals such as humans and monkeys process inputs according to a Bayesian inference framework to deal with such uncertainty [Knill and Pouget, 2004, Angelaki et al., 2009, Haefner et al., 2016, Ernst and Banks, 2002, Friston, 2012, Merfeld et al., 1999, Doya et al., 2007, Pouget et al., 2013, Beck et al., 2011, Geisler and Kersten, 2002, Honig et al., 2020].
Bayesian inference is performed by calculating the posterior from the prior, which refers to the information possessed in advance about the signal, and the likelihood estimated by observing the input signal. Hence, it is believed that the prior must first be represented in the brain, but how prior information is shaped in the brain remains unclear. In previous studies, the prior has often been treated as a given value [Echeveste et al., 2020], and the mechanism for shaping the prior by learning has not been considered. Evolutionary acquisition of the prior has been proposed [Campbell, 2016, Lo and Zhang, 2021], whereas it is naturally expected that such information should be shaped within one generation through observing and learning time-dependent signals. Experimental results suggest that the prior and the likelihood for Bayesian inference are encoded in different brain areas [Vilares et al., 2012, Chan et al., 2016, d'Acremont et al., 2013]. Still, the validity of these results and the mechanisms underlying them remain controversial, and how area differentiation is relevant to the accuracy of Bayesian inference is not well understood. A simulation [Quax et al., 2021] suggested that a gain of the activation function encodes the prior. However, because the prior was fixed in that study, how shaping occurs when the prior varies over time was not considered.
In general, to obtain the prior, it is necessary to estimate the prior distribution based on previous observations, and the population of neurons that represents the prior must integrate observed inputs over time. One possible mechanism for achieving such integration may be two neural modules functioning at distinct time scales: a downstream neuron population with slower activity changes separated from an upstream neuron population that processes input information. In this structure, the slow module that does not directly receive inputs may facilitate integration. Some experimental reports have suggested that the time scale of neural activities in downstream areas of the brain that do not directly receive external input is slow [Murray et al., 2014, Cavanagh et al., 2020, Golesorkhi et al., 2021]. On this basis, we evaluated recurrent neural networks (RNNs) with two modules: a main module with direct connection to the input-output layer and a sub-module with a direct connection to the main module and no connections to the input-output layer (i.e., a hierarchical structure) (Fig. 1). Then, we examined the role of modular structure and the relevance of the time scale difference between the main and sub-modules for the prior representation for Bayesian inference.
We found that RNNs with a modular structure shape the prior more appropriately than regular RNNs. Further, Bayesian inference is more accurate when the time scale of the sub-module is appropriately slow. When the time scale is uniform, prior information is maintained in both the main module and the sub-module. In contrast, when the time scales are different, prior information is represented by the slow sub-module. Comparing these two cases revealed that the variance of the prior encoded on the neural manifold was easier to decode in the model with different time scales, which made it easier to distinguish changes in the average input from noise.
In addition, we examined whether the modular structure with distinct time scales would emerge from a homogeneous neural network. We trained the network in a Bayesian inference task in which the time scale of each neuron was itself changed by training. As the training progressed, we observed that the time scales of neurons differentiated into slower and faster scales. A modular structure arose in which slow neurons were separated from the input/output layers, which were predominantly connected to the fast neurons, and a sub-module with slow neurons represented the prior information.
These results are crucial for understanding the prior representation mechanism in Bayesian inference and provide insight into the relationships between neural network structure, neural dynamics [Amunts et al., 2022, Mastrogiuseppe and Ostojic, 2018, Vyas et al., 2020, Beiran et al., 2021], and time scales [Papo, 2013] underlying information processing in the brain, which is considered the central issue of computational neuroscience.

Recurrent Neural Networks with/without modular structure
To investigate the effect of structure and time scale on Bayesian inference, we considered the following RNNs [Barak, 2017].
First, we established a regular RNN consisting of an input layer, a recurrent (hidden) layer, and an output layer, as shown in Fig. 1(a). The following equation represents the dynamics of the recurrent layer:
x(t + 1) = (I − α)x(t) + αReLU(W in u(t) + W rec x(t)) + √α ξ, (1)
where α = (α 1 , α 2 , ..., α 200 ) T represents a vector to introduce the time scale of the neurons as
α i = α m for the fast neurons and α i = α s for the slow neurons, (2)
where the standard homogeneous network is given by α s = α m ; the case with α s < α m was also studied to investigate the effect of time scale difference. Although we mainly studied systems with 150 fast and 50 slow neurons, the results to be discussed are not altered as long as both numbers are sufficient (say 100 vs 50 or 150 vs 150 for fast and slow neurons). Here, u(t) is the input signal, and x is the state of the neurons in the recurrent layer. We adopted the rectified linear unit (ReLU) as the activation function [Nair and Hinton, 2010]. Then, the output of the RNN was determined by the linear combination of the internal states as follows:
y(t) = W out x(t). (3)
In Eq. 1, ξ is a noise term in the dynamics, given by a random variable that follows a normal distribution with mean 0 and standard deviation 0.05.
Next, we introduced a modular structure to the above RNN to ensure the distinction of main and sub-modules (Fig. 1(b)). Only the main module was connected to the input/output layers. Thus, the dynamics of the recurrent layer are given by Eqs. 4 and 5 for the main and sub-modules, respectively, where x m and x s represent the firing rates of neurons in the main and sub-modules. Here, α m and α s represent the time scales of the main module and the sub-module, respectively. α m was fixed at 1, while we varied α s from 1 to 0.01 to examine the effect of the time scale difference. The RNN output was determined by the linear combination of internal states of the main module.
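The modular update described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' code: it assumes each module follows the same leaky ReLU update as the regular RNN, and the weight-block names (`in`, `mm`, `ms`, `sm`, `ss`) and the function names are hypothetical labels introduced here.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def modular_rnn_step(x_m, x_s, u, W, alpha_m=1.0, alpha_s=0.1,
                     noise_sd=0.05, rng=None):
    """One update of the main (x_m) and sub (x_s) modules.

    W is a dict of weight blocks (hypothetical names):
      "in": input -> main, "mm": main -> main, "ms": sub -> main,
      "sm": main -> sub,   "ss": sub -> sub.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Main module: driven by the input layer, itself, and the sub-module.
    drive_m = W["in"] @ u + W["mm"] @ x_m + W["ms"] @ x_s
    # Sub-module: driven only by the main module and itself.
    drive_s = W["sm"] @ x_m + W["ss"] @ x_s
    xi_m = rng.normal(0.0, noise_sd, size=x_m.shape)
    xi_s = rng.normal(0.0, noise_sd, size=x_s.shape)
    x_m_new = (1 - alpha_m) * x_m + alpha_m * relu(drive_m) + np.sqrt(alpha_m) * xi_m
    x_s_new = (1 - alpha_s) * x_s + alpha_s * relu(drive_s) + np.sqrt(alpha_s) * xi_s
    return x_m_new, x_s_new

def readout(x_m, W_out):
    """Linear readout from the main module only."""
    return W_out @ x_m
```

With α s = 0.1, the sub-module keeps 90% of its previous state at each step, so it changes roughly ten times more slowly than the main module with α m = 1.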

Task
In this study, we considered a task in which Bayesian inference improves estimation accuracy. Specifically, the RNN was tasked with estimating the true value from an observed signal with noise. We generated the external input as follows: First, the true value y true was randomly sampled from a generator (cause) distribution, given by the normal distribution with mean µ g and variance σ 2 g . Next, the observed signal s was generated from y true by adding noise so that the input is given by the normal distribution with mean y true and variance σ 2 s . The generator did not remain constant: it changed with probability p t over time. When the generator changed, µ g and σ g were sampled uniformly from µ g ∈ [−0.5, 0.5] and σ g ∈ [0, 0.8], respectively.
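The generative process above can be sketched as follows. The function name, the fixed noise level `sigma_s`, and the per-step switching test are assumptions made for illustration; the text does not state how σ s was chosen at each step.

```python
import numpy as np

def generate_task(n_steps, p_t=0.03, sigma_s=0.2, seed=0):
    """Sample generator parameters, true values, and noisy observations."""
    rng = np.random.default_rng(seed)
    mu_g, sigma_g = rng.uniform(-0.5, 0.5), rng.uniform(0.0, 0.8)
    out = {k: np.empty(n_steps) for k in ("mu_g", "sigma_g", "y_true", "s")}
    for t in range(n_steps):
        # With probability p_t, the generator is replaced by a new one.
        if rng.random() < p_t:
            mu_g, sigma_g = rng.uniform(-0.5, 0.5), rng.uniform(0.0, 0.8)
        y = rng.normal(mu_g, sigma_g)          # true value from the generator
        out["mu_g"][t], out["sigma_g"][t] = mu_g, sigma_g
        out["y_true"][t] = y
        out["s"][t] = rng.normal(y, sigma_s)   # observed signal = true value + noise
    return out
```

Setting `p_t=0.0` gives the fixed-generator condition, while larger values of `p_t` give more frequent switching.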
As mentioned in the Introduction, the prior distribution needed for Bayesian estimation must be estimated from the observed signal so that it is close to the generator distribution. Then, u(t) for Eq. 1 (or Eqs. 4 and 5) is given by using the Probabilistic Population Code (PPC), which has been proposed as the neural basis for Bayesian inference [Ma et al., 2006]. PPC assumes that the information in a signal is encoded by a population of neurons with a position-based preferred stimulus that fires probabilistically according to a Poisson distribution. It has been shown that neural networks with a population of neurons following PPC as the input layer can learn probabilistic inference effectively [Orhan and Ma, 2017]. Therefore, in this study, we also assumed that the activity u of the input-layer neurons encoding the observed signal followed the PPC model. The activity u i of each input neuron was sampled from a Poisson distribution whose rate is f i (s) [Ichikawa and Kataoka, 2022]. Here, s is the observed signal generated from y true by adding noise, and f i is the tuning curve of the neurons. This selective firing occurs in proportion to the gain when the observed signal is generated. This gain is inversely proportional to the noise variance as g = 1/σ 2 l , and corresponds to signal clarity. Namely, the gain decreases as the observation noise, and hence the uncertainty, increases [Tolhurst et al., 1983]. Considering the gain, the tuning curve f i is scaled by g and centered on φ i , where φ i represents the preferred stimuli of neurons in the input layer. It was assumed that φ i follows an arithmetic sequence for i (φ i = −1/2 + i/m when the number of neurons in the input layer is m) [Swindale, 1998]. Also, σ 2 PPC is a constant that represents the ease of firing and was set as σ 2 PPC = 1/2 in this study. In this task, the true value y true was to be estimated based on the input signal u. Therefore, training was performed to minimize the mean squared error (MSE) between the neural network output y(t) and the true value y true (t). Note that the loss function was not based on the Bayesian optimal value calculated from the generator distribution and the noise in the observed signal but only calculated based on the true value.
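The PPC input encoding can be sketched as below. The Gaussian shape of the tuning curve, its width convention (dividing by 2σ 2 PPC), the index range i = 1, ..., m, and the default m = 20 are assumptions made for illustration; only the gain g = 1/σ 2 l , the preferred stimuli φ i = −1/2 + i/m, and σ 2 PPC = 1/2 come from the text.

```python
import numpy as np

def ppc_encode(s, sigma_l, m=20, sigma_ppc_sq=0.5, rng=None):
    """Poisson spike counts of m input neurons for an observed signal s."""
    rng = np.random.default_rng() if rng is None else rng
    phi = -0.5 + np.arange(1, m + 1) / m      # preferred stimuli (i = 1..m assumed)
    gain = 1.0 / sigma_l ** 2                 # noisier observation -> lower gain
    # Assumed Gaussian tuning curve centered on each preferred stimulus.
    rates = gain * np.exp(-(s - phi) ** 2 / (2.0 * sigma_ppc_sq))
    return rng.poisson(rates), rates
```

Because the gain scales every rate, a noisier observation produces fewer spikes overall, so the population activity carries both the signal value and its reliability.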
Training was performed by the backpropagation method [Rumelhart et al., 1986, Werbos, 1990]. Adam [Kingma and Ba, 2014], an efficient stochastic gradient descent method, was used for optimization. The batch size of training samples was set to 50, and the weight decay rate was set to 0.0001; training was performed for 6000 iterations (see Table 1 for the hyperparameters used in the experiment).

Results 1: Fixed structure and time scales

Bayesian optimality
Because the generated signal s was observed under noise, the neural network was required to estimate the true value sampled from the generator. If the information from the generator were known, y true would be estimated by minimizing the long-term MSE, which gives the optimal y value as follows (maximum a posteriori (MAP) estimation [Bishop, 2006]):
y opt = (σ 2 g s + σ 2 s µ g ) / (σ 2 g + σ 2 s ). (10)
Figure 2: The output y of the RNN against the observed signal value s. Before s is input, the network receives a time series of signals sampled from the normal distribution with mean µ g = 0.5 and standard deviation σ g = 0.5, to which noise with standard deviation σ s = 1/5 is added. The accuracy can be increased by estimating the prior from the signals input before s and performing Bayesian inference. Blue points represent the µ g value, orange points represent the output of the RNN y, and green points represent the maximum likelihood estimate y M L = s. The result is for a model with α s = 0.1.
However, as described in the "Task" section, the information from the generator was not explicitly given to the neural network, so it must be estimated from observed signals as a prior distribution. First, we examined whether the neural network could achieve this prior-based estimation.
The output y of the RNN with modular structure trained with α s = 0.1, when given an observed signal s, is shown in Fig. 2. The true value was sampled from the generator with µ g = 0.5 and σ g = 0.5, and noise with σ s = 1/5 was added to obtain s. The green points represent the maximum likelihood estimate y M L , which has the highest accuracy when no prior information is available. Here, this estimate is equal to the observed signal s. The blue points represent y opt given by the MAP estimation, and the orange points represent the actual neural network output y. Fig. 2 shows that the output of the RNN is closer to the blue points y opt than to the green points, indicating that approximate (near-optimal) Bayesian inference with a well-estimated prior is achieved (the mean squared error between y and y M L is 0.15, whereas that between y and y opt is 0.019).
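The benefit of the prior can be checked numerically with the standard Gaussian result, in which the MAP estimate is a precision-weighted average of the observation s and the generator mean µ g . The sketch below uses the parameter values of Fig. 2; the function name and sample size are illustrative choices, and it evaluates the ideal estimators rather than the trained network.

```python
import numpy as np

def map_estimate(s, mu_g, sigma_g, sigma_s):
    """Posterior mean (= MAP) for a Gaussian prior and Gaussian noise."""
    w = sigma_g ** 2 / (sigma_g ** 2 + sigma_s ** 2)  # weight on the observation
    return w * s + (1.0 - w) * mu_g

rng = np.random.default_rng(0)
mu_g, sigma_g, sigma_s = 0.5, 0.5, 0.2
y_true = rng.normal(mu_g, sigma_g, size=10_000)
s = rng.normal(y_true, sigma_s)

mse_ml = np.mean((s - y_true) ** 2)  # maximum likelihood: y_ML = s
mse_map = np.mean((map_estimate(s, mu_g, sigma_g, sigma_s) - y_true) ** 2)
print(f"MSE (ML):  {mse_ml:.4f}")
print(f"MSE (MAP): {mse_map:.4f}")
```

In expectation, the ML error equals σ 2 s = 0.04, while the MAP error equals σ 2 g σ 2 s /(σ 2 g + σ 2 s ) ≈ 0.0345, so using the prior lowers the error whenever the generator variance is finite.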
Next, we examined the optimality of the Bayesian estimation for networks with and without modular structures and time scale differences. Fig. 3(a) shows the MSE between y and y opt for the RNN trained under each condition. This result shows that the modular structure improved the accuracy of Bayesian estimation, which was further increased when α s decreased to an appropriate degree. In fact, we found an optimal time scale of α s between 0.1 and 0.2, at which maximum accuracy was achieved. Even without modular structure, the time scale difference contributed to inference accuracy, but the accuracy increased significantly with both the modular structure and time scale difference.

Adjustability to rapid generator switching
So far, we studied the performance of Bayesian inference models under a fixed generator to compare the accuracy of Bayesian inference itself. Next, we examined their performance when the generator changes in time. To perform Bayesian inference for a rapidly changing input, the model must quickly approach the new optimal value y opt to yield a good estimate. To verify the accuracy of the RNN in this case, we compared the MSE between y true (t) generated by the generator and the output y(t) of the RNN under various p t (Fig. 3(b)). The model with α s = 0.1 was found to be more accurate for all values of p t .
Figure 3: (a) MSE between the optimal value y opt (t) and the output of the RNN y(t), plotted against the time scale α s . Circles (•) denote networks with modular structure and crosses (×) those without. RNNs with a modular structure are more accurate. In addition, those with α s between 0.1 and 0.2 have the lowest error. (b) MSE between the true value y true (t) and the output of the RNN y(t) for the network with α s = 1 (•) and α s = 0.1 (×). The value increases as p t increases, but the model with α s = 0.1 is always more accurate.
As a special case, we considered a setting where the input moves back and forth between two generators, A and B. Then we examined which generator's distribution the prior estimated by the RNN was closer to. Specifically, we adopted the generator A with (µ g , σ 2 g ) = (µ A , σ 2 A ) and the generator B with (µ g , σ 2 g ) = (µ B , σ 2 B ) and computed an index a(t) that compares the output with the Bayesian optimal estimates under each generator, y A opt and y B opt .
When a(t) is close to 1, the model's prior is closer to generator B, and when a(t) is close to −1, it is closer to generator A.
Comparing the change in a(t) between the model with α s = 0.1 and the model with α s = 1, we found that the model with α s = 0.1 adjusted better to the generator change, as shown in Fig. 4(a). This result shows that the model with α s = 0.1 was more responsive to the changes of the generators and recognized the generator change more quickly in all runs. The difference between the two models was especially pronounced in the extreme case in which the two generators switched at every time step (Fig. 4(b)). Intuitively, having a population of slow neurons would seem to be a disadvantage in responding to rapid environmental changes, but the results showed the opposite. The network with α s = 1 could not follow rapid input changes, whereas that with α s = 0.1 could estimate the input prior effectively. We discuss the importance of slow neurons in responding to rapid changes below.

Representation of the prior
We investigated how the slow sub-module facilitated improved prior representation for Bayesian inference. Beginning from the hypothesis that a group of downstream slow neurons represents the prior by integrating the observed signal over time, we investigated which of the main module and sub-module was responsible for the prior information in the modular RNN.
Here, by using the prior information, the estimated value was shifted from the observed signal s to an appropriate value y opt (Eq. 10). In other words, even given the same signal input s, the output varied depending on which time series signal was input before s (because the prior estimation changed). Even if one module returned to its original state, the output shifted from s because the prior information remained in the other module. The scale of this change is considered to represent the degree to which the module utilizes the estimated prior information. Therefore, it is possible to estimate the extent to which each module plays a role in prior information processing by examining the change in the output y(t) when the internal state of the main module or the sub-module is changed to the value corresponding to a different prior.
First, let x m (µ g , σ g ) and x s (µ g , σ g ) be the internal states of the main module and the sub-module, respectively, when the input signal s from a generator (µ g , σ g ) is applied for a certain period. Because the output y is determined by the internal states of the two modules and the input signal, it can be written as y(x m (µ g , σ g ), x s (µ g , σ g ), s, σ s ). From this, the change in output y is computed by fixing one of the two modules and varying the other to a different internal state, x i (µ g , σ g ) → x i (µ′ g , σ′ g ).
The degree of change in y represents the impact on the output of each module reflecting the prior information. Hence, by comparing the variance of y when x s is varied with x m fixed against the variance when x m is varied with x s fixed, it is possible to estimate how much each module is responsible for the prior representation. Specifically, we fixed one of the modules at µ g = 0, σ g = 0.4 (the midpoints of the ranges −0.5 ≤ µ g ≤ 0.5 and 0 ≤ σ g ≤ 0.8), i.e., x i (0, 0.4), while for the other module µ g and σ g were varied, giving x j (µ p , σ p ). Then, we calculated the variance of y as
V s = ⟨Var (µp,σp) [y(x m (0, 0.4), x s (µ p , σ p ), s, σ l )]⟩ (s,σ l ) , (12)
V m = ⟨Var (µp,σp) [y(x m (µ p , σ p ), x s (0, 0.4), s, σ l )]⟩ (s,σ l ) , (13)
where Var (µp,σp) [ ] denotes the variance over the changes of (µ p , σ p ), and ⟨ ⟩ (s,σ l ) denotes the average over the changes of (s, σ l ). A larger V s or V m indicates that the sub-module or the main module, respectively, strongly translates differences in the prior distribution into differences in the output.
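The variance-based measure above can be sketched generically as follows. Here `output_fn` is a hypothetical stand-in for the trained network's output y(x m , x s , s, σ l ), and the function and argument names are illustrative, not taken from the original work.

```python
import numpy as np

def prior_contribution(output_fn, fixed_state, varied_states, probes, vary="sub"):
    """Average over probes of the output variance across varied module states.

    output_fn(x_m, x_s, s, sigma_l) -> scalar output y (hypothetical interface).
    fixed_state: state of the module held at the reference prior (0, 0.4).
    varied_states: states of the other module, one per (mu_p, sigma_p).
    probes: iterable of (s, sigma_l) test inputs.
    """
    variances = []
    for s, sigma_l in probes:
        if vary == "sub":   # V_s: main module fixed, sub-module varied
            ys = [output_fn(fixed_state, x_s, s, sigma_l) for x_s in varied_states]
        else:               # V_m: sub-module fixed, main module varied
            ys = [output_fn(x_m, fixed_state, s, sigma_l) for x_m in varied_states]
        variances.append(np.var(ys))
    return float(np.mean(variances))
```

If the output barely changes when one module's state is swapped, that module carries little of the prior information used for the estimate, and its index is close to zero.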
Dependencies of V s and V m on α s are shown in Fig. 5. This result shows that when α s = 1 (i.e., the time scale is uniform), both the main and sub-modules contribute to the representation of the prior distribution to the same degree.
Conversely, when α s is between 0.1 and 0.5, V s is much larger than V m , meaning that the sub-module selectively contributes to the representation of the prior. In particular, when α s = 0.1 and 0.2, the differentiation of representation between the main and sub-modules is more pronounced. Note that the contribution of the main module is large when α s = 0.01, probably because the time scale of the sub-module is too slow to code the information of the prior. Comparing Fig. 5 and Fig. 3 shows that highly accurate Bayesian inference is achieved when the prior distribution information is localized in the sub-module.
Next, we investigated how the prior is represented by the main and sub-modules by visualizing the neural activity with principal component analysis (PCA) [Mante et al., 2013, Ichikawa and Kaneko, 2021]. First, x m (µ g , σ g ) and x s (µ g , σ g ) were computed for various (µ g , σ g ) in a model with α s = 0.1, and PCA was applied to them. The results were projected on a plane using the first and second principal components and color-coded according to µ g and σ g (Fig. 6(a,b)). The neural activity in the main module was loosely distributed on a one-dimensional manifold, represented by the first principal component (PC1). This PC1 approximately corresponded to the µ g value, although the distinction was not clear. In contrast, the activity in the sub-module was clearly represented on a two-dimensional manifold, as in Fig. 6(b2), where PC1 corresponds to µ g and PC2 corresponds to σ g rather well.
Then, we performed the same analysis on the model with α s = 1 (Fig. 6(c,d)). In this case, the manifolds of neural activities for the main and sub-modules did not differ significantly. Both were represented in a one-dimensional manifold corresponding to µ g ; there was no axis corresponding to σ g . The decodability of σ g achieved in the internal states of the sub-module with α s = 0.1 was not observed for α s = 1. In fact, the coefficient of determination when σ g was calculated by Ridge regression from the internal state of the sub-module with α s = 0.1 was 0.68, while that using the sub-module with α s = 1 was −0.03. This suggests that the model with α s ∼ 0.1 can better distinguish the input's variance from noise to accurately perform Bayesian inference.
Figure 7: a k defined by Eq. 14 is plotted against k, for the model with α s = 1 and α s = 0.1 using 3000 data points.
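The ridge-regression decoding of σ g from module states can be sketched as below. The closed-form solver, the regularization strength `lam`, and the train/test split are assumptions; the text reports only the resulting coefficients of determination.

```python
import numpy as np

def ridge_r2(X_train, y_train, X_test, y_test, lam=1.0):
    """Fit ridge regression in closed form and return R^2 on held-out data.

    X_*: (n_samples, n_neurons) module states; y_*: (n_samples,) targets.
    """
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc, yc = X_train - x_mean, y_train - y_mean
    # Ridge solution: w = (X^T X + lam * I)^(-1) X^T y on centered data.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(Xc.shape[1]), Xc.T @ yc)
    pred = (X_test - x_mean) @ w + y_mean
    ss_res = np.sum((y_test - pred) ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

On held-out data, R² can fall below zero when the decoder predicts worse than the mean of the targets, which is how a value such as −0.03 can arise.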
When the generator changed rapidly, the variance of the prior was larger than the variance of the generator, as shown in the SI for the case with α s = 0.1. When σ g was large, as seen from Eq. 10, the influence of the observed signal s was larger than that of µ g , allowing the model to "keep up" with large changes in the observed signal. This explains the higher adjustability to rapid generator changes as seen in Fig. 4.

Effects of different time scales
To examine the impact of α s differences on Bayesian inference accuracy in detail, we considered how each model with α s = 1 and α s = 0.1 represents the prior as a function of the input signal. As seen in Fig. 6, when the generator is constant, the internal state of the neural network corresponds with the state of the generator (µ g , σ g ). Conversely, when the generator changes, the internal state at a certain time does not necessarily correspond to the state of the generator at that time because some time is needed to estimate the state of the prior after the generator switches. Let µ p and σ p be the mean and standard deviation of the prior used by the neural network to compute y(t). To track the generator, µ p must retain the input s(t) over a certain time window, in the form
µ p (t) = Σ k a k s(t − k). (14)
To examine how many past steps k are memorized, µ p must be estimated. This can be achieved by estimating µ p from the internal state x(t).
First, we calculated the internal state x for the observed signal with a fixed generator instead of a time-varying case. Then, we found the transformation matrix W µp from the internal state x to the recognized prior µ p by assuming that µ p can be represented by a linear transformation of the internal state as µ p = W µp x. This transformation matrix W µp was obtained by a pseudo-inverse method [Schrauwen et al., 2007] (SI).
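The pseudo-inverse readout can be sketched as follows. It assumes that, under a fixed generator, the regression target for µ p is the generator mean µ g ; the function name and array layout are illustrative choices.

```python
import numpy as np

def fit_prior_readout(X, mu_target):
    """Least-squares readout w such that w @ X approximates mu_target.

    X: (n_neurons, n_samples) internal states collected under fixed generators.
    mu_target: (n_samples,) target prior means (assumed equal to mu_g here).
    """
    # Moore-Penrose pseudo-inverse gives the minimum-norm least-squares solution.
    return mu_target @ np.linalg.pinv(X)

# Usage: apply the fitted readout to states from a time-varying run.
rng = np.random.default_rng(0)
X_fixed = rng.normal(size=(8, 200))             # toy states, 8 neurons
w_true = rng.normal(size=8)
w_fit = fit_prior_readout(X_fixed, w_true @ X_fixed)
mu_p_estimate = w_fit @ rng.normal(size=(8, 50))  # estimated prior mean per time step
```

The same readout weights are then reused on states recorded while the generator switches, which gives an estimate of the prior mean at each time step.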
Next, we obtained x(t) for the time-varying signal with a switching probability of p t = 0.03. By applying the above transformation matrix W µp to x(t) obtained at this time, the prior µ p was estimated accordingly. The state of the prior was thus obtained for the time series of the observed signal s(t).
Then, a k in Eq. 14 was obtained to minimize the difference between the two sides of Eq. 14. Because the obtained coefficients correspond to the contribution of the signal k time steps earlier, we could estimate the extent to which the neural network uses past information when estimating the prior. Each was normalized so that the maximum value was 1.
The estimated coefficients of Eq. 14 were plotted against k (Fig. 7), revealing that the model with α s = 0.1 used more past information in estimating prior information than the model with α s = 1. This difference in time windows leads to a difference in accuracy for prior encoding.
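The fit of the coefficients a k can be sketched as an ordinary least-squares problem on lagged copies of the signal. The number of lags, the choice to start at k = 0, and the function name are assumptions made for illustration.

```python
import numpy as np

def fit_memory_kernel(s, mu_p, n_lags):
    """Fit mu_p(t) ~ sum_k a_k * s(t - k) for k = 0..n_lags-1 by least squares."""
    T = len(s)
    # Column k holds s(t - k) for t = n_lags-1, ..., T-1.
    S = np.stack([s[n_lags - 1 - k: T - k] for k in range(n_lags)], axis=1)
    target = mu_p[n_lags - 1:]
    a, *_ = np.linalg.lstsq(S, target, rcond=None)
    return a, a / np.max(a)   # raw and max-normalized coefficients

# Toy check: a signal filtered with a known, decaying kernel.
rng = np.random.default_rng(0)
s = rng.normal(size=500)
a_true = 0.5 ** np.arange(5)
mu_p = np.convolve(s, a_true)[:500]
a_fit, a_norm = fit_memory_kernel(s, mu_p, n_lags=5)
```

A kernel that decays slowly in k indicates that the network integrates the signal over a long window, which is the comparison made between α s = 0.1 and α s = 1.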
Results 2: Modular structure organization and time-scale separation by learning
So far, we investigated neural networks with fixed modular structures and fixed time scales and demonstrated that those with fast and slow modules effectively represented the prior distribution. Next, we investigated whether such a structure would emerge by training a neural network to predict y true . We again used the same neural network model as the regular RNN:
x(t + 1) = (I − α)x(t) + αReLU(W in u(t) + W rec x(t)) + √α ξ, (15)
where α represents a vector of time scales of neurons consisting of α i . These α values, as well as the elements of W, were changed by training, starting from initial values set randomly according to N (0.5, 0.1). During training, each matrix W and α were optimized by the gradient descent method [Perez-Nieves et al., 2021] at each step. The number of neurons in the recurrent layer of the neural network was set to 80.
The change in α distribution during the learning task is shown in Fig. 8(a).As shown, α split into two groups over the learning period: one with large values close to 1 and the other with small values near 0.1.
Next, we measured the contribution to the prior representation, as examined in the "Representation of the prior" section, for groups of neurons with large α (neurons with α i > 0.8) and groups of neurons with small α (neurons with α i < 0.2) at three epochs in the learning process (Fig. 8(b)). We found that after 10000 epochs, the slow neurons were responsible for the representation of the prior distribution, as in the model with α s = 0.1 in the fixed time scale setting.
Finally, we investigated the neural network structure shaped by training. Based on the distribution in Fig. 8(a), the recurrent layer neurons of the network at epoch 10000 were split into three groups according to the magnitude of α i : slow neurons with α i < 0.2, fast neurons with α i > 0.8, and the remaining neurons with 0.2 ≤ α i ≤ 0.8. The average connectivity between the input layer, each group, and the output layer is shown in Fig. 8(c) [Yang et al., 2019]. The connection from the input layer to the group of fast neurons and that from the fast neurons to the output layer were distinctively larger than those to or from the slow neurons. Among connections within the recurrent layer, those between the fast and slow neurons were larger than others. In summary, a modular structure, shown in Fig. 1(b), emerged through learning alone.
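The grouping by α i and the group-level connectivity can be sketched as follows. Measuring connectivity as the mean absolute weight, and the weight-matrix layout (rows are target neurons), are assumptions made here; the text states only that average connectivity was computed and normalized so that the maximum was 1.

```python
import numpy as np

def group_connectivity(W_in, W_rec, W_out, alpha, slow_thr=0.2, fast_thr=0.8):
    """Mean absolute weight between input, alpha-defined groups, and output.

    W_in: (n_rec, n_in), W_rec: (n_rec, n_rec) with W_rec[i, j] = j -> i,
    W_out: (n_out, n_rec), alpha: (n_rec,) learned time-scale parameters.
    """
    groups = {"slow": alpha < slow_thr, "fast": alpha > fast_thr}
    groups["other"] = ~(groups["slow"] | groups["fast"])
    conn = {}
    for name, mask in groups.items():
        if not mask.any():
            continue
        conn[("input", name)] = np.abs(W_in[mask, :]).mean()
        conn[(name, "output")] = np.abs(W_out[:, mask]).mean()
        for src, src_mask in groups.items():
            if src_mask.any():
                conn[(src, name)] = np.abs(W_rec[np.ix_(mask, src_mask)]).mean()
    peak = max(conn.values())
    return {pair: value / peak for pair, value in conn.items()}
```

Each key is a (source, target) pair, so a modular organization would appear as large values for ("input", "fast") and ("fast", "output") and small values for the corresponding slow-neuron pairs.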

Discussion
In this study, we demonstrated that neural networks with slow and fast activity modules play an essential role in the prior representation for Bayesian inference. We set up a task to predict a time-varying signal under noise that could be estimated by Bayesian inference and trained RNNs with or without modular structure and with or without time scale differences.
The RNN could learn to approximate Bayesian inference using the prior (approximating the generator distribution) in all conditions tested. However, the accuracy was higher in the modular RNN; further, the accuracy was significantly higher when the time scale of the sub-module was moderately slower than that of the main module. In addition, the increase in accuracy was pronounced against a rapidly varying input, for which it was necessary to generate a prior that changes quickly. To achieve such accuracy with a slow sub-module, the sub-module was found to dominantly represent the prior, indicating role differentiation between representation of the prior and representation of the observed signal (likelihood). Of note, such functional differentiation is caused by differences in time scales. This result is consistent with experimental observations in the brain in which areas that code the prior and likelihood in Bayesian inference are different [Vilares et al., 2012, Chan et al., 2016, d'Acremont et al., 2013]. Finally, it was shown that a modular structure with distinct time scales was spontaneously organized in the RNN by learning.
It is important to note that a relatively slow time scale of the neuron population encoding the prior is required, but the difference between fast and slow neurons should not be excessive. If α s is too small (i.e., the sub-module is too slow), the accuracy decreases (Fig. 3), and in that case the sub-module is not responsible for representing the prior (Fig. 5). This is because prior construction requires a larger time span to address changes in external input for a neural network with such a slow time scale. Therefore, we suggest that there is an optimal time scale for the slow sub-module. Future research should investigate how this optimal time scale depends on the time scale of environmental changes.
It has been suggested that the time scale of neurons slows down hierarchically from the area where the signal is directly applied to the area where information is processed [Murray et al., 2014, Cavanagh et al., 2020, Golesorkhi et al., 2021]. This time scale hierarchy with a modular structure [Yamashita and Tani, 2008] is suggested to be relevant to information processing [Kurikawa and Kaneko, 2021, Yamashita and Tani, 2008, Tanaka et al., 2022]. Our study showed that modular structures with two-level time scales could deal with slowly changing inputs. A deeper modular structure with multiple time scales may be necessary to deal with further complex changes in environments. With such a structure, Bayesian inference against complex temporal changes could be achieved by extrapolating the results of this study. Further research verifying this finding will elucidate the significance of hierarchical structuring in the brain. It is noteworthy that the time scale separation was not only found to be influential for accurate Bayesian inference but also emerged from learning in our simulation. Considering these findings, a similar process may be expected in evolution [Yamaguti and Tsuda, 2021].
The modular network with slow/fast time scales could integrate out noise and distinguish the average change in the inputs from fast noise. In fact, the network could effectively predict temporal changes in the input, even under rapidly changing conditions. The brain must adapt to time-varying, noisy inputs; hence, the performance of Bayesian inference by the network design reported herein is considered relevant to brain information processing. We adopted a simple RNN and trained it using backpropagation. Backpropagation is often believed to be different from the learning algorithm implemented in the brain [Bengio et al., 2015, Lillicrap et al., 2016], so care should be taken when generalizing our results. However, previous studies have also suggested that neural networks obtained by backpropagation can show similar behavior to that of the actual brain [Richards et al., 2019, Yang and Wang, 2020, Mante et al., 2013, Barak et al., 2013, Cueva and Wei, 2018, Yamins and DiCarlo, 2016, Haesemeyer et al., 2019]. With these considerations, our findings are considered to be relevant to the brain's learning processes despite the potential limitation of backpropagation.
Unraveling the relationship between the structure of neural networks, neural dynamics, and the information processing performed by the brain is a primary goal in computational neuroscience [Mastrogiuseppe and Ostojic, 2018, Dubreuil et al., 2022, Vyas et al., 2020, Beiran et al., 2021]. In this study, the relevance of modular structure and time scale difference in neural dynamics to the representation of the prior in Bayesian inference is demonstrated, as well as their formation by learning [Lorenz et al., 2011, Kashtan and Alon, 2005], which will support ongoing research in the field.

Figure 1: Schematic of the RNN. (a) Standard RNN without modular structure. (b) RNN with modular structure.

Figure 5: Division of roles for representing the prior distribution. V s and V m , defined in the text (Eqs. 12 and 13), are plotted for different values of α s , computed over 1000 samples of data. V s and V m represent the degree to which the sub-module and the main module are responsible for prior-based information processing. When α s = 0.2 or 0.1, the sub-module selectively contributes to the representation of the prior.

Figure 8: RNN features obtained when α is also trained. (a) Frequency distribution of α for all neurons at 200, 1000, and 10000 learning epochs. At 10000 epochs, the learning process was complete. (b) Division of roles V slow and V fast . See the "Representation of the prior" section for definitions of V slow and V fast ; V slow (V fast ) was computed for neurons with α < 0.2 (α > 0.8). (c) The average degree of RNN connections at 10000 epochs. Connections between the input layer, recurrent neurons with α < 0.2, 0.2 ≤ α ≤ 0.8, and α > 0.8, and the output layer. Each was normalized so that the maximum value was 1.

Figure 11: (a) Comparison between the estimated mean of the prior µ p and the mean of the generator µ g . (b) Comparison between the linear weighted sum of past signals s(t − k) and the estimated mean of the prior µ p .