A new method of Bayesian causal inference in non-stationary environments

Bayesian inference is the process of narrowing down the hypotheses (causes) to the one that best explains the observational data (effects). To estimate a cause accurately, a large amount of data must be observed, for as long as possible. However, the target of inference is not always stationary. In such cases, when a method such as the exponential moving average (EMA) with a discounting rate is used to improve the ability to respond to sudden changes, the discounting rate must be increased. That is, a trade-off is established: increasing the discounting rate improves the followability but reduces the accuracy. Here, we propose an extended Bayesian inference (EBI), wherein human-like causal inference is incorporated. We show that both learning and forgetting effects are introduced into Bayesian inference by incorporating causal inference. We evaluate the estimation performance of the EBI through the learning task of a dynamically changing Gaussian mixture model. In the evaluation, the EBI performance is compared with those of the EMA and a sequential discounting expectation-maximization (SDEM) algorithm. The EBI is shown to modify the trade-off observed in the EMA.


Introduction
The aim of Bayesian inference is to deduce the hidden cause behind observed data by retrospectively applying statistical inferences. The relationship between Bayesian inference and brain function has attracted significant attention in recent years in the field of neuroscience [1,2]. In Bayesian inference, the degree of confidence for each hypothesis is updated based on a predefined model for each hypothesis by incorporating the current observational data. In other words, Bayesian inference is a process of narrowing down hypotheses (causal candidates) to one that best explains the observational data (the effects).
As an example, consider a situation in which one attempts to read another person's emotions. Because one cannot directly view another's emotions, they can only be inferred from external cues (observation data), such as facial expressions and voice tones. The unknown emotions correspond to the hypotheses in Bayesian inference; that is, they are the inferred target. In addition, a probability distribution representing what type of facial expression appears in what proportion when the other person has a specified emotion corresponds to a model for each hypothesis. For example, if one has the model "If a person is pleased, the person will smile, with an 80% chance," and if one observes that person smiling frequently, one becomes more confident in the hypothesis that "the person is pleased." That is, by observing the data, the "effect" of a "smile," the emotion of "joy" is presumed as the cause of the "smile." Attention should be paid to the following two points. First, to infer someone's emotions more accurately, it is better to have as much observation data as possible, but only if the emotion does not change during the estimation. Emotions change from moment to moment. In such a non-stationary situation, it is necessary to consider whether the observation data are derived from the same emotion throughout the period observed for the estimation. This can be framed as a problem of online clustering of non-stationary data. Second, because a model for a stranger cannot be given in advance, it must be learned and constructed from the observed data. If this model is wrong, one cannot obtain correct estimates of the person's emotion from the observations.
Regarding the second point, there are methods, such as the expectation-maximization (EM) algorithm and the K-means algorithm, that perform inference and learning simultaneously. The EM algorithm is a method for obtaining the maximum likelihood estimate in a hidden-variable model [3], and it is often used for mixture models or latent-topic models, such as latent Dirichlet allocation [4]. K-means is the non-stochastic version of the EM algorithm [3].
In the previous example, hidden variables correspond to emotions such as joy or anger because they are not observable directly. By using the EM algorithm, a person's emotions can be estimated from the observed data while creating a model of emotion, based on the observed data. However, in the EM algorithm, it is necessary to provide all the observational data at one time. In practice, there are cases where data processing must be performed sequentially after each time the data are observed.
Various online algorithms have been proposed to deal with this situation [5][6][7][8][9][10][11][12][13]. For example, Yamanishi et al. [7] proposed a sequential discounting expectation-maximization (SDEM) algorithm that introduces the effect of forgetting to deal with non-stationary situations where the inferred target changes. The algorithm is used in the fields of anomaly detection and change-point detection. We previously proposed the EBI, which incorporates causal reasoning into Bayesian inference and, like the algorithms above, performs inference and learning simultaneously [14].
In the field of cognitive psychology, experiments on causal induction have been performed to identify how humans evaluate the strength of causal relations between two events [15][16][17][18][19]. In a regular conditional statement of the form "if p then q," the degree of confidence is considered to be proportional to the conditional probability P(q|p), the probability of occurrence of q given p [20]. In contrast, in the case of a causal relation, it has been experimentally demonstrated that humans feel a strong causal relation between a cause c and an effect e when P(c|e) is high, as well as when P(e|c) is high. Specifically, the causal intensity that people feel between c and e can be approximated by the geometric mean of P(e|c) and P(c|e). This is called the "dual-factor heuristics" (DFH) model [19]. If the causal intensity between c and e is denoted as DFH(e|c), then DFH(e|c) = √(P(e|c)P(c|e)) ≠ P(e|c). Note that DFH(c|e) = DFH(e|c) holds; such inference is called "symmetry inference." In this paper, we first describe the EBI, which replaces conditional inference in Bayesian inference with causal inference. Second, we show that the learning effect and forgetting effect are introduced into Bayesian inference by this replacement. Third, we evaluate the estimation performance of the EBI through the learning task of a dynamically changing Gaussian mixture model. In the evaluation, the performance is compared with SDEM, an online EM algorithm.

Bayesian inference
In Bayesian inference, several hypotheses h_k are first defined, and models (probability distributions of the data d under each hypothesis) are prepared in the form of conditional probabilities P(d|h_k). This conditional probability is called the "likelihood" when the data are fixed and the probability is regarded as a function of the hypothesis. In addition, the confidence P(h_k) for each hypothesis is prepared as a prior probability. That is, one must have some prior estimate of the probability that h_k is true.
Assuming that the confidence for hypothesis h_k at time t is represented as P_t(h_k) and that data d_t are observed, the posterior probability is calculated using Bayes' theorem as follows.

$$P_t(h_k \mid d_t) = \frac{P(d_t \mid h_k)\, P_t(h_k)}{P_t(d_t)} \qquad (1)$$
where P_t(d_t) is the marginal probability of d_t at time t, defined as follows.

$$P_t(d_t) = \sum_{k} P(d_t \mid h_k)\, P_t(h_k) \qquad (2)$$
Next, the posterior probability resulting from the analysis becomes the new prior estimate in the update.

$$P_{t+1}(h_k) = P_t(h_k \mid d_t) \qquad (3)$$

Combining formulas (1) and (3) gives the per-step update.

$$P_{t+1}(h_k) = \frac{P(d_t \mid h_k)\, P_t(h_k)}{P_t(d_t)} \qquad (4)$$
Each time data are observed, the inference progresses by updating the confidence for each hypothesis using formula (4). Note that in this process, the confidence P_t(h_k) for each hypothesis changes over time, but the model P(d|h_k) for each hypothesis does not change.
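The smile example above can be run through this per-step update. The sketch below is illustrative only: the likelihood values and hypothesis names are made up for the example, not taken from the paper.

```python
def bayes_update(prior, likelihoods):
    """One step of the sequential Bayesian update: the new confidence of each
    hypothesis is its likelihood times its prior, normalized by the marginal."""
    marginal = sum(l * p for l, p in zip(likelihoods, prior))
    return [l * p / marginal for l, p in zip(likelihoods, prior)]

# Two hypotheses about an emotion: h_0 = "pleased" (smiles 80% of the time),
# h_1 = "not pleased" (smiles 20% of the time).  Numbers are illustrative.
prior = [0.5, 0.5]
for observation in ["smile", "smile", "smile"]:
    lik = [0.8, 0.2] if observation == "smile" else [0.2, 0.8]
    prior = bayes_update(prior, lik)
```

After three observed smiles, the confidence has concentrated almost entirely on the "pleased" hypothesis, while the models themselves stayed fixed.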
If we focus on the recursiveness of P_t(h_k), formula (4) can be rewritten as follows.

$$P_{t+1}(h_k) = P_1(h_k) \prod_{i=1}^{t} \frac{P(d_i \mid h_k)}{P_i(d_i)} \qquad (5)$$

Here, the denominator P_i(d_i) is common to all hypotheses and can be treated as a constant. Therefore, if the normalization processing is omitted, formula (5) can be written as follows.

$$P_{t+1}(h_k) \propto P_1(h_k) \prod_{i=1}^{t} P(d_i \mid h_k) \qquad (6)$$
That is, the current confidence for a hypothesis is proportional to the prior probability multiplied by the likelihoods of all the data observed so far.

Extended Bayesian inference
In the EBI, based on the DFH model, we define the strength of the connection C(e|c) between two events c and e as follows.

$$C(e \mid c) = \left((1-\alpha)\, P(e \mid c)^m + \alpha\, P(c \mid e)^m\right)^{1/m} \qquad (12)$$
Similarly, we defined C(c|e) as follows.

$$C(c \mid e) = \left((1-\alpha)\, P(c \mid e)^m + \alpha\, P(e \mid c)^m\right)^{1/m} \qquad (13)$$
These formulas represent generalized weighted averages of P(e|c) and P(c|e). The generalized weighted average of variables x and y is expressed by the following formula using parameters α and m.

$$\mu(x, y \mid \alpha, m) = \left((1-\alpha)\, x^m + \alpha\, y^m\right)^{1/m}$$
When m < 0, the values at the endpoints cannot be defined, but as x approaches 0 or 1, μ(x, 1−x | 0.5, m) approaches 0. In other words, in the range m ≤ 0, if either x or y approaches 0, their mean also approaches 0.
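A minimal sketch of this generalized weighted mean, with the m → 0 limit handled explicitly as the weighted geometric mean (the placement of the weight α on the second argument is an assumption consistent with the formulas above):

```python
def gen_weighted_mean(x, y, alpha, m):
    """Generalized weighted mean of x and y.
    m = 1 gives the arithmetic mean, m = -1 the harmonic mean, and the
    m -> 0 limit is the weighted geometric mean x**(1-alpha) * y**alpha."""
    if m == 0:
        return x ** (1 - alpha) * y ** alpha
    return ((1 - alpha) * x ** m + alpha * y ** m) ** (1.0 / m)

# With alpha = 0.5 and m <= 0, the mean is dragged toward 0 whenever either
# argument approaches 0 -- the behavior described in the text.
```

For example, with α = 0.5 the choice m = 0 reproduces the DFH geometric mean, while m = 1 gives the ordinary average.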
Here, if we describe formulas (12) and (13) recursively and replace c and e with h_k and d_t, respectively, we obtain the following formulas.

$$C_{t+1}(d_t \mid h_k) = \left((1-\alpha)\, C_t(d_t \mid h_k)^m + \alpha\, C_t(h_k \mid d_t)^m\right)^{1/m} \qquad (14)$$

$$C_{t+1}(h_k \mid d_t) = \left((1-\alpha)\, C_t(h_k \mid d_t)^m + \alpha\, C_t(d_t \mid h_k)^m\right)^{1/m} \qquad (15)$$
Formula (15) can be rewritten as follows by using a Bayesian update.
In formula (17), a description of the normalization process for setting the confidence as a probability is omitted.
If α = 0, C_t(d_t|h_k) is not changed by formula (14), as shown below.

$$C_{t+1}(d_t \mid h_k) = \left(C_t(d_t \mid h_k)^m\right)^{1/m} = C_t(d_t \mid h_k)$$
In other words, if α = 0, formula (14) effectively disappears and the EBI reduces to Bayesian inference. In contrast, in the case of α > 0, the likelihood is modified by formula (14). In this study, we update only the likelihood of the hypothesis with the highest confidence at each time, instead of updating the likelihoods of all hypotheses. That is, the following formula is used instead of formula (14).
Hereafter, the hypothesis with the highest confidence at time t is denoted by h^t_max. If there are multiple hypotheses with the highest confidence, one of them is selected at random.
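For intuition, the two coupled updates (modify only the likelihood of h^t_max, then update all confidences Bayes-style) can be sketched on a discrete data alphabet. This is a simplified m = 0 sketch with illustrative numbers, not the paper's normal-distribution implementation:

```python
def ebi_step(conf, lik_table, d, alpha, eps=1e-10):
    """One EBI step (m = 0 case) on a discrete data alphabet.
    conf[k]      -- confidence C_t(h_k)
    lik_table[k] -- dict mapping each symbol d to C_t(d | h_k)
    Learning touches only the most-confident hypothesis; inference then
    updates every confidence with the modified likelihood."""
    K = len(conf)
    kmax = max(range(K), key=lambda k: conf[k])
    marginal = sum(lik_table[k][d] * conf[k] for k in range(K))
    # learning: geometric mixing pulls the likelihood toward the posterior,
    # C_{t+1}(d|h_max) = C_t(d|h_max) * (C_t(h_max) / C_t(d))**alpha
    lik_table[kmax][d] *= (conf[kmax] / marginal) ** alpha
    # inference: Bayes-like confidence update, with eps-smoothing so that a
    # confidence that once hits zero can recover
    new_conf = [lik_table[k][d] * conf[k] + eps for k in range(K)]
    z = sum(new_conf)
    return [c / z for c in new_conf], lik_table

conf = [0.7, 0.3]
lik = [{"a": 0.8, "b": 0.2}, {"a": 0.3, "b": 0.7}]
conf, lik = ebi_step(conf, lik, "a", alpha=0.5)
```

With α = 0 the likelihood table would be left untouched and the step would reduce to a plain Bayesian update.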
In the following, let us analyze the case of m = 0, that is, the geometric mean case. In the case of m = 0, formula (17) can be transformed as follows.
If we focus on the recursiveness of C_t(h_k), formula (21) can be rewritten as follows.
Here, the denominator C_i(d_i) is common to all hypotheses and can be considered a constant. Therefore, if the normalization processing is omitted, formula (20) can be written as follows.
This can be understood as indicating that the current confidence for a hypothesis is proportional to its prior probability multiplied by likelihoods whose weights decay with distance into the past. In the case of α = 0, that is, Bayesian inference, the current likelihood and the past likelihoods are weighted equally.
However, in the case of α = 1, C_{t+1}(h_k) ∝ C_t(d_t|h_k); that is, the confidence is calculated using only the current likelihood. Thus, it can be said that the EBI introduces the effect of forgetting the past history into Bayesian inference.
In the case of m = 0, with respect to h^t_max, formula (20) can be written as follows.
In the case of C_t(h^t_max) > C_t(d_t), the likelihood becomes larger, and in the case of C_t(h^t_max) < C_t(d_t), the likelihood becomes smaller. This means that the model is corrected based on the observed data. Thus, it can be said that the EBI introduces the effect of learning into Bayesian inference.

Testing a normal distribution model
Mean value estimation using normal distribution. In this study, we deal with a one-dimensional continuous probability distribution, such as the one-dimensional normal distribution, as a concrete model for each hypothesis. The model of hypothesis k at time t is denoted by F(d|θ_k^t), where θ_k^t represents the parameter of the model. In the case of the normal distribution,

$$F(d \mid \theta_k^t) = N(d \mid \mu, S) = \frac{1}{\sqrt{2\pi S}} \exp\left(-\frac{(d-\mu)^2}{2S}\right)$$

where μ and S represent the mean and variance, respectively. This section describes the mean value estimation; the variance estimation is described in the next section. When a normal distribution is used as a model, C_t(d|h_k) and C_t(d) are probability densities, while C_t(h_k) is a probability because the number of hypotheses is discrete and finite. Thus, when calculating formulas (23) and (24), a positive number Δ is introduced and the values are approximated as follows.
Here, θ^t_max represents the parameter of the distribution that serves as the model for h^t_max. In formula (25), the term Δ is common to all hypotheses and is canceled by normalization. Thus, if the normalization processing is omitted, it can be expressed as follows.
In formula (27), if the confidence for a hypothesis becomes zero once, it remains zero thereafter. To prevent this, normalization processing (smoothing) is performed by adding a small positive constant ε to the confidence of each hypothesis obtained by formula (27).
Here, K represents the total number of hypotheses. In this study, we set ε = 10 −10 .
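The ε-smoothing step described above can be written compactly; this is a sketch, with the confidence vector represented as a plain list over the K hypotheses:

```python
def smooth_normalize(conf, eps=1e-10):
    """Add a small eps to each of the K confidences so that a value that has
    reached zero can recover, then renormalize to a probability vector."""
    total = sum(c + eps for c in conf)
    return [(c + eps) / total for c in conf]

conf = smooth_normalize([0.5, 0.0, 0.5])
```

The hypothesis whose confidence hit zero is revived with a tiny positive value, so it can regain confidence later if the data start to favor it.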
Having observed the data d_t, the likelihood is changed to C_{t+1}(d_t|h^t_max) by formula (26). Concomitantly, the parameter of the model for the hypothesis is modified from θ^t_max to θ^{t+1}_max so that the following equation is satisfied.
If F is a normal distribution, Eq (29) can be described as follows.
Updating the variance from S^t_max to S^{t+1}_max is described in the next section. Solving formula (30) for μ^{t+1}_max leads to the following two solutions.
$$\mu_{1,2} = d_t \pm \sqrt{-2\, S^{t+1}_{max} \ln\left(\sqrt{2\pi S^{t+1}_{max}}\; C_{t+1}(d_t \mid h^t_{max})\right)} \qquad (31)$$

μ^t_max reflects the past observed data. We determine μ^{t+1}_max as the one closer to μ^t_max among the two solutions μ_1 and μ_2, to account for the past data as much as possible.
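Inverting the Gaussian N(d | μ, S) = C gives μ = d ± √(−2S ln(√(2πS)·C)); the selection of the solution closer to the past mean can be sketched as follows (the function name and arguments are illustrative, and the target likelihood is assumed to lie in the valid range):

```python
import math

def update_mean(d, S, target_lik, mu_prev):
    """Solve N(d | mu, S) = target_lik for mu and keep the solution closer to
    the previous mean, so that past observations are retained.
    Requires 0 < target_lik <= 1 / sqrt(2 * pi * S)."""
    r = -2.0 * S * math.log(math.sqrt(2.0 * math.pi * S) * target_lik)
    root = math.sqrt(max(r, 0.0))   # guard against tiny negative round-off
    mu1, mu2 = d + root, d - root
    return mu1 if abs(mu1 - mu_prev) <= abs(mu2 - mu_prev) else mu2

mu = update_mean(d=1.0, S=0.04, target_lik=1.5, mu_prev=0.7)
```

Substituting the returned mean back into the normal density reproduces the requested likelihood, confirming the inversion.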
However, to solve formula (29), C_{t+1}(d_t | h^t_max) needs to be within the range 0 < C_{t+1}(d_t | h^t_max) ≤ 1/√(2πS^{t+1}_max). Thus, we impose the following restriction after calculating C_{t+1}(d_t | h^t_max) using formula (26).

$$C_{t+1}(d_t \mid h^t_{max}) \leftarrow \min\left(\max\left(C_{t+1}(d_t \mid h^t_{max}),\ \varepsilon\right),\ \frac{1}{\sqrt{2\pi S^{t+1}_{max}}}\right)$$

where max(x, y) represents the larger of x and y, and min(x, y) represents the smaller. We set ε = 10^{−10}.
If K = 1, there is no other hypothesis. Therefore, the only hypothesis is always h^t_max, and its confidence is always 1. Consider the situation C_t(h^t_max) ≈ 1, including the case of K = 1. In this case, the confidence of every hypothesis other than h^t_max becomes almost 0, and C_t(d_t) ≈ C_t(d_t | h^t_max) follows from formula (16). Therefore, formula (26) can be transformed as follows.
If the right-hand side of formula (34) is denoted by f(x_t), where x_t = C_t(d_t | h^t_max), then f(x_t) = x_t^{1−α}(1/Δ)^α is a concave function.
In this study, we set Δ = √(2πS^t_max). In this case, C_{t+1}(d_t | h^t_max) approaches the vertex of the normal distribution whenever data d_t are observed.
As shown in formula (29), μ^{t+1}_max is determined to satisfy the condition C_{t+1}(d_t | h^t_max) = N(d_t | μ^{t+1}_max, S^{t+1}_max). This means that μ^{t+1}_max approaches the observation data d_t. Through the processing described above, the confidences for the hypotheses and the model for the hypothesis with maximum confidence are corrected whenever data are observed.
We will hereinafter refer to the latter process of modifying the model for h^t_max as inverse Bayesian inference [21][22][23][24]. If the former process of updating the confidences for the hypotheses is referred to as inference, inverse Bayesian inference can be called "learning" because it forms a model for a hypothesis rather than drawing an inference from it. Thus, although the two αs in formulas (26) and (27) are denoted by the same symbol, they can be called the "learning rate" and the "forgetting rate," respectively. The learning rate and forgetting rate could also be set as two independent parameters. However, when dealing with temporal alteration, as in this study, good performance is achieved when the two parameters have almost identical values. In contrast, it is preferable to set the parameters separately in spatial clustering.
Variance estimation using gamma distribution. Consider a random variable D that is the sum of n squares of data sampled from a normal distribution N(0,S) with mean 0 and variance S.
In this case, D follows a gamma distribution with shape parameter κ = n/2 and rate parameter λ = 1/(2S), as shown below.

$$f(D \mid \kappa, \lambda) = \frac{\lambda^{\kappa}}{\Gamma(\kappa)}\, D^{\kappa-1} e^{-\lambda D}$$

The mean of this distribution is κ/λ = nS. We set n = 20, that is, κ = 10. We use the gamma distribution as a model for estimating the variance from the observed data d_t.
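This distributional fact can be checked numerically: the sketch below draws many such sums D and compares their sample mean and variance with the gamma values nS and 2nS² (the seed and sample sizes are arbitrary choices for the check):

```python
import math
import random

random.seed(0)
n, S = 20, 0.05            # n squared draws from N(0, S)
trials = 20000
D = [sum(random.gauss(0.0, math.sqrt(S)) ** 2 for _ in range(n))
     for _ in range(trials)]

mean_D = sum(D) / trials
var_D = sum((x - mean_D) ** 2 for x in D) / trials
# gamma(shape = n/2, rate = 1/(2S)): mean = n*S = 1.0, variance = 2*n*S**2 = 0.1
```

The sample statistics land on the theoretical mean and variance to within sampling error, supporting the use of the gamma model for the variance estimate.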
First, the following D_k^t is calculated using the estimated mean μ_k^t obtained in the previous section. It is used as the input for the variance estimation instead of the observation data d_t.
Here, n k represents the number of times that hypothesis k has been the hypothesis with the highest confidence. The initial value of n k is 0. x % y represents the remainder when integer x is divided by integer y.
The model of hypothesis k at time t is F(D | θ_k^t) = f(D | θ_k^t), θ_k^t = (λ_k^t, κ). In this case, formula (29) is rewritten accordingly. For this equation to have a solution with respect to Y^{t+1}_max in the range Y^{t+1}_max < 0, −1/e ≤ Z^{t+1} < 0 must be satisfied. Therefore, the following restrictions are provided.
Here, W_{−1} and W_0 are the two branches of the Lambert W function, which satisfies z = xe^x, x = W(z). Similar to the case of estimating the mean value, Y^{t+1}_max is determined as the solution closer to the past value. From Y^{t+1}_max, the rate parameter λ^{t+1}_max can be calculated, and because λ^{t+1}_max = 1/(2S^{t+1}_max), the variance estimate is obtained as S^{t+1}_max = 1/(2λ^{t+1}_max). We use S^{t+1}_max as the variance estimate of the generation distribution at the next time step. Regarding the value of Δ, we consider the gamma distribution in which the current input value D^t_max is the mean value κ/λ, and define the output value for the input D^t_max as 1/Δ, based on the same argument as in the case of the normal distribution.
The group of processes described in this section and the previous section is summarized as an algorithm below.

Establish initial values for the parameters θ_k^1 of each hypothesis k.
Repeat the following whenever data d t are observed.
• Find the hypothesis h^t_max with the maximum confidence.
• Create the input data D^t_max for the variance calculation using formula (37).
• Update the likelihood C_{t+1}(D^t_max | h^t_max) of the hypothesis h^t_max for the input data D^t_max using formula (26).
• Correct the variance S^{t+1}_max of the model for the hypothesis h^t_max using formulas (41), (42), (43), and (44) to match the new likelihood C_{t+1}(D^t_max | h^t_max).
• Update the likelihood C_{t+1}(d_t | h^t_max) of the hypothesis h^t_max for the observed data d_t using formula (26).
• Correct the mean μ^{t+1}_max of the model for the hypothesis h^t_max using formulas (31) and (32) to match the new likelihood C_{t+1}(d_t | h^t_max).
• Set θ^{t+1}_max = (μ^{t+1}_max, S^{t+1}_max) as the new parameter of the model for the hypothesis h^t_max.

Sequential Discounting Expectation-Maximization Algorithm (SDEM)
This section describes SDEM, an online EM algorithm proposed by Yamanishi et al. [7]. In SDEM, the E and M steps are executed once each time a data point is observed. First, in the E step, the responsibility is calculated. The responsibility of the normal distribution k for the data d_t is calculated as follows.
Here, Σ_{k=1}^{K} π_k^t = 1 is assumed. π_k is called the "mixing weight" and represents the weight of each normal distribution.
Next, in the M step, the mixing weights, means, and variances of each normal distribution are updated. Weighting is performed to weaken the influence of older observation data by introducing the discounting rate β (0 < β < 1).
Regarding π_k^{t+1}, smoothing is performed to prevent it from becoming 0, followed by normalization, similar to the EBI.
We set γ = 0.001 for optimal performance. In the case of K = 1, q_1^t and π_1^t are always 1. In this case, formula (50) shows that the new estimated value is obtained as a convex combination of the current estimated value and the current observed data.
This represents the EMA. By combining formulas (47) and (52), P_{t+1}(h_k) can be described approximately as follows.
Taking the logarithm of both sides of formula (23) in the EBI, the following transformation can be made.
Comparing formula (55) with formula (54), the EBI differs in that it takes a logarithm and uses likelihood instead of posterior probability. The group of processes described above is summarized as an algorithm below.

Establish initial values for the variables θ_k^1 of each normal distribution k.
Repeat the following whenever data d t are observed.
• E step: Calculate the responsibility q t k for each normal distribution k of the observed data d t using formula (46).
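The E and M steps above can be sketched as one SDEM iteration on one-dimensional data. The sufficient-statistic bookkeeping (mu_bar, m2_bar) and the exact placement of the γ-smoothing are assumptions in the spirit of Yamanishi et al.'s update, not a transcription of formulas (46)-(51):

```python
import math

def sdem_step(d, pi, mu_bar, m2_bar, beta, gamma=0.001):
    """One SDEM iteration.  E step: responsibilities q_k.  M step: discounted
    statistics at rate beta, with gamma-smoothing keeping weights positive."""
    K = len(pi)
    mu = [mu_bar[k] / pi[k] for k in range(K)]
    var = [max(m2_bar[k] / pi[k] - mu[k] ** 2, 1e-8) for k in range(K)]
    dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
            * math.exp(-(d - mu[k]) ** 2 / (2 * var[k])) for k in range(K)]
    z = sum(dens)
    q = [x / z for x in dens]                      # E step: responsibilities
    pi = [(1 - beta) * pi[k] + beta * (q[k] + gamma) / (1 + gamma * K)
          for k in range(K)]                       # M step, smoothed weights
    mu_bar = [(1 - beta) * mu_bar[k] + beta * q[k] * d for k in range(K)]
    m2_bar = [(1 - beta) * m2_bar[k] + beta * q[k] * d * d for k in range(K)]
    return pi, mu_bar, m2_bar

# Two components started at means 0 and 5; data hover around 3, so one
# component should migrate toward 3.
pi, mu_bar, m2_bar = [0.5, 0.5], [0.0, 2.5], [0.5, 13.0]
for t in range(500):
    d = 3.0 + 0.1 * math.sin(t)
    pi, mu_bar, m2_bar = sdem_step(d, pi, mu_bar, m2_bar, beta=0.02)
mu_final = [mu_bar[k] / pi[k] for k in range(2)]
```

Because the old statistics decay at rate (1 − β), the mixture forgets the distant past and can track a drifting generation distribution.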

Simulation
To investigate the behavior of EBI, a simulation was performed. In the simulation, one random number d t is generated at each time from a certain normal distribution (the "generation distribution"). Then, the EBI estimates the generation distribution by observing d t .
In this study, we deal with a task in which the mean and variance of the generation distribution fluctuate randomly at each regular interval. Specifically, every 1000 steps, a random number from a uniform distribution of the range [0, 5] is generated, and the number is set as a new mean of distribution. Similarly, a random number from a uniform distribution of the range [0, 0.1] is generated, and the number is set as a new variance of the distribution.
Here, μ^t_correct and S^t_correct represent the mean and variance of the normal distribution used as the generation distribution at time t, that is, the correct values in this task.
rnd_t represents a random number generated from a continuous uniform distribution over the range [0, 1] at time t. Fig 1 shows an example of the time evolution of the observation data d_t in this task.
In the simulation, estimations were performed via the EBI, SDEM, and EMA for comparison. The parameter estimated by the EBI at time t is that of the model for h^t_max, that is, θ^t_max = (μ^t_max, S^t_max). The parameter estimated by SDEM is that of the distribution for which the observed data d_t have the highest responsibility among the normal distributions in the Gaussian mixture, that is, θ^t_{m_t} = (μ^t_{m_t}, S^t_{m_t}), m_t = argmax_k q_k^t. The mean μ^t_EMA and variance S^t_EMA for the EMA are updated as follows.
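The EMA baseline is a convex combination with the discounting rate β. A sketch of this update follows; the variance recursion shown is one common variant, assumed here rather than taken verbatim from the paper:

```python
def ema_step(d, mu, S, beta):
    """EMA update with discounting rate beta:
    mu' = (1 - beta) * mu + beta * d
    S'  = (1 - beta) * S + beta * (d - mu) ** 2   (one common variant)"""
    return (1 - beta) * mu + beta * d, (1 - beta) * S + beta * (d - mu) ** 2

# Feed a constant signal: the mean converges to it and the variance decays.
mu, S = 0.0, 1.0
for _ in range(300):
    mu, S = ema_step(2.0, mu, S, beta=0.1)
```

A larger β makes the estimate track sudden changes faster but also makes it noisier during stable periods, which is exactly the trade-off discussed in this paper.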

Fig 2(A) shows an example of the result by the EBI. In this simulation, the initial values of the mean and variance of the model for each hypothesis were set to μ_k^1 = 2.5 and S_k^1 = 0.05, respectively. Fig 2(A) shows the time evolution of the correct value μ^t_correct and that of the estimated result by the EBI with K = 10. Fig 2(B) shows the result obtained by the EBI with K = 1 (i.e., the result for inverse Bayesian inference). Fig 2(C) shows the estimation results obtained by three types of EMA with different discounting rates β. It is evident that for larger discounting rates, the responses to sudden changes are quicker, as expected, but the fluctuations increase during the stable period.
In the case of the EBI, it initially takes time to follow when the correct value changes suddenly. However, over time, there are cases where such changes can be handled almost instantly. In contrast, in the cases of inverse Bayesian inference and the EMA, the follow-up performance does not improve over time at all. Fig 3(A) shows the time evolution of the means μ_k^t of the models for the ten hypotheses used in the simulation of Fig 2(A). Fig 3(B) shows the time evolution of the hypothesis h^t_max with the maximum confidence. Initially, all hypothesis models are the same, but various hypothesis models are formed by learning over time. Additionally, it is evident that quick following is possible by appropriately switching the hypotheses. With the EBI set to K = 1, and with the EMA, the hypotheses cannot be switched, so such quick tracking cannot be achieved.

PLOS ONE
To evaluate the estimation performance of each method, the root-mean-square error (RMSE) between the estimated value and correct value is calculated as follows.

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=j}^{T+j-1} \left(\hat{\mu}_t - \mu^t_{correct}\right)^2}$$

Here, \hat{μ}_t and μ^t_correct represent the estimated value and the correct value at time t, respectively. T represents the period of evaluation.
Each interval of 1000 steps, from a change in the generation distribution to the next change, is divided into two halves. We use the RMSE of the first half as a measure of the inability to follow rapid changes and that of the second half as a measure of the inaccuracy of the estimation in the stable period. For the EMA, it is evident that there is a trade-off, i.e., the accuracy decreases as followability increases. The EBI can modify the trade-off observed in the EMA. In the case of K = 1, that is, even if only the inverse Bayesian inference is used, the trade-off can be improved, but the performance is improved as the number of components is increased. SDEM can also modify the trade-off but there is no noticeable difference depending on the number of components. Fig 5 shows the time evolution of the mean estimated value of each method where the number of components and discounting rate are selected to achieve the best performance. The correct value is also shown in the figure.
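The split evaluation described above can be sketched as follows; `est` and `correct` are hypothetical sequences, and the tiny interval in the demo is chosen only to make the example checkable by hand:

```python
import math

def rmse(est, correct):
    """Root-mean-square error between an estimated and a correct sequence."""
    return math.sqrt(sum((e - c) ** 2 for e, c in zip(est, correct)) / len(est))

def split_rmse(est, correct, interval=1000):
    """RMSE over the first half of each interval (a measure of followability)
    and over the second half (accuracy during the stable period)."""
    first, second = [], []
    for t, pair in enumerate(zip(est, correct)):
        (first if t % interval < interval // 2 else second).append(pair)
    return (rmse([e for e, _ in first], [c for _, c in first]),
            rmse([e for e, _ in second], [c for _, c in second]))

# Estimator lags by two steps after the correct value jumps to 3.0:
f_half, s_half = split_rmse([1.0, 1.0, 3.0, 3.0], [3.0] * 4, interval=4)
```

The first-half error captures the lag right after a change, while the second-half error captures residual inaccuracy once the estimate has settled.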

Simulations were performed to obtain the total RMSE calculated from the entire interval for each pair of parameters. Fig 7 shows the total RMSE for each pair of parameter values. In most α regions, the RMSE is low when m ≤ 0. When m exceeds 0, the RMSE increases.

Discussion and conclusions
In general, if a method such as the EMA with a discounting rate is used to improve the followability to a sudden change, it is necessary to increase the discounting rate. This means that recent data are weighted more heavily in the estimation. That is, as long as a constant discounting rate is used, a trade-off exists: the followability improves when the discounting rate is high, but the accuracy is reduced.
In this study, we simulated the task of estimating the distributions for data generation in a non-stationary situation wherein the distributions change suddenly. Consequently, the EBI proposed in this study successfully modified the trade-off observed in the EMA.
In addition, we compared the estimation performance of EBI with that of SDEM. The EBI showed higher estimation performance.
However, as shown in Fig 7, m must be 0 or less to achieve high performance. In the literature [14], we derived the α and m that best fit the causal strength felt by humans from formula (7) and the eight types of experimental data shown in the literature [15][16][17][18][19]. Accordingly, the values of α were in the range of 0.25 to 0.6; in other words, they were far from α = 0, which corresponds to conditional probability. In contrast, the values of m were, interestingly, in the range of −2.0 to −0.25, that is, negative in all eight experiments [14]. We did not determine the cause of these negative values in this study; this is a question for further study.
As shown in Fig 6, some bursts were observed in the variance estimates by the EMA and SDEM. The changes in the mean and variance occurred simultaneously in this simulation. Therefore, the delay in following the changes in the mean may cause confusion between the changes in the mean and those in the variance.
In the EBI, various models for the hypotheses are formed by inverse Bayesian inference, even if appropriate models are not given beforehand. After some models are accumulated, rough inference is performed by switching them and fine adjustment is performed by inverse Bayesian inference, thereby achieving both followability and accuracy. The situations where both learning and inference are performed also exist in daily life. For example, in estimating the emotions of others, one cannot have a complete model for someone else's emotions because "you" are not "them." Assume that someone's facial expression suddenly changed when you estimated that the person feels happy based on your currently incomplete model. Further, assume that it was the first facial expression you saw. At this time, it is possible to think that the person's emotion has changed from joy to another emotion. However, it is also possible to consider that it is a new facial expression representing joy.
It is more difficult to detect a sign of change at the moment the change is occurring than to detect a change point by looking back after the change has occurred and persisted. This is because, while detecting a change, the decision must be made in a situation where there is no model of the new stage after the change; in the example above, this corresponds to the moment when the model of the next facial expression has not yet been created. Under such circumstances, a technique is needed that makes appropriate decisions using a model while efficiently learning that model from the observed data. In the future, such techniques can be expected to be applied to, for example, the detection of the signs of a disease.
The EBI can be regarded as introducing the effects of forgetting and learning into Bayesian inference through the action of exponential smoothing using α. With the introduction of the discounting rate α, the influence of much older data is weakened. Simultaneously, α represents the learning rate, and the learning process modifies the model for the hypothesis based on the observed data. In this framework, even with the same hypothesis, the content (model) changes over time; therefore, it is not possible to simply accumulate experiences from the past. For this reason, it is reasonable to include the effect of forgetting.
The EBI framework is very similar to the SDEM framework proposed by Yamanishi et al. [7]. However, there are some differences. For example, when the history is considered in updating the weights of the Gaussian mixture distribution, the methods differ in whether the posterior probability or the likelihood is used, and in whether the history is accumulated additively or multiplicatively. In the future, we would like to clarify the differences in effectiveness caused by these design differences through simulations of various tasks.
As a limitation, this simulation handled only one-dimensional distributions. In future work, we will extend our model to multidimensional distributions.