
Simple modification of the upper confidence bound algorithm by generalized weighted averages

  • Nobuhito Manome ,

    Contributed equally to this work with: Nobuhito Manome, Shuji Shinohara, Ung-il Chung

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Validation, Visualization, Writing – original draft

    manome@bioeng.t.u-tokyo.ac.jp

    Affiliations Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan, Department of Research and Development, SoftBank Robotics Group Corp., Tokyo, Japan

  • Shuji Shinohara ,

    Contributed equally to this work with: Nobuhito Manome, Shuji Shinohara, Ung-il Chung

    Roles Conceptualization, Writing – review & editing

    Affiliations Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan, School of Science and Engineering, Tokyo Denki University, Saitama, Japan

  • Ung-il Chung

    Contributed equally to this work with: Nobuhito Manome, Shuji Shinohara, Ung-il Chung

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Bioengineering, Graduate School of Engineering, The University of Tokyo, Tokyo, Japan

Abstract

The multi-armed bandit (MAB) problem is a classical problem that models sequential decision-making under uncertainty in reinforcement learning. In this study, we propose a new generalized upper confidence bound (UCB) algorithm (GWA-UCB1) by extending UCB1, a representative algorithm for MAB problems, using generalized weighted averages, and present an algorithm that is effective across various problem settings. GWA-UCB1 is a two-parameter generalization of the balance between exploration and exploitation in UCB1 and can be implemented with a simple modification of the UCB1 formula. Therefore, this algorithm can be easily applied to UCB-based reinforcement learning models. In preliminary experiments, we investigated the optimal parameters of GWA-UCB1 and of a simple generalized UCB1 (G-UCB1) prepared for comparison, in a stochastic MAB problem with two arms. Subsequently, we confirmed the performance of the algorithms with the investigated parameters on stochastic MAB problems in which arm reward probabilities were sampled from uniform or normal distributions, and on survival MAB problems that assume more realistic situations. GWA-UCB1 outperformed G-UCB1, UCB1-Tuned, and Thompson sampling in most problem settings and can be useful in many situations. The code is available at https://github.com/manome/python-mab.

Introduction

The multi-armed bandit (MAB) problem refers to the problem of maximizing gain in a setting with multiple arms, where a reward can be obtained from the arms with a certain probability by choosing an arm [1]. This problem is considered the most fundamental in reinforcement learning because it involves a tradeoff between exploration for an arm with a high reward probability and exploitation to select an arm that is believed to have a high reward probability [2]. The MAB problem has a wide range of applications, including online advertising [3,4,5], recommendation systems [6,7], and games [8,9,10]. Algorithms for MAB problems are useful in many situations; however, it is often difficult to determine which algorithm to use. Typical algorithms for MAB problems include upper confidence bound (UCB) policies [11] and Thompson sampling [12]. Thompson sampling achieves empirically superior performance [13] and is known to outperform UCB policies within a finite number of trials [14]. These algorithms are popular because of their ease of implementation and guaranteed near-optimal theoretical performance [15,16]. Therefore, both UCB policies and Thompson sampling are frequently applied in reinforcement learning [17,18]. Two state-of-the-art algorithms, information-directed sampling (IDS) [19] and TS-UCB [20], are also based on them, but both require large amounts of computation.

In this study, we extend UCB1, the most representative of the UCB policies, using generalized weighted averages, and propose a new generalized UCB algorithm (GWA-UCB1) that can be executed at low computational cost. GWA-UCB1 is a two-parameter generalization of the balance between exploration and exploitation in UCB1 and can be implemented by simply modifying the UCB1 formula. UCB-based models, such as Q-learning with UCB exploration [21] and reinforcement learning models with Monte Carlo tree search [22,23,24], may therefore be easily extended. The contribution of this study is a set of parameters and an MAB algorithm that can be used effectively in various problem settings at low computational cost. However, it is important to note that the algorithm does not guarantee optimal regret from a theoretical perspective. The proposed two-parameter GWA-UCB1 outperformed Thompson sampling in many situations. Furthermore, by adjusting the parameters to the environment, it achieved performance close to that of IDS.

The remainder of this paper is organized as follows. First, we introduce a review of UCB policies. Next, we explain the stochastic MAB problem and survival MAB problem, which are frameworks of MAB problems. Finally, we describe the proposed GWA-UCB1, discuss the experiments conducted to demonstrate its performance, and present the results and utility of the proposed method.

Related works

To maximize the reward obtained in the MAB problem, it is necessary to strike a balance between exploring for the best arm and exploiting the knowledge obtained from exploration by selecting the arm believed to be the best. UCB policies are algorithms with a theoretically guaranteed upper bound on the expected loss [11], and they strike a good balance between exploration and exploitation. UCB1 is a representative algorithm; it first plays each arm once and then repeatedly selects the arm with the highest UCB score, as defined in Equation (1).

$$\bar{x}_i + \sqrt{\frac{2 \ln n}{n_i}} \qquad (1)$$

where $\bar{x}_i$ is the empirical mean reward of arm $i$, $n$ is the total number of selections over all arms, and $n_i$ is the number of selections of arm $i$. Because the UCB score increases as $n_i$ decreases, an arm with a small number of samples is more likely to be selected even if $\bar{x}_i$ is small. The formula for UCB1 is based on the theoretical bound of Hoeffding's inequality.
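As a concrete reference point for the extensions discussed below, Equation (1) can be sketched in a few lines of Python (a minimal illustration; the function and variable names are ours, not taken from the authors' repository):

```python
import numpy as np

def ucb1_scores(means, counts):
    """UCB1 score per arm (Eq. 1): empirical mean plus exploration bonus."""
    n = counts.sum()  # total number of selections over all arms
    return means + np.sqrt(2.0 * np.log(n) / counts)

# An arm with few pulls receives a large bonus even if its mean is low.
means = np.array([0.8, 0.2])
counts = np.array([100, 2])
scores = ucb1_scores(means, counts)  # the rarely pulled arm scores higher
```

The agent then pulls `np.argmax(scores)`, so under-sampled arms keep being revisited until their exploration bonus shrinks.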

UCB1-Tuned is an improved version of UCB1 that accounts for the variance of each arm's empirical rewards. The UCB score of UCB1-Tuned is calculated using Equation (2).

$$\bar{x}_i + \sqrt{\frac{\ln n}{n_i} \min\!\left(\frac{1}{4},\, V_i(n_i)\right)} \qquad (2)$$

$V_i(n_i)$ accounts for the variance of the reward and is calculated using Equation (3).

$$V_i(s) = \left(\frac{1}{s} \sum_{t=1}^{s} x_{i,t}^2\right) - \bar{x}_{i,s}^2 + \sqrt{\frac{2 \ln n}{s}} \qquad (3)$$

where $x_{i,t}$ is the reward obtained from arm $i$ at time $t$, and $\bar{x}_{i,s}$ is the empirical mean of those rewards after $s$ selections.
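Equations (2) and (3) can be sketched as follows (our naming; note that `numpy`'s population variance is exactly the mean-of-squares-minus-squared-mean term in Equation (3)):

```python
import numpy as np

def ucb1_tuned_score(rewards_i, n):
    """UCB1-Tuned score for one arm given its observed rewards (Eqs. 2-3)."""
    rewards_i = np.asarray(rewards_i, dtype=float)
    s = len(rewards_i)  # number of selections of this arm
    # V_i(s): empirical variance plus an exploration correction (Eq. 3)
    v = rewards_i.var() + np.sqrt(2.0 * np.log(n) / s)
    # The exploration width is capped by 1/4, the maximum variance of a
    # Bernoulli reward (Eq. 2).
    return rewards_i.mean() + np.sqrt((np.log(n) / s) * min(0.25, v))

score = ucb1_tuned_score([1, 0, 1, 1], n=20)
```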

Another extension of UCB is Bayes-UCB, which constructs upper confidence bounds from the quantiles of the posterior distribution [25,26]. The formula for UCB1 is based on Hoeffding's inequality, but the bound can also be calculated using the Chernoff-Hoeffding inequality, which provides a more precise upper bound on the probability; the resulting algorithm is known as KL-UCB because the bound is computed using the Kullback–Leibler (KL) divergence [27,28]. Additionally, KL-UCB+ [29] and KL-UCB++ [30] have been proposed as improved KL-UCB variants. These algorithms achieve high performance but are computationally intensive compared with UCB1. A recent study presented TS-UCB, which computes a score for each arm using both posterior samples and confidence bounds at each step [20]. In particular, TS-UCB guarantees optimal regret and achieves performance equal to or better than that of the state-of-the-art algorithm IDS [19]. However, IDS and TS-UCB both require a large amount of computation.

Although there are several generalizations of the UCB algorithm [31,32], the simplest is the UCT formula (Kocsis and Szepesvári 2006), which applies the UCB algorithm to Monte Carlo tree search. In this study, we refer to a simple one-parameter generalization of UCB1, similar to UCT, as generalized UCB1 (G-UCB1); its UCB score is calculated using Equation (4).

$$\bar{x}_i + c \sqrt{\frac{2 \ln n}{n_i}} \qquad (4)$$

where $c$ is a constant in the range $[0, \infty)$ that adjusts the weight of the exploration term. Writing the exploration coefficient as $c\sqrt{2}$ aligns Equation (4) with the UCT formula proposed by Kocsis and Szepesvári (2006). Increasing $c$ increases the amount of exploration, and Equation (4) agrees with Equation (1) of UCB1 when $c = 1$. Equation (4) balances exploration and exploitation with a single parameter; our idea is to extend this balance to two parameters using generalized weighted averages. The details are explained in a subsequent section.
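G-UCB1 is a one-line change to the UCB1 score. A sketch (our naming), showing that c = 0 reduces to pure exploitation while c = 1 recovers UCB1:

```python
import numpy as np

def g_ucb1_scores(means, counts, c=1.0):
    """G-UCB1 score (Eq. 4): UCB1 with the exploration term scaled by c >= 0."""
    n = counts.sum()
    return means + c * np.sqrt(2.0 * np.log(n) / counts)

means = np.array([0.6, 0.4])
counts = np.array([10, 10])
greedy = g_ucb1_scores(means, counts, c=0.0)   # pure exploitation
default = g_ucb1_scores(means, counts, c=1.0)  # identical to UCB1
```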

Problem setup

This section describes the stochastic MAB problem, which is the most common MAB problem, and the survival MAB problem, which involves the risk of ruins.

Stochastic Multi-Armed Bandit problem

The game flow for a typical MAB problem is as follows. Initially, the agent selects an arm $i$ from a set of possible arms based on its strategy. The reward obtained by selecting arm $i$ is then observed, and the next arm to select is determined sequentially.

The stochastic MAB problem is an MAB problem in which the rewards of the arms are assumed to be generated according to probability distributions, and the reward of arm $i$ is determined by the reward probability $p_i$ set for each arm. In this study, when the agent selects arm $i$, it receives reward $1$ with probability $p_i$ and reward $0$ with probability $1 - p_i$. The operation in which the agent selects and pulls an arm once is called a step. The algorithm is evaluated using the expected regret, which is the cumulative difference over all steps between the expected reward of the arm with the highest reward probability and that of the arm actually selected.
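The evaluation loop described above can be sketched as follows (a toy-scale illustration with Bernoulli 0/1 rewards; the names and the fixed seed are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def run_trial(probs, select_arm, steps):
    """Play one stochastic Bernoulli MAB trial; return the expected regret."""
    k = len(probs)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    regret = 0.0
    best = max(probs)
    for t in range(steps):
        i = t if t < k else select_arm(sums / counts, counts)  # play each arm once first
        counts[i] += 1
        sums[i] += rng.random() < probs[i]  # reward 1 w.p. probs[i], else 0
        regret += best - probs[i]           # expected (not realized) loss
    return regret

def ucb1(means, counts):
    return int(np.argmax(means + np.sqrt(2 * np.log(counts.sum()) / counts)))

regret = run_trial([0.9, 0.5], ucb1, steps=2000)
```

For a good policy this regret grows sublinearly in the number of steps, which is what the experiments below measure.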

Survival Multi-Armed Bandit problem

The survival MAB problem is a recent extension of the MAB problem in which agents must maintain a positive budget throughout the process [33]. This problem is similar to the budgeted MAB problem [34], but the true risk of ruin must be considered.

The game flow of the survival MAB problem follows the same structure as that of the stochastic MAB problem. The key difference is that the agent starts with an initial budget and must prevent it from reaching zero to avoid ruin. In this study, if the agent chooses arm $i$, it receives reward $+1$ with probability $p_i$ and reward $-1$ with probability $1 - p_i$. In other words, the budget $b$ changes to $b + 1$ or $b - 1$ depending on the observed reward. The algorithm is evaluated using the survival rate, which is the percentage of trials not ruined by each step, and the budget.

Solving the survival MAB problem involves managing the cost of exploratory behavior to avoid ruin while maximizing the budget. The dilemma between exploration and exploitation is therefore more complex and difficult to analyze than in the usual MAB problem; however, it is a more practical problem setting. For example, stock traders have finite amounts of money and must avoid bankruptcy, and all organisms in nature must avoid death while maximizing variables important to them. The purpose of this study was not to theoretically optimize regret but to present a heuristic algorithm that can be used effectively in many problem settings. Therefore, no theoretical analysis was performed in this study; however, a study [35] presents a Pareto-optimal policy for the probability of ruin in the survival MAB problem, and we refer the reader to that paper for theoretical considerations.
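As an illustration of the setup, the following sketch simulates one survival MAB trial under UCB1. The ±1 reward structure and the ruin-at-zero condition follow the description above; all names are ours:

```python
import numpy as np

rng = np.random.default_rng(1)

def survival_trial(probs, select_arm, steps, budget):
    """One survival MAB trial: reward +1 w.p. p_i, else -1; ruin at budget 0.

    Returns (survived, final_budget)."""
    k = len(probs)
    counts = np.zeros(k, dtype=int)
    sums = np.zeros(k)
    for t in range(steps):
        i = t if t < k else select_arm(sums / counts, counts)
        reward = 1 if rng.random() < probs[i] else -1
        counts[i] += 1
        sums[i] += reward
        budget += reward
        if budget <= 0:
            return False, 0  # ruined: the trial ends immediately
    return True, budget

def ucb1(means, counts):
    return int(np.argmax(means + np.sqrt(2 * np.log(counts.sum()) / counts)))

survived, final = survival_trial([0.55] + [0.45] * 7, ucb1, steps=5000, budget=50)
```

Note that the empirical means here lie in [-1, 1] rather than [0, 1]; the UCB score is still well defined, and the sketch only illustrates the game flow.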

Proposed method

As mentioned previously, a balance between exploration and exploitation is important for maximizing the rewards obtained in the MAB problem. Averages are often used to balance two variables, and in many cases the arithmetic average is used. In this study, we therefore extended the arithmetic mean and considered generalized weighted averages to balance the two variables. The generalized weighted average of two variables $x$ and $y$ is expressed in Equation (5).

$$\mu = \left( \alpha x^m + (1 - \alpha)\, y^m \right)^{\frac{1}{m}} \qquad (5)$$

where $\alpha$ takes values in the range $[0, 1]$ and weights the values of $x$ and $y$, $m$ takes values in the range $(-\infty, \infty)$ and determines the manner of taking the mean, and $\mu$ represents the average value calculated by the generalized weighting. If $\alpha = 0.5$ and $m = 1$, then $\mu$ is the arithmetic mean. If $\alpha = 0.5$ and $m = -1$, then $\mu$ is the harmonic mean. For $m = 0$, taking the limit $m \to 0$ via the Maclaurin expansion transforms Equation (5) into Equation (6).

$$\mu = x^{\alpha}\, y^{1 - \alpha} \qquad (6)$$

Thus, for $m \to 0$ and $\alpha = 0.5$, $\mu$ represents the geometric mean. Fig 1 shows an overview of the generalized weighted averages.
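Equations (5) and (6) can be written as a single Python function (a minimal sketch; the small-|m| cutoff for switching to the geometric-mean limit is our choice):

```python
def gwa(x, y, alpha=0.5, m=1.0):
    """Generalized weighted average of x, y > 0 (Eq. 5); m ~ 0 uses Eq. 6."""
    if abs(m) < 1e-12:  # limit m -> 0: weighted geometric mean
        return x ** alpha * y ** (1.0 - alpha)
    return (alpha * x ** m + (1.0 - alpha) * y ** m) ** (1.0 / m)

arithmetic = gwa(2.0, 8.0, m=1.0)   # (2 + 8) / 2
geometric = gwa(2.0, 8.0, m=0.0)    # sqrt(2 * 8)
harmonic = gwa(2.0, 8.0, m=-1.0)    # 1 / (0.5/2 + 0.5/8)
```

The three classical means emerge as special cases, and $\mu$ is monotonically nondecreasing in $m$.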

By applying the generalized weighted average of Equation (5) to Equation (1) of UCB1, the UCB score is extended as shown in Equation (7).

$$\mu_i = \left( \alpha\, \bar{x}_i^{\,m} + (1 - \alpha) \left( \sqrt{\frac{2 \ln n}{n_i}} \right)^{\!m} \right)^{\frac{1}{m}} \qquad (7)$$

If $\alpha = 0.5$ and $m = 1$, Equation (7) is consistent with Equation (1) of UCB1 (the score is halved, but arm selection is unchanged). We call the algorithm that uses the UCB score in Equation (7) GWA-UCB1. We explicitly call it GWA-UCB1 rather than GWA-UCB because the idea can also be applied to other algorithms, such as UCB1-Tuned; however, in this study, we limited the application of generalized weighted averages to UCB1 to validate the idea.
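A sketch of the GWA-UCB1 score (Eq. 7), including a check of the stated equivalence: with α = 0.5 and m = 1 the score is exactly half the UCB1 score, so the selected arm is identical (names are ours):

```python
import numpy as np

def gwa_ucb1_scores(means, counts, alpha, m):
    """GWA-UCB1 score (Eq. 7): generalized weighted average of the
    exploitation term (empirical mean) and the UCB1 exploration term."""
    explore = np.sqrt(2.0 * np.log(counts.sum()) / counts)
    if abs(m) < 1e-12:  # m -> 0: geometric-mean form (Eq. 6)
        return means ** alpha * explore ** (1.0 - alpha)
    return (alpha * means ** m + (1.0 - alpha) * explore ** m) ** (1.0 / m)

means = np.array([0.7, 0.3, 0.5])
counts = np.array([50, 5, 20])
ucb1 = means + np.sqrt(2.0 * np.log(counts.sum()) / counts)
scores = gwa_ucb1_scores(means, counts, alpha=0.5, m=1.0)
```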

Experiments

Preliminary experiments were first conducted on the stochastic MAB problem with $k = 2$ arms to determine a set of parameters for GWA-UCB1 that is valid in many problem settings. The optimal parameter of G-UCB1 was also investigated for comparison. Three experiments were then conducted using the parameter sets obtained from the preliminary experiments. These experiments compared G-UCB1 and GWA-UCB1, with the parameters obtained in the preliminary experiments, against UCB1, UCB1-Tuned, and Thompson sampling. The detailed setup of each experiment is as follows.

Preliminary experiment

In the preliminary experiment, the optimal parameters of G-UCB1 and GWA-UCB1 were investigated by conducting 1,000 trials of 10,000 steps each for a stochastic MAB problem with $k = 2$ arms. For G-UCB1, the value of parameter $c$ in Equation (4) was shifted in fixed increments over its search interval, and the average regret was calculated for each value. For GWA-UCB1, the values of parameters $\alpha$ and $m$ in Equation (7) were each shifted in fixed increments over their respective search intervals, and the average regret was calculated for each parameter pair.
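This sweep can be reproduced at toy scale with a plain grid search (illustrative only: the grid, trial count, and step count below are far smaller than in the paper, and all names are ours):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)

def trial_regret(alpha, m, steps=500):
    """One k = 2 Bernoulli MAB trial under GWA-UCB1; returns expected regret."""
    probs = rng.random(2)  # arm reward probabilities drawn from U[0, 1]
    counts, sums, regret = np.zeros(2, dtype=int), np.zeros(2), 0.0
    for t in range(steps):
        if t < 2:
            i = t  # play each arm once first
        else:
            means = sums / counts
            explore = np.sqrt(2 * np.log(counts.sum()) / counts)
            i = int(np.argmax((alpha * means ** m
                               + (1 - alpha) * explore ** m) ** (1 / m)))
        counts[i] += 1
        sums[i] += rng.random() < probs[i]
        regret += probs.max() - probs[i]
    return regret

# Coarse illustrative grid over (alpha, m); average regret over a few trials.
grid = list(itertools.product([0.5, 0.6, 0.7], [0.5, 1.0]))
avg = {(a, m): np.mean([trial_regret(a, m) for _ in range(20)]) for a, m in grid}
best_alpha, best_m = min(avg, key=avg.get)
```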

Experiment 1

In Experiment 1, the average regret was calculated over 100,000 simulation trials of 10,000 steps each for the stochastic MAB problem with $k = 2$, $8$, and $32$ arms. The reward probability of each arm was drawn from a uniform distribution on the interval $[0, 1]$ for each trial. This is the most commonly used problem setup for verifying an algorithm's performance on a stochastic MAB problem, and it confirms the standard performance of the proposed method.

Experiment 2

In Experiment 1, we confirmed the performance of the algorithms with a small number of arms. However, when the number of arms is very large, it is likely in actual situations that only a few arms have high reward probabilities. For example, although a Go player has many options in a given position, only a few of them lead to victory.

Therefore, in Experiment 2, the average regret was calculated over 10,000 simulation trials of 50,000 steps each for $k = 32$, $128$, and $512$ arms. The reward probability of each arm was drawn from a normal distribution with a mean of 0.5 and a standard deviation of 0.1 for each trial. This reproduced an environment in which the reward probability of most arms was around 0.5 and only some arms had high reward probabilities.

Experiment 3

In Experiment 3, the survival rate and average budget were calculated over 10,000 trials of 50,000-step simulations of the survival MAB problem with $k = 8$, $32$, and $128$ arms. The reward probability was set to 0.55 for one arm and 0.45 for the rest. This implies that the budget tends toward ruin unless the single best arm with reward probability 0.55 is found early in the process, which puts the agent at high risk of ruin. An initial budget was set for each number of arms $k = 8$, $32$, and $128$.

Results

Preliminary experiments identified the values of $c$ for G-UCB1 and of $\alpha$ and $m$ for GWA-UCB1 that gave the best results in the stochastic MAB problem with $k = 2$ arms. Fig 2 shows the results of the preliminary experiments and confirms the approximate regions of the best G-UCB1 and GWA-UCB1 parameters.

Fig 2. Preliminary experiment results.

The upper figure shows the average regret of G-UCB1, and the lower figure shows the average regret of GWA-UCB1. The average regret is the value after 1,000 trials of a 10,000-step simulation for a stochastic MAB problem with k = 2 arms. The reward probability for each arm was determined by a uniform distribution on the [0,1] interval for each trial.

https://doi.org/10.1371/journal.pone.0322757.g002

Fig 3 shows the results of Experiment 1. For $k = 2$ and $k = 8$ arms, GWA-UCB1 exhibited the best performance; for $k = 32$ arms, Thompson sampling performed best. Thus, when the reward probabilities of the arms were drawn from a uniform distribution, the performance gap between Thompson sampling and GWA-UCB1 narrowed as the number of arms increased, until Thompson sampling outperformed the other algorithms. However, for a small number of arms, at least up to $k = 8$, GWA-UCB1 exhibited excellent performance. GWA-UCB1 can therefore be used effectively to make the best choice from a small number of options, as in the A/B testing performed in internet marketing.

Fig 3. Experiment 1 results.

The figure shows the average regret when the number of arms is k = 2, 8, and 32, from top to bottom, respectively. The average regret is the value after 100,000 trials of a 10,000-step simulation for a stochastic MAB problem. The reward probability for each arm was determined by a uniform distribution on the [0,1] interval for each trial.

https://doi.org/10.1371/journal.pone.0322757.g003

Fig 4 shows the results of Experiment 2. In all cases, with $k = 32$, $128$, and $512$ arms, GWA-UCB1 exhibited the best performance. In other words, when the reward probabilities of the arms were drawn from a normal distribution with mean 0.5 and standard deviation 0.1, GWA-UCB1 showed superior performance regardless of the number of arms. Therefore, we believe that GWA-UCB1 can be used effectively in reinforcement learning tasks, such as Go and video games, where the number of correct options is small despite the large number of selectable actions.

Fig 4. Experiment 2 results.

The figure shows the average regret when the number of arms is k = 32, 128, and 512, from top to bottom, respectively. The average regret is the value after 10,000 trials of a 50,000-step simulation for a stochastic MAB problem. The reward probability for each arm was determined by a normal distribution with a mean of 0.5 and a standard deviation of 0.1 for each trial.

https://doi.org/10.1371/journal.pone.0322757.g004

Fig 5 shows the survival rate results for Experiment 3, and Fig 6 shows the average budget results. In all cases, with $k = 8$, $32$, and $128$ arms, GWA-UCB1 exhibited the best performance. It is worth noting that even Thompson sampling, which showed excellent performance in the stochastic MAB problem, had a low survival rate in this very difficult setting, where the reward probability was 0.45 for all arms except one arm with reward probability 0.55.

Fig 5. Experiment 3 results.

The figure shows the survival rate when the number of arms is k = 8, 32, and 128, from top to bottom, respectively. The survival rate is the value after 10,000 trials of a 50,000-step simulation for a survival MAB problem. The reward probability for each arm was 0.55 for only one arm and 0.45 for the remaining arms.

https://doi.org/10.1371/journal.pone.0322757.g005

Fig 6. Experiment 3 results.

The figure shows the average budgets when the number of arms is k = 8, 32, and 128, from top to bottom, respectively. The average budget is the value after 10,000 trials of a 50,000-step simulation for a survival MAB problem. The reward probability for each arm was 0.55 for only one arm and 0.45 for the remaining arms.

https://doi.org/10.1371/journal.pone.0322757.g006

By contrast, GWA-UCB1 maximized the budget more than the other algorithms while maintaining a high survival rate. Therefore, GWA-UCB1 can be effectively used in situations such as stock trading and casino gambling, where the budget is limited.

Discussion

The preliminary experiments identified practical parameter values of $\alpha$ and $m$ for GWA-UCB1. Parameter $\alpha$ adjusts the weights of exploration and exploitation and plays a role similar to that of parameter $c$ in G-UCB1. UCB1 corresponds to the case $\alpha = 0.5$ and $m = 1$. However, for GWA-UCB1 in a stochastic MAB problem with $k = 2$ arms, the optimal value of $\alpha$ exceeds 0.5, which indicates a tendency to place greater emphasis on exploitation than UCB1 does.

Parameter $m$ determines the manner of taking the mean. For clarity, consider Equation (5) with $\alpha = 0.5$ and $y = 1 - x$. When $m > 1$, $\mu$ is a convex function of $x$, and the values at both ends, $x = 0$ and $x = 1$, approach 1 as $m$ increases. When $m = 1$, $\mu = 0.5$ regardless of the value of $x$. When $0 < m < 1$, $\mu$ is a concave function of $x$, and the values at both ends approach 0 as $m$ decreases. When $m < 0$, $\mu$ approaches 0 as $x$ approaches 0 or 1. Fig 7 shows the values of the generalized weighted averages for $\alpha = 0.5$ and $y = 1 - x$. The parameter $m$ ranges over $(-\infty, \infty)$; as $m \to \infty$, $\mu$ yields the maximum of the two values, while as $m \to -\infty$, it yields the minimum. When $m$ is very large or very small, the averaging itself becomes less effective. Therefore, given that the proposed algorithm is based on the inherently robust UCB score, we can infer that a moderate value of $m$ was most effective in our experiments. In the standard UCB1, the upper bound on the regret can be evaluated using a concentration inequality. In GWA-UCB1, however, the exploration and exploitation terms are nonlinearly coupled, making it difficult to apply conventional methods directly. Specifically, the parameters $\alpha$ and $m$ influence both terms, creating a dependency that complicates the order analysis of the regret. Consequently, understanding the specific contribution of $m$ to the performance of GWA-UCB1 remains challenging, and a detailed theoretical analysis is left for future research.
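The limiting behavior of m described above is easy to verify numerically (a quick sketch; the specific value of x is arbitrary):

```python
def gwa(x, y, alpha=0.5, m=1.0):
    """Generalized weighted average (Eq. 5) for x, y > 0."""
    return (alpha * x ** m + (1 - alpha) * y ** m) ** (1 / m)

x = 0.3
near_max = gwa(x, 1 - x, m=50.0)   # -> max(x, 1 - x) = 0.7 as m -> +inf
near_min = gwa(x, 1 - x, m=-50.0)  # -> min(x, 1 - x) = 0.3 as m -> -inf
flat = gwa(x, 1 - x, m=1.0)        # arithmetic mean: always 0.5 when y = 1 - x
```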

Fig 7. Examples of μ values of generalized weighted averages with α = 0.5, y = 1 − x.

Values of parameters x and m in Equation (5) were each shifted by an increment of 0.01 in the intervals [0.00, 1.00] and [−2.00, 4.00], respectively, and the value of μ was calculated for each parameter pair.

https://doi.org/10.1371/journal.pone.0322757.g007

GWA-UCB1 showed superior performance not only in reward acquisition but also in terms of survival rate. This heuristic, which is simple and achieves good performance with a small amount of computation, therefore offers an important perspective for machine learning research. Shinohara et al. used generalized weighted averages to model how humans infer the strength of causal relationships between events [36]. Their model fits human data better than those of previous studies, which further suggests the usefulness of balancing two variables with generalized weighted averages.

The GWA-UCB1 parameters investigated herein are optimal for the stochastic MAB problem with $k = 2$ arms, but the parameters can also be adjusted to the specific environment. The case with $k = 10$ arms was investigated in the same manner as in the preliminary experiment, yielding environment-specific values of $\alpha$ and $m$. We then replicated the experiment of Russo and Van Roy [19] and compared the results of their study with those of the parameter-adjusted GWA-UCB1. Fig 8 presents the comparison between Russo and Van Roy's results and those of GWA-UCB1 for a stochastic MAB problem with $k = 10$ arms. GWA-UCB1 with the parameters tuned for $k = 10$ performed slightly worse than IDS, but came close. Moreover, even GWA-UCB1 with the default parameters from the preliminary experiment outperformed Thompson sampling and can be used effectively. Fig 9 shows the relationship between the average computation time per step and the average regret at the final step. The computations were performed on an Intel® Core™ i7-8700 CPU @ 3.20 GHz with 32 GB of RAM. Although the computation speed of GWA-UCB1 is slightly slower than that of UCB1, it is significantly faster than that of IDS, demonstrating its practicality in both performance and computational efficiency.

Fig 8. Comparative results of Russo and Van Roy's study [19] and GWA-UCB1.

The average regret is the value after 2,000 trials of a 1,000-step simulation for a stochastic MAB problem with k = 10 arms. The reward probability for each arm was determined by a uniform distribution on the [0,1] interval for each trial.

https://doi.org/10.1371/journal.pone.0322757.g008

Fig 9. Relationship between average regret and average computation time.

The average regret corresponds to the regret after 2,000 trials of a 1,000-step simulation for a stochastic MAB problem with k = 10 arms, as shown in Fig 8. The average computation time per step for each algorithm was calculated as the mean over 2,000 trials.

https://doi.org/10.1371/journal.pone.0322757.g009

We also address the cost of parameter tuning. Bayesian optimization (BO) is a well-established method for sequential optimization and is particularly effective for hyperparameter tuning [37,38,39]. Compared with grid search or random search, which do not account for previous evaluations, BO can be executed more efficiently [40]. We therefore employed BO to tune the parameters of GWA-UCB1 for the three experiments presented in this paper. The hyperparameter tuning aimed to minimize the average regret or maximize the average budget at the final step. In all experiments, fixed search ranges were used for $\alpha$ and $m$, and the value at the final step was averaged over repeated trials. Fig 10 illustrates the progress of hyperparameter tuning using BO for each experiment, and Fig 11 shows the tuning results. Fig 10 indicates that, in all but one setting of Experiment 1, the scores exceeded those achieved with the default parameters after approximately 50 iterations. Additionally, Fig 11 shows that the optimal parameters in all experiments fall within relatively narrow ranges of $\alpha$ and $m$. Bayesian optimization was conducted on an Intel® Core™ i7-8700 CPU @ 3.20 GHz with 32 GB of RAM. With 100 iterations, the shortest run time was 12.98 h (a setting of Experiment 1), and the longest was 99.75 h (a setting of Experiment 2). These results can serve as guidelines for future parameter exploration.

Fig 10. Progress of hyperparameter tuning using Bayesian optimization for each experiment.

The horizontal axis shows the number of iterations, and the vertical axis represents the average regret or average budget at the final step. The black dashed line indicates the target threshold, corresponding to the performance of GWA-UCB1 with the default parameters obtained in the preliminary experiment.

https://doi.org/10.1371/journal.pone.0322757.g010

Fig 11. Results of hyperparameter tuning using Bayesian optimization for each experiment.

The color gradient represents the average regret or average budget at the final step for each optimization iteration, displayed with linear interpolation. Black dots indicate values from parameter sets sampled during optimization, and the red dashed line denotes the final optimal parameter set.

https://doi.org/10.1371/journal.pone.0322757.g011

GWA-UCB1 can be implemented by replacing Equation (1) of UCB1 with Equation (7), maintaining a computation time close to that of UCB1. Furthermore, because this is a simple modification, it is easy to implement and can be applied to various UCB-based reinforcement learning models. This study only shows empirical results for GWA-UCB1; therefore, a theoretical analysis should be conducted in the future.

Conclusion

In this study, we extended UCB1 using generalized weighted averages and proposed a new generalized UCB algorithm, GWA-UCB1, which can be executed at low computational cost. GWA-UCB1 is a simple modification of the UCB1 formula that is easy to implement and can be extended to a variety of UCB-based reinforcement learning models. The results showed that GWA-UCB1, with the parameters identified in the preliminary experiments, achieved better performance than G-UCB1, UCB1-Tuned, and Thompson sampling in most problem settings and may be useful in many situations.

Three main challenges must be addressed in the future: (1) a theoretical analysis of GWA-UCB1, (2) integration of GWA-UCB1 into UCB-based reinforcement learning models, coupled with validation of its performance, and (3) application of generalized weighted averages to other machine learning tasks that must trade off two variables. In the MAB problem, for example, properly estimating the reward probability of an arm requires observing its rewards over a prolonged period; however, if an arm's reward probability is nonstationary, it must be evaluated over a short period. By applying generalized weighted averages to such a dilemma, it may be possible to construct an MAB algorithm that is effective even in nonstationary environments.

Acknowledgments

We thank Editage [http://www.editage.com] for English language editing.

References

  1. Robbins H. Some aspects of the sequential design of experiments. Bull Amer Math Soc. 1952;58:527–535.
  2. Sutton RS, Barto AG. Introduction to reinforcement learning, 135. Cambridge: MIT Press; 1998. p. 223–260.
  3. Xu M, Qin T, Liu TY. Estimation bias in multi-armed bandit algorithms for search advertising. Adv Neural Inf Process Syst; 2013. p. 26.
  4. Schwartz EM, Bradlow ET, Fader PS. Customer acquisition via display advertising using multi-armed bandit experiments. Marketing Science. 2017;36:500–522.
  5. Nuara A, Trovò F, Gatti N, Restelli M. A combinatorial-bandit algorithm for the online joint bid/budget optimization of pay-per-click advertising campaigns. AAAI. 2018;32.
  6. Elena G, Milos K, Eugene I. Survey of multi-armed bandit algorithms applied to recommendation systems. Int J Open Inf Technol. 2021;9:12–27.
  7. Silva N, Werneck H, Silva T, Pereira AC, Rocha L. Multi-armed bandits in recommendation systems: A survey of the state-of-the-art and future directions. Expert Syst Appl. 2022;197:116669.
  8. Kocsis L, Szepesvári C. Bandit based Monte-Carlo planning. In: European conference on machine learning. Berlin, Heidelberg: Springer; 2006. p. 282–293.
  9. Moraes R, Mariño J, Lelis L, Nascimento M. Action abstractions for combinatorial multi-armed bandit tree search. AIIDE. 2018;14:74–80.
  10. Świechowski M, Godlewski K, Sawicki B, Mańdziuk J. Monte Carlo tree search: A review of recent modifications and applications. Artif Intell Rev. 2023;56:2497–2562.
  11. Auer P, Cesa-Bianchi N, Fischer P. Finite-time analysis of the multiarmed bandit problem. Mach Learn. 2002;47:235–256.
  12. Thompson WR. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika. 1933;25:285–294.
  13. Chapelle O, Li L. An empirical evaluation of Thompson sampling. Adv Neural Inf Process Syst; 2011. p. 24.
  14. Kaufmann E, Korda N, Munos R. Thompson sampling: An asymptotically optimal finite-time analysis. In: International conference on algorithmic learning theory. Berlin, Heidelberg: Springer; 2012b. p. 199–213.
  15. Bubeck S, Liu CY. Prior-free and prior-dependent regret bounds for Thompson sampling. Adv Neural Inf Process Syst. 2013;26:1–9.
  16. Russo D, Van Roy B. An information-theoretic analysis of Thompson sampling. J Mach Learn Res. 2016;17:2442–2471.
  17. Auer P, Jaksch T, Ortner R. Near-optimal regret bounds for reinforcement learning. Adv Neural Inf Process Syst; 2008. p. 21.
  18. Osband I, Russo D, Van Roy B. (More) efficient reinforcement learning via posterior sampling. Adv Neural Inf Process Syst; 2013. p. 26.
  19. Russo D, Van Roy B. Learning to optimize via information-directed sampling. Oper Res. 2018;66:230–252.
  20. Baek J, Farias V. TS-UCB: Improving on Thompson sampling with little to no additional computation. In: International Conference on Artificial Intelligence and Statistics. PMLR; 2023. p. 11132–11148.
  21. Jin C, Allen-Zhu Z, Bubeck S, Jordan MI. Is Q-learning provably efficient? Adv Neural Inf Process Syst. 2018;31.
  22. Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, et al. Mastering the game of Go without human knowledge. Nature. 2017;550:354–359. pmid:29052630
  23. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science. 2018;362:1140–1144. pmid:30523106
  24. Schrittwieser J, Antonoglou I, Hubert T, Simonyan K, Sifre L, Schmitt S, et al. Mastering Atari, Go, chess and shogi by planning with a learned model. Nature. 2020;588:604–609. pmid:33361790
  25. Kaufmann E, Cappé O, Garivier A. On Bayesian upper confidence bounds for bandit problems. In: Artif Intell Stat; 2012a. p. 592–600.
  26. Kaufmann E. On Bayesian index policies for sequential resource allocation. Ann Statist. 2018;46:842–865.
  27. Garivier A, Cappé O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In: Proceedings of the 24th annual conference on learning theory. JMLR Workshop and Conference Proceedings; 2011. p. 359–376.
  28. Cappé O, Garivier A, Maillard O-A, Munos R, Stoltz G. Kullback–Leibler upper confidence bounds for optimal sequential allocation. Ann Statist. 2013;41:1516–1541.
  29. Garivier A, Lattimore T, Kaufmann E. On explore-then-commit strategies. Adv Neural Inf Process Syst. 2016;29.
  30. Ménard P, Garivier A. A minimax and asymptotically optimal algorithm for stochastic bandits. In: International Conference on Algorithmic Learning Theory. PMLR; 2017. p. 223–237.
  31. Liu G, Shi W, Zhang K. An upper confidence bound approach to estimating coherent risk measures. Winter Simulation Conference (WSC). 2019:914–925.
  32. Korkut M, Li A. Disposable linear bandits for online recommendations. AAAI. 2021;35:4172–4180.
  33. Perotto FS, Bourgais M, Silva BC, Vercouter L. Open problem: Risk of ruin in multiarmed bandits. In: Conference on Learning Theory. PMLR; 2019. p. 3194–3197.
  34. Tran-Thanh L, Chapman A, Munoz de Cote EM, Rogers A, Jennings NR. Epsilon-first policies for budget-limited multi-armed bandits. AAAI. 2010;24(1):1211–1216.
  35. Riou C, Honda J, Sugiyama M. The survival bandit problem. arXiv preprint arXiv:2206.03019. 2022.
  36. Shinohara S, Manome N, Suzuki K, Chung U-I, Takahashi T, Gunji P-Y, et al. Extended Bayesian inference incorporating symmetry bias. Biosystems. 2020;190:104104. pmid:32027940
  37. Feurer M, Klein A, Eggensperger K, Springenberg J, Blum M, Hutter F. Efficient and robust automated machine learning. Adv Neural Inf Process Syst. 2015;28.
  38. Wu J, Toscano-Palmerin S, Frazier PI, Wilson AG. Practical multi-fidelity Bayesian optimization for hyperparameter tuning. In: Proceedings of the Conference on Uncertainty in Artificial Intelligence. PMLR; 2020. p. 788–798.
  39. Victoria AH, Maragatham G. Automatic tuning of hyperparameters using Bayesian optimization. Evolving Syst. 2021;12(1):217–223.
  40. Bergstra J, Bengio Y. Random search for hyper-parameter optimization. J Mach Learn Res. 2012;13(2).