Optimizing the depth and the direction of prospective planning using information values

Evaluating the future consequences of actions is achievable by simulating a mental search tree into the future. Expanding deep trees, however, is computationally taxing. Therefore, machines and humans use a plan-until-habit scheme that simulates the environment up to a limited depth and then exploits habitual values as proxies for consequences that may arise in the future. Two outstanding questions in this scheme are "in which directions should the search tree be expanded?" and "when should the expansion stop?". Here we propose a principled solution to these questions based on a speed/accuracy tradeoff: deeper expansion in the appropriate directions leads to more accurate planning, but at the cost of slower decision-making. Our simulation results show how this algorithm expands the search tree effectively and efficiently in a grid-world environment. We further show that our algorithm can explain several behavioral patterns in animals and humans, namely the effect of time pressure on the depth of planning, the effect of reward magnitudes on the direction of planning, and the gradual shift from goal-directed to habitual behavior over the course of training. The algorithm also provides several predictions testable in animal/human experiments.

In the case of normally distributed value functions, it is possible to further expand the vur as defined in Eq M.18. Let us denote the indices of the best and the second-best strategies by α and β, respectively. That is,

α = arg max_i µ_i,    β = arg max_{i ≠ α} µ_i.

Then we can transform Eq M.18 into

vur(A_i | F) = E[(Q_i − µ_α)^+]  for i ≠ α,    vur(A_α | F) = E[(µ_β − Q_α)^+].

We also have Q_i ∼ N(µ_i, σ_i²), so we can use the properties of truncated normal distributions to get a closed-form solution. Let us first define

Δ_i = µ_α − µ_i  for i ≠ α,  and  Δ_α = µ_α − µ_β.

Then, with φ and Φ denoting the standard normal density and cumulative distribution function, we get

E[(Q_i − µ_α)^+] = σ_i φ(Δ_i/σ_i) − Δ_i Φ(−Δ_i/σ_i).

Finally, combining these, we obtain

vur(A_i | F) = σ_i φ(Δ_i/σ_i) − Δ_i Φ(−Δ_i/σ_i).

Higher variance means higher vur

Here, we show that vur(A_i | F) decreases monotonically as σ_i decreases. Let us focus on the second condition of vur as defined in Eq M.9. Differentiating the closed form above with respect to σ_i gives

∂ vur(A_i | F) / ∂σ_i = φ(Δ_i/σ_i) > 0,

so vur(A_i | F) is strictly increasing in σ_i and, equivalently, decreases monotonically as σ_i decreases. The same also holds for the first condition of vur, which we omit here.
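The closed form above can be checked numerically. The following is a minimal sketch (with illustrative parameter values, not taken from the paper) that compares it against a Monte Carlo estimate of E[(Q_i − µ_α)^+], and also illustrates that a larger σ_i yields a larger vur:

```python
import math
import random

def phi(z):
    # standard normal density
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    # standard normal cumulative distribution function
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def vur_closed_form(mu_i, sigma_i, mu_alpha):
    # closed-form E[(Q_i - mu_alpha)^+] for Q_i ~ N(mu_i, sigma_i^2)
    delta = mu_alpha - mu_i
    z = delta / sigma_i
    return sigma_i * phi(z) - delta * Phi(-z)

random.seed(1)
mu_i, sigma_i, mu_alpha = 0.0, 1.0, 0.5  # illustrative values
mc = sum(max(random.gauss(mu_i, sigma_i) - mu_alpha, 0.0)
         for _ in range(200_000)) / 200_000
print(vur_closed_form(mu_i, sigma_i, mu_alpha), mc)
```

The Monte Carlo estimate agrees with the closed form up to sampling noise, and doubling σ_i increases the value, consistent with the monotonicity argument above.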

Asymptotic complexity of vur-greedy
Here, we show that computing the vur-greedy policy requires a constant number of operations asymptotically in expectation, under some mild conditions. That is, as the search tree grows larger, the expected cost of finding the best strategy remains asymptotically constant. Assume the search tree has a constant branching factor b > 1 and a height (or depth) h, and that all the leaves are at the same level, i.e., there are b^h leaves in total. Note that, while computing the vur for all of the leaves, we need µ_α for b^h − 1 nodes and µ_β for 1 node. That is, if µ_α changes after an expansion, the vur values of all the leaves need to be recomputed. If not, then only one (or two) recomputations are needed: one for the expanded strategy, and one for µ_β if the expanded strategy turns out to have the second-highest mean.
Assume A is the strategy we consider expanding, and let A* denote the same strategy after expansion. Then we can calculate the expected number of recomputations after expanding A, as a function of the depth h,

E[#recomputations] = b^h P(µ_{A*} > µ_α) + const,

where the const term contains the terms that do not depend on h. Let x be the distance between µ_A and µ_α, i.e., x = µ_α − µ_A. Then we have

P(µ_{A*} > µ_α) = Φ(−x/σ_A).

σ_A will decrease as h increases because of the discounting factor γ^h. Assume σ_A = γ^h σ. Then we have

lim_{h→∞} b^h Φ(−x/(γ^h σ)).

The term inside the limit is positive, so we can take its logarithm,

log( b^h Φ(−x/(γ^h σ)) ) = h log b + log Φ(−x/(γ^h σ)). (21)

We can see that log Φ(−x/(γ^h σ)) behaves like −x²/(2 γ^{2h} σ²) for large h, which dominates the h log b term. Given we can exchange the limit with the logarithm, we see that the limit of Eq 21 is −∞, so the depth-dependent term vanishes. Therefore, we only need a constant number of operations on average to evaluate arg max_A vur(A|F) in the limit of the search tree depth.
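The vanishing of the depth-dependent term can be illustrated numerically. The sketch below uses arbitrary illustrative values for b, γ, x, and σ (not taken from the paper):

```python
import math

def Phi(z):
    # standard normal cumulative distribution function
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

# depth-dependent part of the expected number of recomputations:
# b^h * P(mu_{A*} > mu_alpha), with sigma_A = gamma^h * sigma
b, gamma, x, sigma = 2.0, 0.9, 1.0, 1.0  # illustrative values
terms = [b ** h * Phi(-x / (gamma ** h * sigma)) for h in range(1, 31)]
print(terms[0], terms[-1])  # the term vanishes as the depth h grows
```

Even though b^h grows exponentially, the Gaussian tail Φ(−x/(γ^h σ)) decays doubly exponentially in h, so the product goes to zero.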

On considering vur values independently
As mentioned in the main text, one possible limitation of our pruning proposal is that it might be better to search over combinations of possible expansions, rather than considering expansions in isolation. That is, to expand

arg max_{(i,j)} vur(A_i, A_j | F)   instead of   arg max_i vur(A_i | F).

Here, we give a partial argument for why this is likely not the case. If it could be shown that resolving the uncertainties of two (or more) strategies at the same time is more valuable than the value obtained by considering those strategies separately and sequentially, this would imply that action selection via Eq M.25 leaves some information value on the table. Let us consider a simple case, where we consider expanding two strategies A_i and A_j at once, which yields the joint value vur(A_i, A_j | F). We assume i ≠ α and j ≠ α to avoid complications caused by the piecewise nature of vur; however, the argument holds for the general case as well.

Is it possible that vur(A_i, A_j | F) exceeds vur(A_i | F) plus the expected information value of expanding A_j after expanding A_i? The answer is no, as we will show in the following argument.
Let us now define the conditional vur, vur(A_j | F, µ*_i), which assumes A_i has been expanded, yielding a more accurate estimate µ*_i of µ_i. However, since we would like to assess this conditional value prior to expanding A_i, the measure of our concern is its expectation over µ*_i, E_{µ*_i}[vur(A_j | F, µ*_i)]. Therefore,

vur(A_i, A_j | F) = vur(A_i | F) + E_{µ*_i}[vur(A_j | F, µ*_i)].

This can be seen as a chain rule for vur, and it implies that there is nothing to gain from expanding two strategies at once rather than expanding them sequentially. In fact, simultaneous expansion is in practice worse, since sequential expansion allows changing one's mind about which strategy is best to expand after each expansion.
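The chain rule holds even sample by sample, since (max(a, b) − c)^+ = (a − c)^+ + (b − max(a, c))^+ pointwise. A minimal Monte Carlo sketch (with arbitrary illustrative means and variances) makes this concrete:

```python
import random

def pos(v):
    # positive part
    return max(v, 0.0)

random.seed(0)
mu_alpha = 1.0                 # mean of the current best strategy
mu_i, sigma_i = 0.5, 1.0       # strategy A_i (illustrative)
mu_j, sigma_j = 0.2, 1.5       # strategy A_j (illustrative)

n = 100_000
simultaneous = sequential = 0.0
for _ in range(n):
    x_i = random.gauss(mu_i, sigma_i)  # resolved value of A_i
    x_j = random.gauss(mu_j, sigma_j)  # resolved value of A_j
    # expanding both at once: improvement over the current best mean
    simultaneous += pos(max(x_i, x_j) - mu_alpha)
    # sequentially: gain from A_i, then gain of A_j over the updated best
    sequential += pos(x_i - mu_alpha) + pos(x_j - max(mu_alpha, x_i))
print(abs(simultaneous - sequential) / n)  # ~0 up to float rounding
```

The two accumulators agree on every sample (up to floating-point rounding), confirming that simultaneous expansion yields no extra information value over sequential expansion.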

Conjugate normal prior
In both of our experiments, we use conjugate normal priors to estimate the agent's Q-value distributions. We assume the true state-action values, Q‡, are distributed according to a known normal distribution. Formally, Q‡(s, a) ∼ N(µ_0, σ_0²) i.i.d. for all ⟨s, a⟩, with known prior hyperparameters µ_0 and σ_0. The agent would like to estimate Q‡, but she can only observe imperfect samples of it, in our case in the form of cumulative trajectory returns, R_sa ∼ N(Q‡(s, a), σ²) i.i.d., where R_sa is the cumulative discounted reward of a trajectory. Instead of assuming a fixed σ, we use the empirical standard deviation of the returns (i.e., of the R_sa's). One can easily show that this is equivalent to using a Normal-Gamma prior (as in [1]) and choosing the σ with the highest likelihood in the limit of the sample size. Then the Q(s, a) used in our formulation is essentially the posterior over Q‡(s, a) conditioned on a sequence of trajectory returns.
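A minimal sketch of this plug-in posterior update, with hypothetical return values and prior hyperparameters chosen for illustration:

```python
import statistics

def normal_posterior(returns, mu0, sigma0):
    # conjugate normal update, with the observation noise sigma replaced
    # by the empirical standard deviation of the returns (plug-in estimate)
    n = len(returns)
    sigma = statistics.stdev(returns)
    precision = 1.0 / sigma0 ** 2 + n / sigma ** 2
    mu_n = (mu0 / sigma0 ** 2 + sum(returns) / sigma ** 2) / precision
    return mu_n, precision ** -0.5  # posterior mean and standard deviation

returns = [1.2, 0.8, 1.1, 0.9, 1.0]  # hypothetical trajectory returns
mu_n, sigma_n = normal_posterior(returns, mu0=0.0, sigma0=10.0)
print(mu_n, sigma_n)  # posterior mean shrinks toward the sample mean
```

With a broad prior, the posterior mean lands close to the sample mean of the returns, and the posterior standard deviation shrinks as more trajectories are observed.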