Bayesian optimization for computationally extensive probability distributions

An efficient method for finding a better maximizer of computationally extensive probability distributions is proposed on the basis of a Bayesian optimization technique. A key idea of the proposed method is to use extreme values of acquisition functions by Gaussian processes for the next training phase, which should be located near a local maximum or a global maximum of the probability distribution. Our Bayesian optimization technique is applied to the posterior distribution in the effective physical model estimation, which is a computationally extensive probability distribution. Even when the number of sampling points on the posterior distributions is fixed to be small, the Bayesian optimization provides a better maximizer of the posterior distributions in comparison to those by the random search method, the steepest descent method, or the Monte Carlo method. Furthermore, the Bayesian optimization improves the results efficiently by combining the steepest descent method and thus it is a powerful tool to search for a better maximizer of computationally extensive probability distributions.


Introduction
Bayesian optimization [1][2][3][4][5] has recently attracted much attention as a method to search the maximizer/minimizer of a black-box function in informatics and materials science [6][7][8][9][10][11][12]. In this method, the black-box function is interpolated by Gaussian processes. Then the interpolated function is used to predict the maximizer/minimizer of the black-box function. The Bayesian optimization is effective for problems where the value on the black-box function cannot be easily obtained. In other words, it is effective when the data for the black-box function is limited.
We are currently developing a generic effective physical model estimation method from experimentally measured data using machine learning, which relates to calibration in data science [13][14][15]. As the first example, we developed a method to estimate a set of model parameters x = (x 1 , ..., x K ) in the Hamiltonian H(x), where K is the number of model parameters [16]. Let y ex be the set of physical quantities {y ex (g l )} l=1,...,L depending on the external parameter g l with L being the number of data. By using Bayes' theorem, the posterior distribution P (x|y ex ), or the conditional probability of x given y ex is expressed as where P (x) and Z(y ex ) are the prior distributions of the model parameters and the normalization constant of the posterior distribution, respectively. Assuming that the observed noise follows a Gaussian distribution with a mean of zero and a standard deviation of σ, the likelihood function P (y ex |x) is given as where {y cal (g l , x)} l=1,··· ,L is the g l dependence of the physical quantity calculated from H(x), and hereinafter let y cal (x) be the set of {y cal (g l , x)} l=1,··· ,L . Then, the posterior distribution is expressed as where the "energy function" as a function of x is given by From the viewpoint of the maximum a posterior (MAP) estimation, the plausible model parameters for explaining y ex are obtained as the maximizer of Eq. (3) or the minimizer of Eq. (4). Thus, the most fundamental task for construction of an effective model is summarized to maximize Eq. (3) or minimize Eq. (4). A computational method to evaluate the posterior distribution or energy function consists of a double-loop calculation. In the inner loop, the physical quantities y cal (x) are calculated from H(x) when a set of model parameters is given. The computational cost of the inner loop depends on the simulation method used to calculate y cal (x). As discussed in Ref. 16, the steepest descent method is a promising way for this calculation when the input data is assumed to be explained by a simple classical Hamiltonian at the zero temperature. Evaluating y cal (x) for the given H(x), in general, requires statistical/quantum mechanical many-body calculations, such as the Markov-chain Monte Carlo (MCMC) method [17][18][19][20][21], the exact diagonalization method [22][23][24], and the density matrix renormalization group method [25][26][27]. All of which drastically increase the computational cost for the inner loop.
In the outer loop, a sampling of a model parameter x in the posterior distribution is performed. In Ref. 16, we used the MCMC method with the exchange Monte Carlo method [28]. Although this combined method efficiently yields the global maximum of the probability distribution even when many local maxima exist, an enormous number of sampling points is very time consuming. Consequently, the MCMC approach to calculate the outer loop of a complicated effective model estimation is one of the main obstacles for applications in material science.
In this paper, a computational method that estimates the effective model with a reduced outer loop computational cost is discussed on the basis of a Bayesian optimization for computationally extensive probability distributions. In the our Bayesian optimization technique, extreme values of acquisition functions obtained by Gaussian processes are used as candidates of maximizers of Eq. (3) or minimizers of Eq. (4). We investigate the efficiency of our Bayesian optimization technique to search the minimizer of E(x) defined by Eq. (4) relative to the random search method, the steepest descent method, and the Monte Carlo method when the number of sampling points is fixed to be small. In our demonstrations, the magnetization curve from the classical Ising model calculated by the mean-field approximation and the specific heat from the quantum Heisenberg model calculated by the exact diagonalization method are treated as the inputted measured data. Consequently, it is found that the Bayesian optimization is useful to search a better maximizer of the computationally extensive probability distribution.

Gaussian process
The Gaussian processes are a powerful machine learning technique to estimate unknown data from known data sets [29]. Here we consider the case when the given data set is {x n , E(x n )} n=1,...,N , where N is the number of data. In our case, x n is the set of model parameters in the effective physical model and E(x n ) denotes the value of the energy function E(x) defined by Eq. (4) on x n . Using Gaussian processes which are zero mean, the conditional probability P (E(x)|x) of E(x) given any x is written as the Gaussian distribution with a mean of µ(x) and a standard deviation of δ(x): where I N is the N -dimensional identity matrix. Furthermore, E, k, K, and c are defined as where k(x i , x j ) is the Gauss kernel function: Although the computational cost of Gaussian processes is O(N 3 ), some methods to reduce it including an approximation method and their efficiencies are currently under investigation [30,31]. In this formula, λ and γ are the hyperparameters, which should be specified prior to the analysis. While various methods have been proposed to determine the hyperparameters, we adopt the cross validation method for determination of the hyperparameters λ and γ, which are chosen so as to minimize the prediction error. In the cross validation, the data set D, that is, {x n , E(x n )} n=1,...,N is randomly divided into S data subsets. Each data subset is expressed by D s labeled by s = 1, ..., S. One of the S data subsets is regarded as the testing data, while the remaining S − 1 subsets are used as training data. For each data subset G s = D \ D s , Gaussian process training is performed when the training data are {x n , E(x n )} n∈Gs . The mean-square error between the testing data E(x n ) and the estimated µ(x n ) for n ∈ D s is evaluated. The cross validation regards the mean-square error as the prediction error when the testing data D s is treated as unknown data. The optimal values of λ and γ are evaluated to minimize the prediction error averaged over S data subsets.

Bayesian optimization for computationally extensive probability distributions
We introduce a Bayesian optimization technique to find a better minimizer for the energy function E(x) defined by Eq. (4), when the number of sampling points is limited. Our Bayesian optimization is comprised of the following procedure: Step 1: Sets of model parameters x n are randomly generated with n = 1, ..., P , and E(x n ) is calculated for the generated x n . That is, the P calculations of y cal (x n ) from H(x n ) are necessary.
Step 2: Gaussian process is trained for the data set {x n , E(x n )} n=1,...,P , yielding the mean value µ(x) and the standard deviation δ(x) of P (E(x)|x).
Step 3: The steepest descent method with randomly chosen initial parameters is performed for the three types of acquisition functions [32][33][34][35] defined as where κ > 0 and 0 < ǫ ≤ 1 are the hyperparameters. |X| is the size of the search space, and t is the step of repetition of BO. Furthermore, φ(Z) and Φ(Z) are the standard normal probability distribution function and its cumulative distribution function, respectively, and E min is the present minimum value of E(x). Then, a local or global minimum x * of acquisition functions is obtained and Q different model parameters are generated by repeating this operation. Note that the fixed value of ǫ as 0.5 is used in the analysis of this paper for simplicity.
Step 4: E(x * ) is calculated for each x * obtained in Step 3. By adding the new data, the data set is updated as {x n , E(x n )} n=1,...,P +Q . Here, the Q calculations of y cal (x n ) from H(x n ) are necessary.
Step 5: Steps 2-4 are repeated R times. In each iteration, the number of data points is increased by Q evaluation.
We emphasize that the number of calculations of y cal (x n ) from H(x n ) is N s = P + Q × R in this procedure, which corresponds to the number of sampling points on E(x). The computational cost in Step 3 is low because µ(x) and δ(x) are quickly obtained for a given x. Thus, many candidates for a local minimum or a global minimum of E(x) are generated from the acquisition functions without calculation of E(x), which is the key of our Bayesian optimization. Notice that an alternative approach has been proposed for optimizing a continuous function with an easily-calculable statistical function defined only on discrete grid points, in contrast to the our method [36].

Application for posterior distribution based on a classical Ising model
We demonstrate an application for posterior distribution in effective physical model estimation based on a classical Ising model in two dimensions. The model Hamiltonian of the classical Ising model under magnetic field H is defined by where J ij is the exchange interactions between the i-th spin and the j-th spin. Here, we consider three types of exchange interactions on the square lattice shown in Fig 1  (a). In this case, three different model parameters are to be estimated, that is, To discuss the efficiency of the proposed method for the effective model estimation, a synthesis magnetization curve {m ex (H l )} l=1,...,L is used as the input data generated by the same model of Eq. (15). By performing mean-field calculations for the four sublattice model, the magnetic field dependency of the magnetization is calculated with (x 1 , x 2 , x 3 ) = (−1.0, −0.5, 0.3) for a temperature T = 3.0. Here, the Boltzmann constant is set to unity and the physical energy unit is set to |J 1 |. Gaussian noise with a mean of zero and a standard deviation of 0.004 is added to the obtained magnetization curve . Fig 1 (b) shows the inputted magnetization curve {m ex (H l )} l=1,...,L where the number of data points is L = 200.
To estimate the effective model from {m ex (H l )} l=1,...,L , we search the maximizer of the posterior distribution, which is defined as where {m cal (H l , x)} l=1,...,L is the set of calculated magnetization curves from H C (x). In this demonstration, the mean-field calculations for the four sublattice model are used as the inner loop calculation method to obtain {m cal (H l , x)} l=1,...,L . Furthermore, instead of treating the posterior distribution itself, the minimizer of E C (x) is searched. For simplicity, the prior distribution of model parameters P (x) is assumed to be a uniform distribution; that is, P (x) = 1 which corresponds to the least square fitting, and then the factor 1/2σ 2 is set to be a constant without loss of generality.
The minimum values of E C (x) obtained by the random search method, the steepest descent method, the Monte Carlo method, and the our Bayesian optimization are compared, depending on the number of sampling points N s on E C (x). The details of each method are denoted below.
Random search method: A set of model parameters x n = (x 1 , x 2 , x 3 ) is randomly generated from the region where −5 ≤ x 1 , x 2 , x 3 ≤ 5. Then E C (x n ) is calculated. This procedure is repeated N s times, and the data set {x n , E C (x n )} n=1,...,Ns is obtained, from which the minimum value of E C (x) is searched.
Steepest descent method: An initial set of model parameters [i.e. x 1 = (x 1 , x 2 , x 3 )] is randomly generated from the region where −5 ≤ x 1 , x 2 , x 3 ≤ 5. A set of model parameters is updated N s /2 times by using the following equation from x n = (x 1 , ..., x k , ..., x K ) to Here, k is randomly chosen from k ∈ 1, ..., K where K = 3 in this case, and ∆x = α = 0.01. Notice that the calculation of E C (x) should be repeated twice in each update. Thus, when the number of updates is N s /2, the number of sampling points on E C (x) becomes N s . Using this update of the model parameters, E C (x) decreases for each update. From the obtained {x n , E C (x n )} n=1,...,Ns/2 , the minimum value of E C (x) is searched.
Monte Carlo method: An initial set of model parameters [i.e. x 1 = (x 1 , x 2 , x 3 )] is randomly generated from the region where −5 ≤ x 1 , x 2 , x 3 ≤ 5. A set of model parameters is updated N s times using the following Metropolis-type transition probability from x n to x n+1 : Here, the set of model parameters after updating is prepared as x n+1 = (x 1 , ..., x ′ k , ..., x K ) with x ′ k = x k + r from the set of model parameters before updating x n = (x 1 , ..., x k , ..., x K ), where k is randomly chosen from k ∈ 1, ..., K, and r is a random number between −1 and +1. From the obtained {x n , E C (x n )} n=1,...,Ns , the minimum value of E C (x) is searched.
Bayesian optimization: A set of model parameters x n is randomly generated from the region where −5 ≤ x 1 , x 2 , x 3 ≤ 5 and E C (x n ) is calculated. This procedure is repeated P = 200 times as the initial data set, and the Bayesian optimization is performed with Q = 10 and R = (N s − P )/Q. In the method, the steepest descent method in Step 3 is implemented by using the following equation from x = (x 1 , ..., x k , ..., x K ) to x ′ = (x 1 , ..., x ′ k , ..., x K ): 6/13 where f (x) expresses the acquisition functions defined by Eqs. (12), (13), and (14). Here, k is randomly chosen from k ∈ 1, ..., K, and ∆x = α = 0.01. f (x) is defined by Eq. (12), which is obtained from Gaussian process. In our calculation, the steepest descent method is performed with 100 updates to obtain the extreme value of f (x). From the obtained {x n , E C (x n )} n=1,...,Ns , the minimum value of E C (x) is searched .   Fig 2 (a) is the sampling number N s dependence of the averaged minimum value E av of E C (x) for 100 independent runs with each methods. The error bars are calculated from the standard deviation. The Bayesian optimization yields the smallest E av , indicating that the Bayesian optimization gives better minimizers of E C (x) even if N s is small. Furthermore, the most successful analysis is given by the Bayesian optimization using f (x) LCB with κ = 20, while the steepest descent method and the Monte Carlo method produce worse results than the random search method. These methods are frequently trapped at a local minimum depending on the initial set of model parameters, and eventually E av stays at large values.  Fig 3 (a) is the distribution of the estimated model parameters for 100 independent runs with various N s by the random search method and the Bayesian optimization. The black lines indicate exact solutions by which the input magnetization curve without Gaussian noise is generated, except for the case where any one of the parameters x k has zero. As N s increases, the results by the Bayesian optimization converge on the black lines, implying that the model parameters can be correctly estimated with a high probability. On the other hand, the case of the random search method shows no significant improvement with increasing N s . This could be 7/13 understood by noticing that the accuracy of the acquisition functions by Gaussian processes in the Bayesian optimization is improved with increasing the sampling points, namely N s , while the random search method does not refer to the prior sampling points. The Bayesian optimization as well as the random search method, in general, does not take into account local structure of the energy function such as gradient in the parameter space. To improve the solutions, we consider combinations of the steepest descent method with the random search method or the Bayesian optimization. One may expect that the steepest descent method produces a local minimum or a global minimum around the estimated model parameters by the random search method or the Bayesian optimization. That is, the estimated model parameters by the random search method or the Bayesian optimization are used as the initial set of model parameters in the steepest descent method, which is performed with 50 updates . Fig 2  (b) compares E av 's by the random search method, the Bayesian optimization using f LCB (x) with κ = 20, and those with the steepest descent method. The drastic improvement can be confirmed even for 50 updates in the steepest descent method. Note that if the number of updates in the steepest descent method is increased, the obtained E av should be improved. However, since the number of sampling is also increased, a trade-off between search for initial sets by the random search method or the Bayesian optimization and evaluation of local structures by the steepest descent method should be optimized. We confirmed for some cases that the Bayesian optimization with steepest descent method is the best among the considered methods. Fig 3 (b) shows the distribution of the estimated model parameters. For the Bayesian optimization with the steepest descent method, the real minimizer of E C (x) is found in all independent runs, while some of the obtained results by the random search method with the steepest descent method differ from the exact solutions, and these cases are trapped in local minima. The steepest descent method significantly improves the estimates by the Bayesian optimization and random search methods. The results imply that the Bayesian optimization combined with the steepest descent method is powerful tool to find the global minimum of E C (x).

Application for posterior distribution based on a quantum Heisenberg model
The case where the number of model parameters increases against the previous case is considered when a quantum Heisenberg model on the one-dimensional chain is used (Fig 4 (a)). The model Hamiltonian of the quantum Heisenberg model under magnetic field H is defined by where ∆ is the parameter for the anisotropy and (σ x i ,σ y i ,σ z i ) is the Pauli matrix. Here, the model parameters are x = (x 1 , x 2 , x 3 , x 4 , x 5 ) = (J 1 , J 2 , J 3 , ∆, H).  This demonstration uses the temperature dependence of the specific heat as an input data. The input specific heat {C ex (T l )} l=1,...,L is generated from the model defined by Eq. (24) as follows. By performing the exact diagonalization method, the temperature dependence of the thermal average of the specific heat for (x 1 , x 2 , x 3 , x 4 , x 5 ) = (1.0, 0.8, −0.2, −0.7, 0.3) is calculated. The Gaussian noise with a mean of zero and a standard deviation of 0.004 is added to the obtained specific heat. Fig 4 (b) shows the temperature dependence of the specific heat with L = 200, which is used as the input in the effective model estimation. As shown in the previous case, our task is to search for the minimizer of energy function E Q (x) defined as where {C cal (T l , x)} l=1,...,L is the set of calculated specific heat from H Q (x) by performing the exact diagonalization method.

9/13
We compared E av , which is the average of the minimum value of E Q (x) for 100 independent runs, for the random search method, the steepest descent method, the Monte Carlo method, and the Bayesian optimization (Fig 5 (a)). The setups of these methods are the same as the previous case except for the number of model parameters (K = 5) and the region in which a set of model parameters is randomly generated. In this case, we use −3 ≤ x 1 , x 2 , x 3 ≤ 3 and −2 ≤ x 4 , x 5 ≤ 2. The results are qualitatively the same as the previous case. The most successful analysis is produced by the Bayesian optimization using f (x) EI . This result is different from the previous demonstration, which means that an appropriate acquisition function depends on a target physical model and input physical quantities. Furthermore, as shown in Fig 5  (b), the combined steepest descent method improves the estimates of the Bayesian optimization and the random search method again. Similar to the previous case, the Bayesian optimization with the steepest descent method gives a better minimizer of E Q (x). Consequently, we conclude that the Bayesian optimization is useful to find a better maximizer of the posterior distribution in an effective model estimation with a small number of sampling points. (a) E av as a function of N s obtained from the random search method (red circles), the steepest descent method (yellow circles), the Monte Carlo method (green circles), and the Bayesian optimization (blue circles). (b) E av as a function of N s obtained from the random search method (RS) (red circles), the Bayesian optimization using f EI (x) (BO) (blue circles), the random search method with the steepest descent method (RS+SD) (red diamonds), and the Bayesian optimization with the steepest descent method (BO+SD) (blue diamonds). In the steepest descent method, 50 updates are performed after RS or BO.

Discussion
We searched for a better maximizer of a posterior distribution in the effective physical model estimation which is a computationally extensive probability distribution, using the Bayesian optimization. It is found for at least two simple models that the Bayesian optimization has a higher efficiency of finding a better maximizer of the posterior distribution compared to the random search method, the steepest descent method, and the Monte Carlo method when the number of sampling points on the posterior distribution is fixed to be small, while an appropriate acquisition function providing a high efficiency still depends on the problem to be solved. Our Bayesian optimization has some hyperparameters, i.e., P, Q, and R. Although we did not optimize these hyperparameters, the Bayesian optimization is a better method to obtain the maximizer of the posterior distribution. Particularly, since the value of Q is related to the batch/parallel problem of the Bayesian optimization [37,38], some improvement of the performance is expected by tuning Q. Furthermore, a combination of the Bayesian optimization and the steepest descent method drastically increases the efficiency of finding a better maximizer of the posterior distribution. The key of our Bayesian optimization is to predict a set of model parameters near a local maximum or a global maximum of the posterior distribution from the extreme values of acquisition functions by Gaussian processes, which requires a relatively low computational cost. Consequently, the model parameters near a global maximum can be found with a high probability. These facts suggest that the Bayesian optimization will be a powerful tool for effective model estimations. However, to find a maximizer of posterior distributions with various types of prior distributions and a large number of model parameters, the Bayesian optimization may be not always useful. Then in the future, we will evaluate effective model estimations using the Bayesian optimization for actual materials. Because the maximizer of a probability distribution is searched in many scientific fields, the Bayesian optimization will play an important role in the promotion of science.