Variational quantum classifiers through the lens of the Hessian

In quantum computing, variational quantum algorithms (VQAs) are well suited for finding optimal parameter combinations in applications ranging from chemistry to finance. Training VQAs with gradient-descent optimizers has shown good convergence. At this early stage, the simulation of variational quantum circuits on noisy intermediate-scale quantum (NISQ) devices suffers from noisy outputs. Just like classical deep learning, it also suffers from the vanishing gradient problem. It is therefore a realistic goal to study the topology of the loss landscape and to visualize the curvature information and trainability of these circuits in the presence of vanishing gradients. In this paper, we calculate the Hessian and visualize the loss landscape of variational quantum classifiers at different points in parameter space. The curvature information of variational quantum classifiers (VQCs) is interpreted, and the convergence of the loss function is shown. This helps us better understand the behavior of variational quantum circuits so that optimization problems can be tackled efficiently. We investigated variational quantum classifiers via the Hessian on quantum computers, starting with a simple 4-bit parity problem to gain insight into the practical behavior of the Hessian, and then thoroughly analyzed the behavior of the Hessian's eigenvalues while training a variational quantum classifier on the Diabetes dataset. Finally, we show how an adaptive Hessian learning rate can influence convergence while training variational circuits.


Introduction & motivation
In recent years, the enhancement of machine learning algorithms by noisy intermediate-scale quantum (NISQ) technology, and mainly by variational quantum circuits, has garnered significant attention among academic and research communities [1]. Researchers have applied variational quantum algorithms to various applications in the NISQ era, mainly those related to quantum artificial intelligence. Variational quantum algorithms (VQAs) have shown great learning capability to counterbalance the errors in the device framework. They are considered to be the greatest hope on the journey toward quantum advantage. The first variational quantum eigensolver (VQE) was proposed as a state ansatz to determine the ground state energy of physical systems [2]. Since the first VQE was introduced, several VQE variants have been proposed with a plethora of alterations for the computation of excited states, such as orthogonality constrained VQE [3], subspace approach VQE [4,5], adiabatically assisted VQE [6] and multistate contracted VQE [7]. These variational quantum circuits are constructed with an ansatz whose parameters are trained with several optimization methods to minimize the cost. For combinatorial optimization tasks, the quantum approximate optimization algorithm (QAOA) was originally proposed to attain approximate solutions [8]. These architectures of variational quantum circuits have been shown to be computationally universal [9,10].
Recently, the use of VQAs has received an incredible response and is widely applied in quantum machine learning applications. VQAs consist of a small number of qubits and shallow quantum circuits, which makes them resistant to noise. VQAs are effective for classification tasks in machine learning; the objective is to train a classifier that accurately predicts the label of each input [11-13]. Suppose training data {(x_i, y_i)} is given, where the x_i's and y_i's are inputs and labels, respectively. The variational quantum circuit is used as a black box to predict the right label y_i for each input after embedding the classical data into quantum states. VQAs are the quantum variants of neural networks, the most commonly used and highly successful machine learning model. To date, several architectures for quantum neural networks (QNNs) have been proposed and applied in different areas [14-17]. Recently, Cong et al. [18] introduced quantum convolutional neural networks (QCNNs) and used them to discriminate quantum states of distinct topological phases. Bhatia et al. [19] performed the simulation of several entangled states on a quantum computer. Romero et al. [20] proposed a quantum variational autoencoder for compressing quantum data efficiently. Pepper et al. [21] demonstrated the experimental realization of a proposed quantum autoencoder, which will likely be an essential primitive in quantum machine learning. In recent years, generative adversarial networks (GANs) have been an exciting topic of research in classical machine learning. Romero and Aspuru-Guzik [22] introduced a quantum variant of the GAN for learning continuous distributions and for speeding up classical GANs using quantum systems. Kappen [23] proposed a method to represent a classical data distribution in a quantum system.
Moreover, the potential of VQAs can be evaluated in several industry-based applications such as supply chain management, intelligent healthcare, smart agriculture, manufacturing production, and cloud manufacturing [24].
The major downside of variational quantum circuits is the occurrence of barren plateaus, in which the gradients of the cost function vanish exponentially as the number of qubits increases [25,26]. A barren plateau appears as a flat region of the loss function [27]. In 2018, McClean et al. [28] first studied the barren plateau phenomenon numerically for a class of random quantum circuits. It was shown that the variance of the gradients of modest-size quantum circuits vanishes exponentially. Cerezo et al. [29] described the barren plateau phenomenon as a point in the parameter space and investigated it for shallow quantum neural networks (QNNs). It was proven that gradients of global cost functions vanish exponentially at all depths, whereas local cost functions exhibit non-vanishing gradients for short-depth quantum circuits. Grant et al. [30] investigated the barren plateau problem of parametrized quantum circuits in the energy landscape. Their approach is based on initializing the parameters so as to avoid barren plateaus: initially, some parameter values are selected randomly, and the remaining values are chosen so that the circuit becomes a chain of shallow circuits. The depth of the circuits is used to ensure that the first parameter update is not lodged in a barren plateau during training. Recently, it has been shown that the barren plateau is absent in QNNs and QCNNs with tree tensor network (TTN) architecture [31,32].
In classical machine learning, the loss landscapes of neural networks and their characterization through the lens of the Hessian have been well investigated. Several works have explored the flatness of local minima using the eigenvalues of the Hessian [32-36]. During training, the generalization ability of a neural network depends on the batch size. It has been shown that small minibatch sizes tend to favor flat minima basins of the loss function, with several eigenvalues of the Hessian λ_j = 0 and hardly any λ_j < 0 [37]. Perez-Salinas et al. attained fast convergence for some quantum classification problems using a Hessian-based optimization method called batched optimization. Rebentrost et al. [38] proposed quantum variants of two popular iterative optimization algorithms, Newton's method and gradient descent. At time t, numerous copies of a quantum state |ψ⟩_t are used to generate numerous copies of the quantum state |ψ⟩_{t+1} by using the Hessian of an objective function together with a gradient vector. Recently, Huembeli and Dauphin [39] calculated the Hessian of a loss function and characterized the loss landscape of variational quantum circuits. They showed that the Hessian helps escape flat regions of the loss function for certain data-driven variational quantum circuits.
VQCs have been extensively employed in a wide array of new applications. Fig 1 shows the schematic representation of a variational quantum circuit. It involves evaluating a cost function or its gradient on a quantum system [40,41]. A classical optimization loop trains the parameters (θ) of a variational quantum circuit V(θ) to reduce the cost. It is well known that feature map encodings in variational parameterized circuit architectures produce loss functions that train more easily, and that a well-selected optimizer yields trained parameters that generalize well. Nevertheless, the effects of the parameters on the entire loss landscape are not well studied and have not received the attention they deserve. The occurrence of barren plateaus could eliminate the quantum advantage of a parameterized quantum circuit [9]. The visualization and understanding of the loss landscape of classical machine learning algorithms remains a vital and highly active area of research. Due to issues of computational complexity, the structure of the loss landscape of variational quantum classifiers is not as well visualized and understood as that of classical neural networks. Better visualization can drive the advancement of optimization algorithms and can highlight the shortcomings of quantum circuit designs. Hence, it is a natural goal to study the loss landscape of variational quantum classifiers with the eigenvalues of the Hessian to recognize when quantum speedup is achievable. The following contributions are claimed:
• Visualized the loss landscape of variational quantum classifiers at different points in parameter space using Hessian matrices.
• Analyzed the behavior of Hessian's eigenvalues on training the variational quantum classifier for different datasets.
• Investigated how the adaptive Hessian learning rate can influence convergence while training the variational circuits.
• Observed that an adaptive Hessian learning rate can help the cost overshoot when it gets stuck in a local minimum and thus converge quickly.
The eigenvectors and eigenvalues of the Hessian present a clear interpretation of the loss landscape of a VQC. We started with a 4-bit parity problem to provide insight into the behavior of the Hessian and then studied a VQC trained on the classical diabetes data, acting as a classifier. The organization of the rest of this paper is as follows: Sect. 2 is devoted to preliminaries and the computation of the Hessian of a variational quantum classifier. In Sect. 3, we compute the Hessian on a quantum simulator and visualize the curvature information of a parity function. In Sect. 3 (B), we characterize the loss landscape (i.e., curvature information) of data-driven variational quantum classifiers via the Hessian of the loss function, and the experimental results are plotted for the diabetes dataset. In Sect. 4, we show how the adaptive Hessian learning rate can help the cost overshoot, which helps it avoid getting stuck in local minima during the training of variational circuits. Finally, Sect. 5 concludes the paper.

Preliminaries
In this section, some basic concepts of loss function visualization with the Hessian matrix are given. Here, we give the background required to understand our results. Consider a real-valued function f(θ) = f(θ_1, θ_2, . . ., θ_n) with θ = (θ_1, θ_2, . . ., θ_n). The Hessian matrix ∇²f(θ) of f(θ) is the square matrix of the second derivatives of a real-valued function of n variables, which helps us to characterize the loss landscape (i.e., curvature information) [42,43]. The gradient ∇f(θ) collects the partial derivatives of the function, and the Hessian operator ∇²f(θ) collects the partial derivatives of the gradient [44]:

∇f(θ) = (∂f/∂θ_1, . . ., ∂f/∂θ_n),   (∇²f(θ))_{ij} = ∂²f(θ) / ∂θ_i ∂θ_j,   for i, j = 1, . . ., n.
If the second derivatives are continuous, then the Hessian is symmetric. The Hessian matrix encodes the geometric information of the function. Its eigenvalues (λ_1, λ_2, . . ., λ_n) are real and are used to extract this curvature information, so the essential thing to examine is whether the eigenvalues λ_i are negative or positive. If x_i is an eigenvector associated with λ_i, then (λ_i, x_i) is the i-th eigenpair of H. Using the Hessian matrix, we can determine whether the curvature at a certain point θ on the surface is locally positive or negative; the eigenvectors of the Hessian give the principal directions of curvature. Suppose the Hessian evaluated at a given point θ has all positive eigenvalues, λ_i > 0 (i.e., it is a positive-definite matrix). In that case, the curvature is locally positive, and θ is a local minimum of f. Similarly, if all the eigenvalues are negative, the curvature is locally negative, and θ is a local maximum of f. Zero eigenvalues indicate zero curvature of the function, i.e., flat directions. If the eigenvalues are mixed (some positive, some negative), then θ is a saddle point of f. Thus, the Hessian can be used to determine the convexity and concavity of a function [45].
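As an illustration of this eigenvalue test, here is a minimal NumPy sketch; the function name and tolerance are ours, not from the paper:

```python
import numpy as np

def classify_stationary_point(hessian, tol=1e-8):
    """Classify a stationary point from the eigenvalues of the
    (symmetric) Hessian evaluated there."""
    eigvals = np.linalg.eigvalsh(hessian)  # real eigenvalues for symmetric H
    has_pos = bool(np.any(eigvals > tol))
    has_neg = bool(np.any(eigvals < -tol))
    if has_pos and has_neg:
        return "saddle point"
    if has_pos:
        return "local minimum"
    if has_neg:
        return "local maximum"
    return "flat (zero curvature)"

# f(x, y) = x^2 - y^2 has Hessian diag(2, -2) at the origin: a saddle point.
print(classify_stationary_point(np.diag([2.0, -2.0])))  # prints "saddle point"
```

The tolerance treats near-zero eigenvalues as flat directions, which matters in practice because eigenvalues estimated from noisy circuit evaluations are rarely exactly zero.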

Hessian computation of VQC
In classical machine learning, neural networks are trained over a dataset consisting of feature vectors {x_j} and labels {y_j} by minimizing the cost function

L(θ) = (1/n) Σ_{j=1}^{n} c(l(θ, x_j), y_j),

where l(θ, x_j) is the prediction parameterized by the weights (θ) of the neural network, c(·) denotes the loss function that measures how well the network predicts the label by computing its difference from the network's prediction, and n denotes the number of data samples. The loss functions live in a high-dimensional space due to the presence of many parameters in neural networks; therefore, direct visualization is not possible in the higher-dimensional space. Analogously, the loss landscape of the VQC has not been extensively examined compared to that of classical neural networks. In the quantum layer, the classical data x is encoded into the quantum state |φ(θ, x)⟩ using a feature map consisting of parameterized quantum gates. In the classical layer, the parameterized function f(θ, x) is evaluated, depending on the learning parameters (θ) in the variational quantum circuit, by performing a measurement. The tuning of the parameters (θ) is executed by minimizing a loss function on a classical computer. The gradient of a quantum circuit can be evaluated by estimating the expectation value of an observable with respect to θ. The circuit consists of a series of unitary transformations. The gradient of the expectation value of an observable ⟨f(θ, x)⟩ = ⟨x|V†(θ) f V(θ)|x⟩ can be given by the parameter-shift rule

∂⟨f(θ, x)⟩/∂θ_j = (1/2) [⟨f(θ_j + π/2, θ_{∖j}, x)⟩ − ⟨f(θ_j − π/2, θ_{∖j}, x)⟩],

where θ_{∖j} denotes all parameters except θ_j, V(θ) is a product of unitary matrices, and θ_j is an angle parameterizing V(θ) = e^{−iθ_j P_j / 2}, where P_j is a Hermitian operator with eigenvalues ±1.
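The parameter-shift gradient above can be sketched numerically; the toy expectation value cos(θ_0)cos(θ_1) stands in for a real circuit and is an assumption for illustration (for such trigonometric expectation values the rule is exact, not a finite-difference approximation):

```python
import numpy as np

def parameter_shift_gradient(f, theta, shift=np.pi / 2):
    """Gradient of an expectation value f(theta) via the parameter-shift
    rule, valid for gates of the form exp(-i * theta_j * P_j / 2) with P_j
    having eigenvalues +-1, as assumed in the text."""
    theta = np.asarray(theta, dtype=float)
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        e = np.zeros_like(theta)
        e[j] = shift
        grad[j] = (f(theta + e) - f(theta - e)) / 2.0
    return grad

# Toy expectation value <Z> = cos(theta_0) * cos(theta_1).
f = lambda t: np.cos(t[0]) * np.cos(t[1])
g = parameter_shift_gradient(f, [0.3, 0.7])
```

Unlike a numerical finite difference, each shifted term is itself an expectation value, so on hardware it is estimated from measurement shots rather than computed analytically.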
Let us now define the Hessian matrix elements. The Hessian H of a quantum circuit can be computed by applying the parameter-shift rule twice [39,46-48]. With shift vectors s_j = (π/2) e_j along the j-th parameter axis, the matrix elements read

H_{jk} = (1/4) [⟨f(θ + s_j + s_k)⟩ − ⟨f(θ + s_j − s_k)⟩ − ⟨f(θ − s_j + s_k)⟩ + ⟨f(θ − s_j − s_k)⟩].
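A minimal sketch of this double parameter-shift Hessian, again using a toy trigonometric expectation value in place of a circuit (the function names are ours):

```python
import numpy as np

def parameter_shift_hessian(f, theta, shift=np.pi / 2):
    """Hessian of an expectation value f(theta), obtained by applying the
    parameter-shift rule twice (valid for gates generated by Hermitian
    operators with eigenvalues +-1, as in the text)."""
    theta = np.asarray(theta, dtype=float)
    n = theta.size
    H = np.zeros((n, n))
    for j in range(n):
        for k in range(n):
            sj = np.zeros(n); sj[j] = shift
            sk = np.zeros(n); sk[k] = shift
            H[j, k] = (f(theta + sj + sk) - f(theta + sj - sk)
                       - f(theta - sj + sk) + f(theta - sj - sk)) / 4.0
    return H

# Toy expectation value standing in for a circuit: <Z> = cos(t0) * cos(t1).
H = parameter_shift_hessian(lambda t: np.cos(t[0]) * np.cos(t[1]), [0.3, 0.7])
```

For this toy model the four-term formula reproduces the analytic second derivatives exactly; on hardware each of the O(n²) terms is an expectation value estimated from shots, which is what makes full Hessian evaluation expensive for large circuits.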

Experiment settings
We used the PyTorch library [50] and the PennyLane package [51] for developing and training the variational quantum classifiers. The code is written in PennyLane with a large number of parameters to expedite the experiments, and the implementation uses PyTorch to accelerate the simulation via algebraic manipulation. All quantum simulations are performed using the Python framework of the PennyLane platform for quantum differentiable programming [51]. The 2D and 3D graphs are plotted using Plotly, a graphing library available in Python.

Warm-up example
In this section, we characterize the loss landscape curvature of the four-bit parity problem with the Hessian. We begin exploring the concept of the Hessian of loss functions of VQCs with a warm-up activity of solving a four-bit parity problem. Before illustrating the variational quantum classifier, the parity function is defined as a Boolean function whose output is 1 if and only if the input vector has an odd number of ones; for two inputs, it is the XOR function. The n-bit parity function is given as [52]

f : v ∈ {0, 1}^n → {0, 1},   f(v) = v_1 ⊕ v_2 ⊕ . . . ⊕ v_n.

The first step is to encode the input vectors into a quantum state. In the warm-up example, the inputs are 4-bit strings that are encoded into the state of the qubits. The feature mapping and variational circuit of the four-bit parity problem are shown in Fig 2. A single-qubit rotation around the x-axis is defined as R_x(θ) = e^{−iθX/2}. The initial layer of R_x gates prepares the initial state and is also known as the feature map circuit. It is followed by three layers of variational gates, where each layer consists of rotation gates with three trainable parameters on each of the four qubits, followed by a set of CNOT gates. It is to be noted that the combination of CNOT gates in each layer has a different structure. The variational circuit consists of 36 parameterized gates and 12 non-parameterized gates.
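The parity labels for the warm-up dataset can be generated directly:

```python
from itertools import product

def parity(bits):
    """n-bit parity: returns 1 iff the bit string has an odd number of ones."""
    return sum(bits) % 2

# The full 4-bit parity dataset used as the warm-up classification task:
# all 16 input vectors v in {0, 1}^4 with their labels f(v).
dataset = [(v, parity(v)) for v in product([0, 1], repeat=4)]
```

Half of the 16 inputs carry each label, so the warm-up task is perfectly balanced.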
The input data ṽ is encoded into a quantum state using the feature map function φ : ṽ → |φ(ṽ)⟩⟨φ(ṽ)|. Initially, single-qubit rotations are performed around the x-axis. Each is followed by a U_1 single-qubit gate to apply a quantum phase to the qubit. Furthermore, a controlled-NOT multi-qubit gate is applied to flip the target qubit when the control qubit is in |1⟩. The purpose of the feature map circuit is to map the classical input data into a quantum state.
The final n-qubit feature quantum state becomes |φ(ṽ)⟩. Second, a short-depth quantum circuit V(θ) is applied to the feature state. It depends upon the chosen parameterization of the gates and the number of layers d. The classical optimizer adjusts the parameters during training to reduce the value of a loss function. Before returning the final classifier outcome, classical post-processing is applied to the expectation value of the circuit. The aim is to determine the optimal classifying circuit V(θ) that separates the dataset into its distinct labels. In the variational quantum circuit, we used the parameterized gates R_y(θ) and R_z(θ), which rotate the qubits by an angle θ around the y-axis and z-axis, respectively, followed by CNOT gates. The objective is to find a sequence of gates that forms a final state |ψ′⟩. A cost function C_f is defined as the square of the trace distance D_t between the final state |ψ′⟩ and the initial state |ψ⟩ = V(θ)†|0⟩, with the observable O_f = 1 − |0⟩⟨0|; it is equivalent to C_f = D_t(|ψ′⟩⟨ψ′|)². As a warm-up to our study, the performance of a variational quantum classifier is tested on the parity problem. It is learnable in a quantum setting as a binary classification task with labels w ∈ {0, 1}, and its outcomes are measured in the computational basis. We selected the third qubit for measurement in the Pauli-Z direction and thresholded the expectation value, ⟨Z⟩ ≤ Δ (⟨Z⟩ > Δ), to classify the input vectors into the label w = 0 (w = 1), respectively. We utilized the gradient descent (GD) optimizer to iterate parameter updates based on the gradient of the loss function. It is used to minimize an objective function to a local minimum by adjusting the parameters repeatedly. The partial derivatives of the cost function are determined with respect to each parameter, and the outcomes are stored in a gradient vector.
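The thresholding step described above amounts to the following (the default Δ = 0 is our assumption; the paper does not state the value used):

```python
def classify_from_expectation(z_expval, delta=0.0):
    """Map the measured Pauli-Z expectation value to a binary label using
    the threshold Delta from the text: <Z> <= Delta -> w = 0,
    <Z> >  Delta -> w = 1."""
    return 0 if z_expval <= delta else 1
```

Because ⟨Z⟩ lies in [−1, 1], a symmetric threshold of 0 splits the two labels evenly when the classifier's outputs are unbiased.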
A step of the GD optimizer determines the new values via the rule θ^(t+1) = θ^(t) − η∇f(θ^(t)), where η is a user-defined hyperparameter relating to the step size.

Fig 2. A four-qubit feature map consisting of rotations around the x-axis prepares the initial state. It is followed by three layers of variational gates, where each layer consists of rotation gates with three trainable parameters (R(ω_j) = R(ϕ, θ, ω)) on each of the four qubits, followed by a set of CNOT gates. For two classes, the measurement is performed on one qubit, which is enough to have orthogonal measurements for the classes. https://doi.org/10.1371/journal.pone.0262346.g002

We now consider how the Hessian helps us understand the loss landscape (or curvature information) for the parity problem. Firstly, the parameters are initialized randomly, and we determine the set of parameters that can produce the target state. The optimization problem is translated into loss function minimization. The loss is presented as a function of θ_3 and θ_7, with the rest of the parameters set to their optimal values. Fig 3 shows the loss landscape of the parity problem with a local loss function of the parameters (θ_1 = θ_3 and θ_2 = θ_7). We started from a random initialization. The contour plots in Fig 3(b) and 3(d) show the direction of improvement in the optimization process, i.e., how the parameters (θ_3 and θ_7) are optimized during 100 iterations. The green point denotes the optimal value. The zoomed-out and zoomed-in versions of the contour plots are shown in Fig 3(b) and 3(d) for a better view of the optimal values.
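The gradient-descent update θ^(t+1) = θ^(t) − η∇f(θ^(t)) used throughout this optimization can be sketched as follows (the quadratic objective is a stand-in for the circuit loss):

```python
import numpy as np

def gradient_descent_step(theta, grad_f, eta=0.1):
    """One update of the rule theta_(t+1) = theta_t - eta * grad f(theta_t)."""
    return theta - eta * grad_f(theta)

# Minimizing f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta = np.array([0.0])
for _ in range(100):
    theta = gradient_descent_step(theta, lambda t: 2.0 * (t - 3.0), eta=0.1)
```

Each step contracts the distance to the minimum by a factor of |1 − 2η| here, so the iterate converges geometrically to θ = 3 for any η < 1.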
The minimum value of the loss can be recovered with the gradient descent optimizer due to the local convexity at that point in the landscape. Fig 4 depicts the behavior of the eigenvalues of the Hessian for the parity problem during training. During the optimization process, the Hessian matrix is calculated over all the trainable parameters, and the eigenvalues are recorded after each iteration. In Fig 4, the eigenvalues are plotted for some specific iterations in ascending order to observe how their behavior changes as the trainable parameters are updated during the progression of the optimization process.

Fig 3. In panels (a, c), the loss landscape is visualized with a local loss function for the two parameters θ_3 and θ_7, with the rest of the parameters set to their optimal values. It cannot be visualized for more than 2 parameters in the 3D loss plot, because the full range of the loss between 0 and 1 cannot be obtained. In panels (b, d), the points in the contour plots show the direction of improvement in the optimization process, i.e., how the parameters are optimized during the iterations. Panel (b) is a zoomed-in version of panel (d) for better clarity. The green point depicts the optimal values of θ_3 and θ_7. https://doi.org/10.1371/journal.pone.0262346.g003

Fig 4(a)-4(d) shows the variations between the minimum and maximum eigenvalues at each iteration. The distribution of eigenvalues for the randomly initialized quantum circuit shows a mixture of negative and non-negative values close to zero at iterations 0-7. Fig 4(b) and 4(c) show that some of the eigenvalues are positive, some are negative, and the bulk of them are zero. Fig 4(d) shows the loss for the well-converged variational circuit, where we are left with a single negative eigenvalue of the Hessian and all the rest are non-negative (at the 100th iteration). In fact, the zero gradient ensures that it is a global minimum. Moreover, the zero eigenvalues correspond to directions where variations in the parameters do not alter the curvature information. Thus, the behavior of the Hessian's eigenvalues helps to obtain the directions of minimum and maximum stability in the loss landscape.

Classification of diabetes
Let us consider the classical diabetes dataset for the supervised classification of diabetes. In this section, we analyze the loss landscape through the Hessian for the diabetes dataset and investigate how the VQC performs in predicting diabetes. The diabetes dataset can be downloaded from the UCI machine learning repository [53]. It consists of 8 input features (age, glucose, insulin, pregnancies, body mass index (BMI), skin thickness, diabetes pedigree function, and blood pressure) and one binary output feature.
Firstly, the 8 input features are encoded into the state of the qubits by a quantum feature map. Consider a classical dataset D = {(x_n, y_n)}_{n=1}^{S} for binary classification, where y_n ∈ {0, 1}: 0 for no diabetes and 1 for diabetes. Each segment of the classical data is encoded into the amplitude of a qubit using single-qubit rotations. Afterward, a variational quantum circuit V(θ) is applied to the feature quantum state for training and for the classification of diabetes. The 8-qubit variational quantum circuit with a local cost function is constructed as shown in Fig 5. The feature map consists of Hadamard gates, rotations around the y-axis, and controlled-Z entangling gates. It is followed by the variational part of the circuit, containing single-qubit rotations R(ω_j) = R(φ_1, φ_2, φ_3) on each qubit.

Fig 6. Loss landscape of the diabetes dataset for (θ_0 and θ_24). (a) The loss landscape is demonstrated with a local loss function for the two parameters θ_0 and θ_24. (b, c) The direction of improvement in the optimization process is shown using contour plots, i.e., how the parameters are optimized during the iterations. https://doi.org/10.1371/journal.pone.0262346.g006
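A two-qubit NumPy state-vector sketch of the Hadamard + R_y + controlled-Z feature map described above (the paper uses eight qubits; two keep the state vector small, and the gate ordering is our reading of Fig 5):

```python
import numpy as np

# Single-qubit gates
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)
def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

CZ = np.diag([1.0, 1.0, 1.0, -1.0])            # controlled-Z on two qubits
Z0 = np.kron(np.diag([1.0, -1.0]), np.eye(2))  # Pauli-Z on the first qubit

def feature_state(x):
    """Two-qubit version of the feature map from the text: a layer of
    Hadamards, data-dependent Ry rotations, then a CZ entangler."""
    psi = np.zeros(4)
    psi[0] = 1.0                                 # start in |00>
    psi = np.kron(H, H) @ psi                    # layer of Hadamards
    psi = np.kron(ry(x[0]), ry(x[1])) @ psi      # angle-encode the features
    return CZ @ psi

psi = feature_state([0.4, -1.1])
expval_z0 = float(psi.conj() @ Z0 @ psi)         # <Z> on the first qubit
```

All three layers are unitary, so the state stays normalized; the measured ⟨Z⟩ is the quantity that the classifier thresholds into a label.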
The feature map and variational quantum circuits are repeated two times. The circuit is constructed to be simple enough to be executed on real quantum systems, yet complex enough to separate the input data after mapping. For our experiments, the feature map and variational quantum circuits were constructed with a fixed depth and number of qubits. To evaluate the most likely state of the outcome qubit, the quantum circuit can be run for several iterations with the same input to estimate the probability distribution over the basis states. We used the gradient descent optimization method to find the parameters that yield probabilities closest to reality. The measurement is performed on a qubit by employing the Pauli operators in a specific direction. The outcome qubit yields the predicted value of an input, i.e., the allocated class label. The loss landscape of the VQC for the diabetes dataset, with a loss function of the parameters θ_0 and θ_24, is visualized in Fig 6, where θ_1 and θ_2 are set to θ_0 and θ_24, respectively. Here, the other parameters are set to their optimal values after each iteration. In each iteration, the workflow of the optimization process consists of three steps: feature map, variational circuit, and observation. These are performed on quantum circuits for each of the samples in the dataset, and the loss is calculated by taking the distance of the prediction from the label for each sample and then averaging. The loss is further used in the classical optimization of the trainable parameters of the variational circuit, and finally the trainable parameters are updated as the last step of each iteration. For observation purposes, the Hessian matrix is calculated on the trainable parameters in each iteration, and the eigenvalues of the Hessian matrix are recorded only for data visualization; this need not be incorporated in a practical implementation.
The input vectors are normalized to lie in [−π, π], and the loss landscape is plotted for two parameters in this range. If we try to plot the loss landscape for more than two parameters, the full range of the loss between 0 and 1 cannot be obtained. The contour plots (b) and (c) in Fig 7 show the evolution of the optimized parameters during 30 iterations. Furthermore, the loss landscape with a loss function of the parameters θ_1 and θ_25 is analyzed in Fig 7. It has been observed from the landscape that the optimum cost is a single good local minimum. It shows the prediction map of the VQC with Z-measurement for the diabetes dataset. The distribution of the Hessian's eigenvalues is determined over the training process of the variational circuit to locate one of the minima. It is used to investigate whether a particular stationary point is a saddle point or not. At the beginning of training, the gradient descent method struggles to break the symmetry. Due to the small gradients, it faces a problem in training small quantum circuits. Fig 8(a) shows a distribution of the Hessian's eigenvalues consisting of equally probable negative and positive values, most of them zero, for iterations 0, 2, and 4. We observed that the negative eigenvalues gradually start to disappear as the number of iterations increases, as shown in Fig 8(c). After convergence at the 30th iteration, a single negative eigenvalue is left and all the rest have become non-negative, as shown in Fig 8(d). The bulk of zero eigenvalues indicates flat directions of f(θ, x) in the loss landscape, where alterations in the circuit parameters do not disturb the loss. The positive semi-definite behavior of the Hessian's eigenvalues signifies a very stable result. Although it is not practical to visualize the loss landscape of a VQC in three dimensions, due to the problem of fixing the other parameters, it is feasible to visualize the loss landscape of variational quantum classifiers through the lens of the Hessian.
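The normalization of the input vectors into [−π, π] can be done with min-max scaling (the exact scaling scheme is an assumption; the paper only states the target range):

```python
import numpy as np

def normalize_features(X):
    """Rescale each feature column of X into [-pi, pi] before angle
    encoding. Min-max scaling per column is an assumed choice."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against constant columns
    return (X - lo) / span * (2 * np.pi) - np.pi

X = np.array([[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]])
Xn = normalize_features(X)
```

Mapping each feature onto a full 2π interval matters for angle encoding: rotations are 2π-periodic, so a wider range would alias distinct inputs onto the same quantum state.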

Convergence via adaptive Hessian learning rate
In this section, we show how the Hessian can adaptively adjust the learning rate (LR) for each parameter. Adaptive learning rates have been a popular practice in classical machine learning. In any gradient-based optimization method, a very large learning rate can cause the cost to overshoot when approaching small gradients, whereas a very small LR causes an extremely slow approach toward the lowest point in the loss landscape. Therefore, a smart trade-off between the two leads to the requirement of tuning the learning rate properly, which is not as simple as it looks. We used an adaptive learning rate, which initially starts with a high value and gradually reduces it as the gradient values shrink.
The adaptive Hessian learning rate (A-HLR) is similar to the previous concept, with an add-on specific to the loss landscape analysis of quantum machine learning models. Here, instead of a continuous variation of learning rates, we consider a set of discrete learning rate values. We begin with the largest LR and the gradient descent optimization process. After each step, we compare the updated cost with the previous cost. If, after a few consecutive steps, the difference between the updated and previous cost is below a threshold level, the LR is set to the next smaller discrete value in the set of LRs. The complete process is repeated until we reach the lowest LR in the set of discrete learning rates. Since the loss landscape of quantum machine learning models is quite different from that of classical machine learning models, a specific repetitive pattern can be observed in the landscape, unlike in the classical case, which leads to a higher probability of getting stuck in local minima.
If the execution reaches the lowest LR in the set of discrete learning rates, there are two possible cases: (i) the optimization is stuck in a local minimum, which we need to escape, or (ii) the optimization has reached the global minimum. To tackle these scenarios, we use the Hessian matrix together with a set of decreasing learning rates. The implementation starts with the largest learning rate; for each value of the learning rate in the set, the optimization process continues until the difference of the loss values in two consecutive iterations goes below a tolerance. Once this occurs, the next learning rate is taken from the predefined set, and the process is repeated until it reaches the smallest learning rate value. If the count of negative eigenvalues is less than a threshold value, then the solution obtained is a global minimum; if the count is greater, the optimization is stuck in a local minimum. Note that the threshold on the negative eigenvalue count is a hyperparameter, which has to be tuned properly depending on the type of data in a given dataset (ideally it should be zero, but in practice it is not always possible to reach that point). If the model is stuck in a local minimum, i.e., the learning rate has already reached the lowest of its possible values, then the optimization is restarted with a higher learning rate, with the aim that the cost will overshoot and escape the local minimum. The process is repeated until the LR again reaches its lowest value. Finally, whether the optimization has reached the global minimum is evaluated using the Hessian matrix. Fig 9(a) shows a comparison of how the cost evolves throughout the optimization process with the adaptive Hessian learning rate (A-HLR) method and with the gradient descent method at constant learning rates. A-HLR converged very well within 25 iterations.
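The A-HLR loop just described can be sketched as follows; all names, the discrete LR set, tolerance, patience, and negative-eigenvalue threshold defaults are illustrative assumptions, not values from the paper:

```python
import numpy as np

def a_hlr_optimize(f, grad_f, hess_f, theta, lrs=(0.5, 0.1, 0.02),
                   tol=1e-4, neg_thresh=0, patience=3, max_iters=500):
    """Sketch of the adaptive Hessian learning rate (A-HLR) loop: walk down
    a discrete set of learning rates on cost plateaus; at the smallest LR,
    use the Hessian's negative-eigenvalue count to decide between stopping
    (global minimum) and restarting at the largest LR (local minimum)."""
    lr_idx, stall, prev = 0, 0, f(theta)
    for _ in range(max_iters):
        theta = theta - lrs[lr_idx] * grad_f(theta)      # gradient step
        cost = f(theta)
        stall = stall + 1 if abs(prev - cost) < tol else 0
        prev = cost
        if stall >= patience:                            # cost has plateaued
            stall = 0
            if lr_idx + 1 < len(lrs):
                lr_idx += 1                              # next smaller LR
            else:
                eig = np.linalg.eigvalsh(hess_f(theta))
                if np.sum(eig < -tol) > neg_thresh:
                    lr_idx = 0   # likely a local minimum: overshoot with largest LR
                else:
                    break        # Hessian is (near-)positive semi-definite: stop
    return theta

# Convex toy objective f(t) = t^2, where A-HLR should settle at the minimum.
theta = a_hlr_optimize(lambda t: float(t[0] ** 2),
                       lambda t: 2.0 * t,
                       lambda t: np.array([[2.0]]),
                       np.array([1.0]))
```

On this convex toy problem the Hessian check always passes, so the loop terminates at the smallest LR; the restart branch only fires when negative curvature remains at convergence.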
The gradient descent method with a learning rate of LR = 0.5 also converges, but not as quickly as A-HLR. Fig 9(b) shows how the value of the adaptive learning rate evolves under A-HLR during the first 25 iterations and depicts how it overshoots the local minima of the cost function using LRs from lowest to highest. Fig 9(c) shows the comparison between A-HLR and the gradient descent method with different learning rates over 100 iterations. Fig 9(d) depicts how the value of the adaptive learning rate evolves under A-HLR during the first 100 iterations. A-HLR fits the local shape of the gradient to the loss landscape very well and provides faster convergence than the gradient descent method with constant learning rates. It permits one to select a descent direction for faster convergence during the training of variational circuits. Therefore, local traps in the loss landscape can be avoided by using the adaptive Hessian learning rate approach.

Conclusion
In this paper, the curvature information of the loss landscape of variational quantum classifiers has been visualized through the lens of the Hessian's eigenvalues. We developed a simple theoretical quantum model of Hessians and gradients, justified by numerical experiments on VQCs with real datasets. The parity function problem was considered as a warm-up study to show the behavior of the Hessian's eigenvalues. It has been observed that the VQC has an exceptional ability to generalize on small datasets. Furthermore, we visualized the cost function landscape of a VQC designed for the diabetes dataset; it converges efficiently for data-driven problems. We identified some differences in convergence between the adaptive Hessian learning rate and the gradient descent method with a fixed learning rate. It has been observed that the adaptive Hessian learning rate helps the cost overshoot when it falls into a local minimum and thus converge quickly. It seems beneficial to study the local curvature information of a VQC through the Hessian. The integration of gradient-based methods and noisy intermediate-scale quantum (NISQ) devices is still a young area and potentially has a lot more to offer. In the future, this work will open up new avenues of research in solving classical and quantum optimization problems and in framework design. It will help the research community accelerate the Hessian-based analysis of variational quantum algorithms.

Author Contributions
Conceptualization: Amandeep Singh Bhatia, Kamalpreet Singh Bhangu.

Fig 9. In panels (a, c), the comparison is shown between the adaptive Hessian learning rate and the gradient descent method with constant learning rates. A-HLR shows faster convergence than gradient descent with fixed learning rates, demonstrating stable and efficient convergence of the cost function for the parity dataset during training. Panels (b, d) show how A-HLR evolves during training over 25 and 100 iterations, depicting how the local minima of the cost function are overshot using adaptive Hessian learning rates from lower to higher.