Multiqubit and multilevel quantum reinforcement learning with quantum technologies

We propose a protocol to perform quantum reinforcement learning with quantum technologies. At variance with recent results on quantum reinforcement learning with superconducting circuits, in our current protocol coherent feedback during the learning process is not required, enabling its implementation in a wide variety of quantum systems. We consider diverse possible scenarios for an agent, an environment, and a register that connects them, involving multiqubit and multilevel systems, as well as open-system dynamics. We finally propose possible implementations of this protocol in trapped ions and superconducting circuits. The field of quantum reinforcement learning with quantum technologies will enable enhanced quantum control, as well as more efficient machine learning calculations.


Introduction
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that has attracted increasing attention in recent years. A computer program is usually said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E [1]. In other words, ML addresses the problem of how a computer algorithm can be constructed to automatically improve with experience. Several applications in this field have been implemented, such as handwriting pattern recognition [2], speech recognition [3], and the development of a computer able to beat an expert Go player [4], just to name a few.
The learning process in ML can be divided into three types: supervised learning, unsupervised learning and reinforcement learning [5]. In supervised machine learning, an initial data set is used to train the system for later prediction or data classification. Usually, supervised learning problems are categorized into regression (continuous output) or classification (discrete output). Unsupervised learning addresses problems where no training data is required and only correlations between subsets of the data (clustering) are considered and analyzed. Finally, reinforcement learning [6] differs from supervised and unsupervised learning in that it takes into account a scalar parameter (reward) to evaluate the input-output relation in a trial-and-error way. In this case, the system (the so-called "agent") obtains information from its outer world (the "environment") to decide which is the best way to optimize itself in order to adapt to the environment.
The field of quantum technologies has grown extensively in the past decade. In particular, two architectures that are very promising for the implementation of a quantum computer, in terms of number of qubits and gate fidelities, are trapped ions [34,35] and superconducting circuits [36-38]. Current technological progress in trapped ions has made it possible to implement quantum protocols with several ions involving high-fidelity single- and two-qubit gates as well as high-fidelity readout [39,40]. Superconducting circuits have also proven to be an excellent platform for quantum information processing protocols because of their individual addressing and scalability. Two-qubit quantum gates have achieved fidelities larger than 99% [41,42] in this platform. Furthermore, technological progress in this architecture has made it possible to build artificial atoms with long coherence times in coplanar [43] and 3D architectures [44], allowing for the development of feedback control with superconducting circuits [45,46]. This feedback mechanism has inspired protocols for quantum reinforcement learning with superconducting circuits [23], where the feedback-loop control allows one to reward and restart the system to obtain maximal learning fidelity.
Here, we propose a general protocol to perform quantum reinforcement learning with quantum technologies. We understand general in the sense that it goes beyond the context of qubits for embedding information in agent or environment. In this sense, and at variance with a previous result [23], we extend the realm of the quantum reinforcement learning protocol to multi-qubit, multi-level, and open quantum systems, therefore permitting a wider set of scenarios. Our protocol considers a quantum system (the agent), which interacts with an external quantum system (its environment) via an auxiliary quantum system (a register). The aim of our quantum reinforcement learning protocol is for the agent to acquire information from its environment and adapt to it, via a rewarding mechanism. In this fully quantum scenario the meaning of the learning process is the establishment of quantum correlations among the parties [21]. In our specific case, the quantum agent aims at attaining maximum quantum state overlap with the environment state, in the sense that local measurements on agent and environment will produce the same outcomes or, equivalently, that the agent and environment entangled final state is invariant under the exchange of these two subsystems. An interpretation of this outcome is that the agent can learn about the information embedded in the environment state, which has been consequently modified from a separable to an entangled state with the agent and registers. After this process we are in position of evaluating any figure of merit with the outcome measurements. Optimizing this figure of merit should be associated to a particular learning process probably requiring particular actions to be applied on the agent. Another possible result is obtained by considering projective measurements in the register systems. 
Only after these projective measurements will agent and environment be decoupled from the registers, and the protocol ensures that the former are in a pure correlated state, without requiring any information about their initial states. We analyze the case where the register subspace is larger than the agent and environment subspaces. The inclusion of more elements in the register subspace allows for delaying the application of the rewarding criterion to the end of the quantum protocol. This fact will enable its implementation in a wider variety of quantum platforms, besides superconducting circuits with coherent feedback. We also study quantum reinforcement learning in the case where agent, environment and register are composed of qudits. In this case, we obtain that the maximal learning fidelity is achieved in a number of steps that is fixed with respect to the qudit dimension, and this number scales polynomially with the number of subsystems in the environment subspace. In addition, we analyze quantum reinforcement learning in the situation where the environment is larger than the agent. We highlight two results. The first is obtained when the register has the same number of elements as the environment. In this case, two rewarding criteria are needed to obtain maximal learning fidelity, and the entanglement between the agent and a specific part of the environment is a key resource. The second corresponds to the situation where the register has more elements than the environment. In this case, only one measurement is needed to obtain maximal learning fidelity and the environment-agent entanglement is not a key resource, so the rewarding criterion can be applied at the end of the protocol. Finally, we describe how our quantum learning protocols can be implemented in quantum platforms such as trapped ions and superconducting circuits.

Quantum reinforcement learning protocol with final measurement
Here, we introduce a protocol to perform quantum reinforcement learning, which introduces significant novelties with respect to the existing literature. Unlike a previous quantum reinforcement learning result [23], the protocol described here needs one measurement at the end of the procedure and no feedback, allowing for its implementation in a variety of quantum platforms including ions and photons. The improvement relies on adding more registers than before [23] and making them interact conditionally with each other. The inclusion of ancillary systems has proven to be useful in several implementations of quantum information, because measurements on the ancillary system allow one in principle to obtain information about the main system without destroying it. Moreover, the measurement associated with the rewarding criterion is performed at the end of the protocol. This opens the possibility to implement quantum reinforcement learning protocols in architectures for which implementing coherent feedback may be a challenging problem.
The quantum reinforcement learning protocol described here works in the following way. We first consider an agent and an environment, composed of one qubit each, and two register qubits, see Fig 1. The first step is to encode the environment information in the register states (in the context of classical reinforcement learning, this kind of operation is usually called the action). Subsequently, the internal states of the registers interact conditionally with the agent (in classical reinforcement learning, this kind of operation is usually called the percept). Finally, an agent-register interaction changes the agent state (partial rewarding mechanism). At this stage the rewarding criterion is satisfied, in the form of a correlated agent-environment state, in the sense that local measurements on agent and environment will produce the same outcomes. On the other hand, the agent-environment system is also entangled with the two registers, and in order to attain a correlated pure state of agent and environment, a single, final measurement may be performed on the two register states. This will produce an agent-environment state maximizing the learning fidelity, defined as $F_{AE} = |\langle\psi_A|\phi_E\rangle|$, where $|\psi_A\rangle$ is the agent state and $|\phi_E\rangle$ is the environment state, both after the protocol.
To perform our quantum reinforcement learning protocol, we consider that initially agent and environment are in arbitrary single-qubit pure states, whereas the registers are in their ground state. The first step in the protocol is to extract information from the environment, updating the information in the registers conditionally on the environment state. This is done by applying a pair of CNOT gates in the environment-register subspace, where the environment is the control and each register is a target. Then, the information encoded in the registers is updated conditionally on the agent state. As the register subspace is larger than the agent subspace, we must choose which part of the register subspace the agent will update. Without loss of generality, let us assume that the register $R_1$ is updated. This update is performed by a CNOT gate acting in the $A$-$R_1$ subspace, where the agent is the control and the register is the target. Subsequently, the register $R_2$ is also updated with respect to the $R_1$ state. This is accomplished by applying a CNOT gate in the register subspace, where $R_1$ acts as control and $R_2$ as target. Next, we update the agent state according to the information encoded in the register $R_1$. This is done by applying a CNOT gate in the $R_1$-$A$ subspace, where $R_1$ is the control and $A$ is the target. We point out that, in the state obtained after this last gate, agent and environment are already maximally correlated, in the sense of having the same outcomes with respect to local measurements performed on either of them, or, equivalently, the state is invariant under particle exchange with respect to the agent-environment subsystem. We also remark that this state is general, valid for any initial agent and environment states.
The fact that agent and environment become entangled with the two registers allows one to distinguish between identical agent-environment components that originate from different initial states, namely, to distinguish between components arising from the amplitude products $\alpha^{0}_{A}\alpha^{0}_{E}$ or $\alpha^{1}_{A}\alpha^{0}_{E}$, as well as from $\alpha^{0}_{A}\alpha^{1}_{E}$ or $\alpha^{1}_{A}\alpha^{1}_{E}$. Finally, by performing a projective measurement on the register subspace, the rewarding criterion is satisfied. It is easy to show that, independently of the measurement outcome, the learning fidelity $F_{AE} = |\langle\psi_A|\phi_E\rangle|$ is maximal, given that agent and environment end up in the same state, either $|0\rangle$ or $|1\rangle$. In this case only one iteration of the protocol is sufficient for the agent to adapt to the environment. Moreover, throughout the protocol, measurements on agent and/or environment are not required, which may allow its implementation in a variety of quantum platforms such as trapped ions, superconducting circuits, and quantum photonics.
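The five CNOT gates described above can be checked numerically. The following is a minimal sketch, assuming a NumPy state tensor with axes ordered (A, E, R1, R2) and randomly drawn initial agent and environment states; the helper names `cnot`, `run_protocol` and `learning_fidelities` are illustrative and do not come from the original work.

```python
import numpy as np

A, E, R1, R2 = 0, 1, 2, 3  # assumed axis order of the state tensor


def cnot(state, control, target):
    """Apply a CNOT to a tensor that has one axis of size 2 per qubit."""
    idx = list(np.indices(state.shape))
    idx[target] = idx[target] ^ idx[control]  # target -> target XOR control
    out = np.zeros_like(state)
    out[tuple(idx)] = state
    return out


def random_qubit(rng):
    """Random normalised single-qubit pure state."""
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)


def run_protocol(agent, env):
    """Apply the five CNOTs of the protocol; both registers start in |0>."""
    zero = np.array([1.0, 0.0], dtype=complex)
    state = np.einsum("a,e,r,s->aers", agent, env, zero, zero)
    for control, target in [(E, R1), (E, R2), (A, R1), (R1, R2), (R1, A)]:
        state = cnot(state, control, target)
    return state


def learning_fidelities(state):
    """Map each register outcome to (probability, F_AE, second singular value)."""
    results = {}
    for r1 in range(2):
        for r2 in range(2):
            block = state[:, :, r1, r2]  # unnormalised agent-environment state
            prob = float(np.sum(np.abs(block) ** 2))
            if prob < 1e-12:
                continue
            u, s, vh = np.linalg.svd(block / np.sqrt(prob))
            # s[1] close to 0 means the post-measurement state is a product state,
            # with agent state u[:, 0] and environment state vh[0, :].
            fidelity = abs(np.vdot(u[:, 0], vh[0, :]))
            results[(r1, r2)] = (prob, fidelity, s[1])
    return results


rng = np.random.default_rng(0)
final = run_protocol(random_qubit(rng), random_qubit(rng))
outcomes = learning_fidelities(final)
for outcome, (prob, fidelity, _) in outcomes.items():
    print(outcome, round(prob, 4), round(fidelity, 6))
```

In this sketch every register outcome should leave agent and environment in a product state with unit overlap, which is the sense in which the learning fidelity is maximal irrespective of the measurement result.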
In our protocol, we do not need coherent feedback given that the registers entangle with agent and environment and as a result produce the desired agent-environment state that is invariant under permutation. It is true that the entanglement with the registers produces a mixed state in case the register states are discarded, but this is not a drawback in our protocol. Indeed, what our protocol does is, for arbitrary initial agent and environment states, which need not be known, to give a constructive way to produce a final agent-environment state perfectly correlated, in the sense of invariant under permutations in agent-environment subspace. This state is in general entangled, namely, quantum, and we do not need to perform any measurement on agent and environment during the protocol, namely, it can equally well work with photons, ions, and superconducting circuits, among others. After the production of the agent-environment-register entangled state, the registers are entangled with agent and environment, but this does not prevent us from measuring the registers at a certain desired time, and decoupling agent and environment from them. This way, we will not have measured agent and environment at any time of the protocol, and we can assure that they are perfectly correlated irrespective of their initial states, and without having any prior information about them. This may be useful, e.g., for distributing private keys in quantum cryptography for arbitrary, unknown, initial states, without the need to initialize agent and register in reference states.

Quantum reinforcement learning for multiqubit systems with final measurement
In the previous section, we have shown that, by considering more than one register, the rewarding criterion in the quantum reinforcement learning algorithm can be applied at the end of the protocol. The same result is obtained for more complex configurations. Indeed, by assuming that agent and environment are composed of two qubits each, and that four qubits act as registers, we show that the rewarding criterion can also be applied at the end of the quantum protocol. Let us illustrate this fact with an analysis for multiqubit agent, environment, and register states. Following the same procedure described previously, the protocol consists mainly of three types of interaction, as shown in Fig 2. Firstly, we update the registers conditionally on the environment states. More specifically, we consider an interaction between the environment qubits $E_1$ and $E_2$ and the registers $R_1$ and $R_2$, respectively, where the environment qubits act as controls and the registers as targets of the CNOT gates. Thereafter, we similarly update the remaining registers, that is, we apply CNOT gates between the environment qubits $E_1$ and $E_2$ and the register qubits $R_3$ and $R_4$, respectively. The next step consists in updating a part of the register subspace conditionally on the agent state.
Thus, the registers $R_1$ and $R_2$ are updated via $A_1$ and $A_2$, respectively. Afterwards, to obtain orthogonal outcomes in the register subspace, we perform a pair of CNOT gates in this subspace. The interaction takes place between registers that have interacted with a common environment qubit: register $R_1$ interacts with $R_3$ because both have interacted with $E_1$, and similarly $R_2$ interacts with $R_4$ because both have interacted with $E_2$. In this case, $R_1$ ($R_2$) is the control and $R_3$ ($R_4$) is the target.
Finally, we update the agent according to the register states so that the rewarding criterion is satisfied. This is done by applying two CNOT gates in the agent-register subspace, where $A_1$ is controlled by $R_1$ and $A_2$ is controlled by $R_2$. From the resulting state, Eq (16), it is straightforward to see that the learning fidelity is maximal independently of the measurement outcomes. Moreover, as in the previous case, a single iteration of the quantum reinforcement learning protocol suffices to obtain maximal learning fidelity.
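The ten-gate sequence of this section can be verified in the same spirit. The sketch below assumes the axis order (A1, A2, E1, E2, R1, R2, R3, R4) and draws random, generally entangled, two-qubit agent and environment states; these choices and the helper names are assumptions made for illustration only.

```python
import numpy as np

A1, A2, E1, E2, R1, R2, R3, R4 = range(8)  # assumed axis order


def cnot(state, control, target):
    """Apply a CNOT to a tensor that has one axis of size 2 per qubit."""
    idx = list(np.indices(state.shape))
    idx[target] = idx[target] ^ idx[control]
    out = np.zeros_like(state)
    out[tuple(idx)] = state
    return out


def random_state(n_qubits, rng):
    """Random normalised pure state of n_qubits (generally entangled)."""
    shape = (2,) * n_qubits
    v = rng.normal(size=shape) + 1j * rng.normal(size=shape)
    return v / np.linalg.norm(v)


GATES = [
    (E1, R1), (E2, R2),  # encode the environment in R1, R2
    (E1, R3), (E2, R4),  # encode the environment in R3, R4
    (A1, R1), (A2, R2),  # update R1, R2 conditionally on the agent
    (R1, R3), (R2, R4),  # make the register outcomes orthogonal
    (R1, A1), (R2, A2),  # update the agent
]

rng = np.random.default_rng(1)
agent, env = random_state(2, rng), random_state(2, rng)
registers = np.zeros((2,) * 4, dtype=complex)
registers[0, 0, 0, 0] = 1.0
state = np.einsum("ab,cd,efgh->abcdefgh", agent, env, registers)
for control, target in GATES:
    state = cnot(state, control, target)

# Probability of finding any agent qubit different from its environment partner.
idx = np.indices(state.shape)
mismatch = (idx[A1] != idx[E1]) | (idx[A2] != idx[E2])
weight_mismatch = float(np.sum(np.abs(state[mismatch]) ** 2))

# Rows: (A1, A2, E1, E2) basis states; columns: register outcomes.
flat = np.abs(state.reshape(16, 16)) ** 2
terms_per_outcome = (flat > 1e-12).sum(axis=0)

print("weight on A != E:", weight_mismatch)
print("max terms per register outcome:", terms_per_outcome.max())
```

A vanishing mismatch weight, together with a single agent-environment term per register outcome, would indicate that one final register measurement leaves agent and environment in the same computational-basis state.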

Quantum reinforcement learning for qudit systems
So far, we have studied quantum reinforcement learning processes only for two-level systems or pairs of them. However, there are several quantum systems which cannot be described in terms of a two-level system. Examples are quantum harmonic oscillators, electronic energy levels in an ion, and superconducting artificial atoms such as transmons [47], which for some regimes of Josephson energy must be considered as three-level systems. In this context, it is interesting to extend the quantum reinforcement learning protocol developed here to cases where multilevel systems compose the agent, environment, and register. To this end, we first need to define a set of logic operations that we will perform on our system. In the qubit case, the main logical operation applied is the CNOT gate, which is a conditional interaction between two qubits, where one acts as a control while the other acts as a target. The control qubit remains unchanged whereas the target qubit output is modified by addition modulo 2. It is then natural to assume that the set of logic operations between multilevel systems could be defined in terms of an addition modulo D, where D stands for the dimension of one subsystem (agent, environment or register subspaces), according to $\mathrm{XOR}_{12}|i\rangle_1|j\rangle_2 = |i\rangle_1|i \oplus j\rangle_2$ (17). Here, $i \oplus j$ stands for addition modulo D. This gate is usually known as the XOR gate [48]. For two-dimensional systems, this gate corresponds to the CNOT gate. Nevertheless, for higher dimensional systems this definition presents several disadvantages. For instance, the XOR gate defined as in Eq (17) is unitary but not Hermitian for D > 2. Moreover, this logical operation is no longer its own inverse. To avoid these problems, the generalized XOR gate (GXOR) has been defined in the literature [48] as $\mathrm{GXOR}_{12}|i\rangle_1|j\rangle_2 = |i\rangle_1|i \ominus j\rangle_2$ (18), where the operation $\ominus$ denotes the difference $i - j$ modulo D. The GXOR gate of Eq (18) does not present the disadvantages pointed out for the definition of Eq (17).
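The algebraic properties stated for the two gates can be checked on explicit matrices. The sketch below assumes the computational-basis ordering $|i\rangle|j\rangle \mapsto iD + j$ and builds both gates as permutation matrices; the function names are illustrative.

```python
import numpy as np


def two_qudit_gate(D, rule):
    """Permutation matrix sending |i>|j> to |i>|rule(i, j)> for D-level systems."""
    gate = np.zeros((D * D, D * D))
    for i in range(D):
        for j in range(D):
            gate[i * D + rule(i, j), i * D + j] = 1.0
    return gate


def xor_gate(D):
    return two_qudit_gate(D, lambda i, j: (i + j) % D)  # addition modulo D


def gxor_gate(D):
    return two_qudit_gate(D, lambda i, j: (i - j) % D)  # difference modulo D


def is_unitary(m):
    return np.allclose(m @ m.conj().T, np.eye(len(m)))


def is_hermitian(m):
    return np.allclose(m, m.conj().T)


def is_self_inverse(m):
    return np.allclose(m @ m, np.eye(len(m)))


for D in (2, 3, 4):
    x, g = xor_gate(D), gxor_gate(D)
    print(D, "XOR :", is_unitary(x), is_hermitian(x), is_self_inverse(x))
    print(D, "GXOR:", is_unitary(g), is_hermitian(g), is_self_inverse(g))
```

For D = 2 both constructions reduce to the CNOT matrix. For D > 2 the check separates them because a permutation matrix is Hermitian exactly when it is its own inverse, which holds for $j \mapsto i - j$ but not for $j \mapsto i + j$.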
That is, the GXOR gate is Hermitian and unitary, and $i \ominus j = 0$ only when $i = j$. Building on our proposed protocol for the single-qubit case, we show that for multilevel systems the number of interactions needed to obtain maximal learning fidelity is fixed and depends only on the number of agent subsystems in the protocol. Let us illustrate this with an example of a multilevel agent-environment-register state. The first step in our protocol is identical to the equivalent one in the single-qubit case. We update the register conditionally on the environment state, that is, we transfer information from the environment and encode it in the register system. This is done by applying a pair of GXOR gates acting in the environment-register subsystem. In this case, the environment interacts with both registers $R_1$ and $R_2$, acting as the control while both registers are targets. Once the information has been transferred to the register, we update the register $R_1$ based on the agent state. That is, we perform a GXOR gate in the subspace composed of agent and register, where the agent acts as the control and the register $R_1$ is the target. Orthogonal measurement outcomes in the register subspace are provided by interactions between the registers. Thus, we apply a GXOR gate in the register subspace, where $R_1$ is the control and $R_2$ is the target. Subsequently, the agent state is updated conditionally on the information encoded in the state of the register $R_1$. The GXOR gate is applied in the register-agent subspace, where $R_1$ is the control and the agent is the target. For $D = 2$, we recover the result discussed previously because $0 \ominus m = m$ in that dimension. On the other hand, we are interested in systems with more energy levels, so we need to adapt the protocol to obtain maximal learning fidelity for a fixed number of steps.
In this case, we update the agent subsystem through an iterative interaction with registers $R_1$ and $R_2$, as shown in Fig 3. Here, the agent always acts as the target, while the registers are the controls. Therefore, we apply a GXOR gate between the register $R_2$ and the agent. Next, we apply a GXOR gate between the register $R_1$ and the agent, $\mathrm{GXOR}_{(R_1,A)}|\Psi_6\rangle$. We subsequently perform a GXOR gate in the subspace composed of $R_2$ and the agent $A$. Finally, applying a GXOR gate on the register-agent subspace, $\mathrm{GXOR}_{(R_1,A)}|\Psi_8\rangle$, we obtain the desired result. By considering a fixed number of interactions between agent, environment and register, the learning fidelity becomes maximal independently of the measurement outcome on the register subspace, and the measurement can again be carried out at the end of the protocol. Thus, in a machine learning protocol where the learning units are composed of multilevel systems (see Fig 3), the number of logical operations required to obtain maximal learning fidelity does not depend on the system dimension.
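The nine-gate qudit sequence can be simulated for several dimensions. The sketch below assumes the axis order (A, E, R1, R2), random initial agent and environment states, and the GXOR action $|i\rangle|j\rangle \to |i\rangle|i \ominus j\rangle$; the helper names are illustrative.

```python
import numpy as np

A, E, R1, R2 = 0, 1, 2, 3  # assumed axis order

GATES = [
    (E, R1), (E, R2),                              # encode the environment
    (A, R1),                                       # update R1 with the agent
    (R1, R2),                                      # orthogonalise register outcomes
    (R1, A), (R2, A), (R1, A), (R2, A), (R1, A),   # iterative agent update
]


def gxor(state, control, target, D):
    """GXOR on a state tensor: |i>_c |j>_t -> |i>_c |(i - j) mod D>_t."""
    idx = list(np.indices(state.shape))
    idx[target] = (idx[control] - idx[target]) % D
    out = np.zeros_like(state)
    out[tuple(idx)] = state
    return out


def random_qudit(D, rng):
    """Random normalised pure state of a D-level system."""
    v = rng.normal(size=D) + 1j * rng.normal(size=D)
    return v / np.linalg.norm(v)


def run_protocol(D, rng):
    """Apply the nine GXOR gates; both registers start in their ground state."""
    ground = np.zeros(D, dtype=complex)
    ground[0] = 1.0
    state = np.einsum(
        "a,e,r,s->aers", random_qudit(D, rng), random_qudit(D, rng), ground, ground
    )
    for control, target in GATES:
        state = gxor(state, control, target, D)
    return state


def weight_off_diagonal(state):
    """Probability of finding agent and environment in different levels."""
    idx = np.indices(state.shape)
    return float(np.sum(np.abs(state[idx[A] != idx[E]]) ** 2))


def max_terms_per_outcome(state, D):
    """Largest number of agent-environment terms attached to one register outcome."""
    flat = np.abs(state.reshape(D * D, D * D))
    return int((flat > 1e-12).sum(axis=0).max())


rng = np.random.default_rng(2)
for D in (2, 3, 4, 5):
    final = run_protocol(D, rng)
    print(D, len(GATES), weight_off_diagonal(final), max_terms_per_outcome(final, D))
```

The same gate list is used for every D, which illustrates the statement that the number of operations does not depend on the dimension.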

Example
Here, we exemplify how our reinforcement learning protocol works in qudit systems. We consider, without loss of generality, the case of dimension $D = 4$. In this case, the agent-environment-register state has the form $|\Psi\rangle_0 = |A\rangle|E\rangle|R\rangle$. As mentioned previously, the quantum gate considered is a GXOR gate with subtraction modulo 4. The first step is to update the register according to the environment information. Subsequently, the register is updated conditionally on the agent state. Then, to obtain orthogonal measurement outcomes in the register basis, we perform an interaction in the register subspace. Now, we need to apply iterative interactions in the register-agent subspace to update the agent in each step until we reach maximal learning fidelity with respect to the environment. We start by performing a GXOR gate between the register $R_1$ and the agent. We then apply the GXOR gate in the $R_2$-agent subspace. Afterwards, we perform a GXOR gate between $R_1$ and $A$. Subsequently, an interaction in the $R_2$-agent subspace is performed. Finally, we apply a GXOR gate between $R_1$ and the agent. Based on the quantum protocol described previously (see Fig 3), we have thus shown that, for a fixed number of interactions, we obtain maximal learning fidelity irrespective of the system dimension.
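The steps of this example can be followed on a single computational-basis component. The worked derivation below is a sketch that assumes the GXOR action $|i\rangle|j\rangle \to |i\rangle|i \ominus j\rangle$ and tracks the labels of $(A, E, R_1, R_2)$, with all arithmetic taken modulo 4; the arrow notation "control $\to$ target" is introduced here for illustration.

```latex
% General basis component |a>_A |e>_E |0>_{R_1} |0>_{R_2}, arithmetic modulo 4.
\begin{align*}
(a,\ e,\ 0,\ 0)
 &\xrightarrow{E \to R_1,\ E \to R_2} (a,\ e,\ e,\ e) \\
 &\xrightarrow{A \to R_1} (a,\ e,\ a - e,\ e) \\
 &\xrightarrow{R_1 \to R_2} (a,\ e,\ a - e,\ a - 2e) \\
 &\xrightarrow{R_1 \to A} (-e,\ e,\ a - e,\ a - 2e) \\
 &\xrightarrow{R_2 \to A} (a - e,\ e,\ a - e,\ a - 2e) \\
 &\xrightarrow{R_1 \to A} (0,\ e,\ a - e,\ a - 2e) \\
 &\xrightarrow{R_2 \to A} (a - 2e,\ e,\ a - e,\ a - 2e) \\
 &\xrightarrow{R_1 \to A} (e,\ e,\ a - e,\ a - 2e).
\end{align*}
% Instance a = 1, e = 3:
\begin{align*}
(1,3,0,0) &\to (1,3,3,3) \to (1,3,2,3) \to (1,3,2,3) \to (1,3,2,3) \\
          &\to (2,3,2,3) \to (0,3,2,3) \to (3,3,2,3) \to (3,3,2,3).
\end{align*}
```

The register labels $r_1 = a - e$ and $r_2 = a - 2e$ determine the input uniquely, since $a = 2r_1 - r_2$ and $e = r_1 - r_2$ modulo 4, so distinct components are tagged by orthogonal register states and a register measurement leaves agent and environment in the same level $e$.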

Quantum reinforcement learning in multiqudit systems
In the previous section, we proved that for an agent and an environment each composed of one multilevel system, the quantum reinforcement learning protocol yields maximal learning fidelity in a fixed number of steps, irrespective of the dimension. Here, using this result, we also prove that for more than one multilevel system in the agent, environment, and register subspaces, the number of steps is also fixed and scales with the number of individual subsystems that compose both agent and environment. To be more specific, in the single-multilevel case the total number of steps needed is nine. For two multilevel systems, we show that the number of required steps is eighteen, and in general, 9n, with n being the number of multilevel subsystems. The possible initial states of our protocol consist of arbitrary superpositions for both agent and environment, with the registers in their ground state. The first step in the protocol consists in encoding the environment information in the register states. This is done by applying a pair of GXOR gates in the environment-register subspace, with the same interaction as described previously. Namely, $E_1$ controls $R_1$ and $E_2$ controls $R_2$.
Similarly, in the second step we encode the environment information in the other two registers ($R_3$ and $R_4$) through GXOR gates. Here, the control system is the environment while the targets are the registers.
Subsequently, a part of the register subspace is updated conditionally on the agent information. Therefore, we apply a pair of GXOR gates on the agent-register subspace. In this case, agents $A_1$ and $A_2$ are the controls and registers $R_1$ and $R_2$ the targets.
Now, we update the register subspace through interactions between register components that have been acted upon by the same part of the environment. Namely, register $R_3$ is updated under the control of $R_1$ (and similarly $R_4$ under the control of $R_2$).
Subsequently, we need to apply successive interactions between agent states and register states to obtain maximal learning fidelity. We show that maximal learning fidelity is reached by applying the same interactions as in the single multilevel case to the triplet formed by agent $A_1$ with the register parts $R_1$ and $R_3$ (and similarly $A_2$ with $R_2$ and $R_4$). Summarizing, for the case studied in this section, we demonstrate that the number of operations required to obtain maximal learning fidelity does not depend on the dimension of the learning units and is equal to eighteen, which is twice the number of steps required in the single multilevel case. It is straightforward to see that the number of operations needed to achieve maximal learning fidelity in a machine learning protocol composed of n subsystems for agent and environment is equal to 9n. Namely, the number of operations scales polynomially, indeed linearly, with the number of subsystems.
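The 9n gate count and the outcome of the multiqudit protocol can be checked without building state vectors. Every GXOR gate permutes computational-basis states, so the sketch below tracks basis labels only and verifies two things: each agent label ends equal to its environment label, and the register record is different for every input, which corresponds to orthogonal register states. The 0-based pair index k, the register pairing (R_k, R_{n+k}) and the function names are assumptions made for illustration.

```python
from itertools import product


def gate_list(n):
    """(control, target) pairs for n agent/environment qudit pairs.

    Pair k uses agent ("A", k), environment ("E", k) and registers
    ("R", k) and ("R", n + k).
    """
    pairs = range(n)
    first = [("R", k) for k in pairs]
    second = [("R", n + k) for k in pairs]
    gates = [(("E", k), first[k]) for k in pairs]      # E_k -> R_k
    gates += [(("E", k), second[k]) for k in pairs]    # E_k -> R_{n+k}
    gates += [(("A", k), first[k]) for k in pairs]     # A_k -> R_k
    gates += [(first[k], second[k]) for k in pairs]    # R_k -> R_{n+k}
    for reg in (first, second, first, second, first):  # five agent updates
        gates += [(reg[k], ("A", k)) for k in pairs]
    return gates


def propagate(agent, env, D):
    """Track one computational-basis input through the GXOR sequence."""
    n = len(agent)
    labels = {("A", k): agent[k] for k in range(n)}
    labels.update({("E", k): env[k] for k in range(n)})
    labels.update({("R", k): 0 for k in range(2 * n)})
    for control, target in gate_list(n):
        labels[target] = (labels[control] - labels[target]) % D
    return labels


def check(n, D):
    """True if every input ends with A_k = E_k and a unique register record."""
    records = set()
    for agent in product(range(D), repeat=n):
        for env in product(range(D), repeat=n):
            out = propagate(agent, env, D)
            if any(out[("A", k)] != env[k] for k in range(n)):
                return False
            if any(out[("E", k)] != env[k] for k in range(n)):
                return False
            records.add(tuple(out[("R", k)] for k in range(2 * n)))
    return len(records) == D ** (2 * n)


for n, D in [(1, 4), (2, 3), (3, 2)]:
    print(n, D, len(gate_list(n)), check(n, D))
```

By linearity, a property that holds for every basis input with distinct register records also holds for arbitrary superpositions after the final register measurement.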

Quantum reinforcement learning in larger environments
Up to now, the quantum reinforcement learning protocol described here has always assumed that the agent and the environment have the same number of subsystems, as well as the same dimension. In these cases, we have shown that adding more registers improves the quantum protocol, in the sense that only one iteration and one measurement are enough to obtain maximal learning fidelity. Nevertheless, in more realistic scenarios, the agent must adapt to larger or more complex surroundings. Here, we discuss the situation where the environment has more subsystems than the agent, and therefore a larger dimension. As the environment holds more information than the agent, it is expected that not all available surrounding information will be transferred to the agent. Indeed, we prove that, depending on the register-environment interaction, the agent can encode the information from one specific part of the environment. In this case, unlike the protocol previously discussed, we achieve maximal learning fidelity after applying one measurement and a rewarding iteration (feedback). The proposed quantum protocol is shown in Fig 4. Here, one two-level system forms the agent, while register and environment each consist of two qubits. Each environment qubit interacts with one register qubit, such that this interaction updates the registers conditionally on the environment information. Then, one part of the register subspace is also updated conditionally on the agent state. Subsequently, we perform a measurement on the register subspace and, depending on the measurement outcomes, apply a conditional operation in the agent-register subspace until the agent adapts to a specific part of the environment. To illustrate this, let us introduce a possible agent-register-environment state. The first step is to transfer quantum information from the environment onto the registers.
This is done by applying a pair of CNOT gates in the environment-register subspaces. Subsequently, the register $R_1$ is updated conditionally on the agent information. Therefore, a CNOT gate is applied in the agent-register subspace, where the agent qubit is the control and the register $R_1$ is the target. Afterwards, we perform a measurement on the register subspace. In this case, the wave function is projected onto one of the four possible measurement outcomes. As a result, the projective measurement on the register subspace leaves the agent and one part of the environment subspace ($E_1$) in an entangled state. At this stage, we can apply the rewarding criterion, which consists in performing a CNOT gate in the register-agent subspace, where the register qubit $R_1$ is the control and the agent is the target. Finally, we perform a CNOT gate in the agent-register subspace to obtain orthogonal measurement outcomes, where the agent qubit is the control and the register qubit $R_1$ is the target. In this quantum reinforcement learning protocol, we perform interactions between the environment and the register subspaces. Nevertheless, the agent is updated only according to the information encoded in register $R_1$. Thus, the maximal learning fidelity is achieved with respect to the first qubit of the environment. Let us now consider another configuration, similar to the one studied previously in this article, where the register is formed by a larger number of subsystems than the environment. Here, additionally, the environment we consider is larger than the agent. We prove that, for this system configuration, maximal learning fidelity between the agent and one part of the environment is achieved in one rewarding process. For this configuration, the maximal fidelity does not depend on the entanglement present in the agent-environment subspace.
We start from a general agent-register-environment state. The quantum protocol consists in updating the registers $R_{1,2}$ conditionally on the environment state $E_{1,2}$. After this, we also update the registers $R_{3,4}$ conditionally on the environment state $E_{1,2}$. Now, the register $R_1$ is updated conditionally on the agent state. The next step would consist in updating a part of the register subspace from the information encoded in the other part. However, this step is not necessary because the number of terms in Eq (63) is smaller than the number of possible measurement outcomes in the register subspace. Thus, the register is always projected onto orthogonal measurement outcomes. We then update the agent state from the information encoded in the register $R_1$. Therefore, we perform a CNOT gate in the register-agent subspace, where the register $R_1$ is the control and the agent is the target. By measuring the register subspace, we obtain that the agent and the environment qubit $E_1$ achieve maximal fidelity.
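Both configurations of this section can be compared numerically. The sketch below assumes a single agent qubit, a random (generally entangled) two-qubit environment, and registers initialised in $|0\rangle$. In the two-register case the measurement is modelled by projecting onto each register outcome in turn before the rewarding CNOT; in the four-register case no intermediate measurement is applied. Axis orders and helper names are illustrative assumptions.

```python
import numpy as np


def cnot(state, control, target):
    """Apply a CNOT to a tensor that has one axis of size 2 per qubit."""
    idx = list(np.indices(state.shape))
    idx[target] = idx[target] ^ idx[control]
    out = np.zeros_like(state)
    out[tuple(idx)] = state
    return out


def initial_state(n_registers, rng):
    """Random agent qubit, random two-qubit environment, registers in |0...0>."""
    agent = rng.normal(size=2) + 1j * rng.normal(size=2)
    env = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    regs = np.zeros((2,) * n_registers, dtype=complex)
    regs[(0,) * n_registers] = 1.0
    state = np.multiply.outer(np.multiply.outer(agent, env), regs)
    return state / np.linalg.norm(state)


def mismatch_weight(state, a, e):
    """Probability of finding qubit a and qubit e in different basis states."""
    idx = np.indices(state.shape)
    return float(np.sum(np.abs(state[idx[a] != idx[e]]) ** 2))


rng = np.random.default_rng(3)

# Configuration 1: two registers, measurement followed by a rewarding step.
A, E1, E2, R1, R2 = range(5)
state = initial_state(2, rng)
for c, t in [(E1, R1), (E2, R2), (A, R1)]:
    state = cnot(state, c, t)
feedback_weights = []
for r1 in range(2):
    for r2 in range(2):
        post = np.zeros_like(state)
        post[..., r1, r2] = state[..., r1, r2]  # project onto outcome (r1, r2)
        post /= np.linalg.norm(post)
        post = cnot(post, R1, A)                # rewarding step
        post = cnot(post, A, R1)                # orthogonalise R1
        feedback_weights.append(mismatch_weight(post, A, E1))

# Configuration 2: four registers, single measurement deferred to the end.
A, E1, E2, R1, R2, R3, R4 = range(7)
state = initial_state(4, rng)
for c, t in [(E1, R1), (E2, R2), (E1, R3), (E2, R4), (A, R1), (R1, A)]:
    state = cnot(state, c, t)
final_weight = mismatch_weight(state, A, E1)

print(max(feedback_weights), final_weight)
```

In both cases a vanishing mismatch weight indicates that the agent ends up perfectly correlated with the environment qubit $E_1$ only, in line with the statement that the agent adapts to one part of the environment.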

Quantum reinforcement learning for mixed states
Let us now consider the situation where the environment evolves under a noisy mechanism (for qubit states, noisy mechanisms can be depolarizing noise as well as amplitude damping). In this case, the environment is described by a density matrix $\rho$ with elements $\rho_{ij}$. We now focus our attention on the application of the quantum reinforcement learning protocol to this type of state. We will show that, by adding more registers, two main results are obtained. Firstly, even though the environment is in a mixed state, the learning fidelity is maximal for any measurement outcome in the register basis. Additionally, the measurement outcomes provide relevant information about the coherences of the mixed state. To apply the quantum protocol, we express the mixed state in terms of its (non-unique) purification, such as $|\psi_e\rangle = \frac{\rho_{10}}{\sqrt{\rho_{00}}}\,|e_1\rangle + \sqrt{\rho_{11} - \frac{|\rho_{10}|^2}{\rho_{00}}}\,|e_2\rangle$. Here, $|\bar{\psi}_e\rangle$ is a normalized vector in the purification Hilbert space. As we can see, the coefficients of the quantum state written in its extended Hilbert space (environment + purification) depend only on the diagonal terms of the mixed state. Moreover, to obtain additional information about the mixed state, we need to perform unitary transformations on it in such a way that the information related to the coherences appears on the diagonal of the state after the transformation. To be more specific, we need to perform unitary transformations such that the mixed state can be written as follows, $\bar{\rho} \to U_y \rho\, U_y^{\dagger} = \frac{1}{2}\begin{pmatrix} 1 + (\rho_{01} + \rho_{01}^{*}) & \rho_{11} - \rho_{00} + (\rho_{01} - \rho_{01}^{*}) \\ \rho_{11} - \rho_{00} - (\rho_{01} - \rho_{01}^{*}) & 1 - (\rho_{01} + \rho_{01}^{*}) \end{pmatrix}$ (68). To carry out this task, we need to add three more registers, each of which encodes information on the diagonal, the real part, and the imaginary part of the coherence terms, respectively.
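The two expressions above can be cross-checked on a random density matrix. The sketch below rests on two assumptions that are not stated explicitly in the text: that the purification has the form $\sqrt{\rho_{00}}\,|0\rangle|e_1\rangle + |1\rangle|\psi_e\rangle$, and that $U_y$ is a $\pi/2$ rotation about the y axis with matrix $\frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}$. Under these assumptions the purification should reproduce $\rho$ and the rotated state should match Eq (68).

```python
import numpy as np

rng = np.random.default_rng(4)

# Random single-qubit density matrix (full rank, so rho_00 > 0).
m = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
rho = m @ m.conj().T
rho /= np.trace(rho).real

r00, r01, r10, r11 = rho[0, 0].real, rho[0, 1], rho[1, 0], rho[1, 1].real

# Unnormalised vector |psi_e> in the purification basis {|e_1>, |e_2>}.
psi_e = np.array([r10 / np.sqrt(r00), np.sqrt(r11 - abs(r10) ** 2 / r00)])

# Assumed purification sqrt(rho_00)|0>|e_1> + |1>|psi_e>,
# stored as a matrix indexed by [environment, purifier].
purification = np.array([[np.sqrt(r00), 0.0], psi_e])
rho_from_purification = purification @ purification.conj().T  # trace out purifier

# Assumed form of U_y: a pi/2 rotation about the y axis.
U_y = np.array([[1.0, 1.0], [-1.0, 1.0]]) / np.sqrt(2)
rotated = U_y @ rho @ U_y.conj().T

# Right-hand side of Eq (68).
eq68 = 0.5 * np.array([
    [1 + (r01 + np.conj(r01)), r11 - r00 + (r01 - np.conj(r01))],
    [r11 - r00 - (r01 - np.conj(r01)), 1 - (r01 + np.conj(r01))],
])

print(np.allclose(rho_from_purification, rho), np.allclose(rotated, eq68))
```

After the rotation the diagonal carries the real part of the coherence $\rho_{01}$, which is the sense in which coherence information is moved onto the populations.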
We consider a possible state for the space composed of agent, mixed environment, and registers, in which the register $R_1$ acts as the control and the agent $A$ as the target. This quantum reinforcement learning protocol exhibits two features. First, by performing projective measurements on registers $R_1$, $R_2$ and $R_3$, we recover the result studied in the first section, i.e., the learning fidelity is maximal independently of the measurement outcomes in the register subspace. The second feature is that, for specific measurement outcomes in a part of the register subspace, we obtain information about the populations (diagonal terms) and the coherences (off-diagonal terms) of the mixed state. This feature can be used in problems such as partial cloning, in cases where the system from which we extract information evolves under loss mechanisms.

Analysis of implementation in quantum technologies
An interesting result obtained in this manuscript is that, in most cases, for the considered quantum reinforcement learning protocols, adding more registers improves the rewarding process. That is, via a purely unitary evolution, without coherent feedback, a maximally positively-correlated agent-environment state is achieved, in the sense that the final agent contains the same quantum information as the considered final environment. This means that the agent has acquired the needed information about the environment and modified its own state accordingly, this being a quantum process. In our formalism, typically, one measurement at the end of the protocol is enough to obtain maximal learning fidelity in one iteration of the process. Several quantum architectures could benefit from this fact, given that coherent feedback is not needed in this case. Here, we focus our attention on two prominent platforms, namely, trapped ions and superconducting circuits.

Trapped ions
As we have pointed out throughout the manuscript, the performance of our proposed quantum protocols depends on the quality of the quantum gates between different subsystems. The realization of high-fidelity quantum gates is therefore essential to perform the quantum protocol proposed here. Technological progress in trapped ions has enabled the implementation of single-qubit [49] and two-qubit quantum gates [50] with high fidelity. For the single-qubit gate, e.g., a Beryllium hyperfine transition can be driven with microwave fields or lasers, with the error associated with single-qubit gates below $10^{-4}$. For two-qubit gates, the use of either microwaves or a laser beam with modulated amplitude allows for the interaction of both qubits (electronic levels of, e.g., Beryllium or Calcium ions) at the same time. Adiabatic elimination of the motion allows one to obtain maximally entangled states of both ions. The fidelity of trapped-ion two-qubit gates can nowadays exceed 99.9% [51,52]. Trapped-ion technologies offer long coherence times, which can reach up to the range of seconds [53] for Calcium ions. In addition, this platform enables state preparation and readout with high fidelity [39,54,55]. Here, the use of hyperfine states and microwave fields improves the optical pumping fidelity and the relaxation time $T_1$, allowing readout fidelities of 99.9999% [54].

Superconducting circuits
As in trapped ions, technological progress in superconducting circuits has been significant in recent years. For instance, artificial atoms whose coherence times are in the microsecond range have been built in coplanar [43] and 3D architectures [44]. On the other hand, integrated Josephson quantum processors allow one to implement quantum gates between two-level systems even in cases where the qubits do not have identical frequencies, as well as to make them interact via a quantum bus [56]. Xmon qubits achieve two-qubit gate fidelities above 99% [41,42]. This technological progress has enabled feedback-loop control in this platform. This feedback protocol relies on high-fidelity readout, as well as on control conditioned on the outcome of a quantum non-demolition measurement [45,46]. Even though coherent feedback is not required in the quantum reinforcement learning protocols of this paper, it may be a useful ingredient in other quantum reinforcement learning proposals [23].

Discussion
In summary, we propose a protocol to perform quantum reinforcement learning which does not require coherent feedback and, therefore, may be implemented in a variety of quantum technologies. Our learning protocol, being mostly unitary (except for the final register measurement), considers learning in a loose sense: while it does not depend on feedback, the protocol achieves its aim regardless of the initial state of agent and environment. In this respect, it is general, and achieves a goal similar to that of Ref. [23] without the need for feedback. We also point out that one may employ performance measures different from the one considered here, depending on the agent's possible aims. By adding more registers than in previous proposals in the literature [23], the rewarding criterion can be applied at the end of the protocol, while agent and environment need not be measured directly, but only via the registers. We also find that, when the considered systems are composed of qudits, the number of steps needed to obtain maximal learning fidelity is fixed for each qudit dimension and scales polynomially with the number of qudit subsystems. We also consider environments in mixed states, and show that the agent can acquire the appropriate information from them in this case as well. Theoretically, all the cases considered, namely qubit, multiqubit, qudit, and multiqudit, have many similarities. Even though the protocols are not directly transformable into one another, a $d$-dimensional qudit can be rewritten as a $\log_2(d)$ multiqubit system, while a multiqudit system with $n$ qudits is equivalent to an $n \log_2(d)$ multiqubit system. Therefore, in this respect, it is intuitive that the results for all these protocols (namely, that maximal fidelity can be attained) should be related. Nevertheless, it is valuable to show that the protocol can be scaled up to multiqudit systems with many parties and high dimensions, given that this will be an ultimate goal of a scalable quantum device. Implementations of these protocols in trapped ions and superconducting circuits seem feasible with current platforms.
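The qudit-to-qubit counting mentioned in the Discussion can be made concrete with a short sketch. The function names are hypothetical, and the use of a ceiling for dimensions that are not powers of two is an assumption added here: the text only states the $\log_2(d)$ and $n \log_2(d)$ counts.

```python
def qubits_per_qudit(d):
    """Number of qubits needed to embed one d-level system.

    Equals log2(d) when d is a power of two; otherwise the ceiling
    is used (an assumption beyond the text, which quotes log2(d)).
    """
    if d < 2:
        raise ValueError("a qudit needs at least two levels")
    # (d - 1).bit_length() is an exact integer version of ceil(log2(d)).
    return (d - 1).bit_length()

def qubits_for_multiqudit(n, d):
    """Number of qubits needed to embed n qudits of dimension d."""
    return n * qubits_per_qudit(d)

# Three ququarts (d = 4) map onto 3 * log2(4) = 6 qubits.
print(qubits_for_multiqudit(3, 4))
```

This only counts Hilbert-space dimensions. As noted above, the protocols themselves are not directly transformable into one another.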