Validating quantum-classical programming models with tensor network simulations

The exploration of hybrid quantum-classical algorithms and programming models on noisy near-term quantum hardware has begun. As hybrid programs scale towards classical intractability, validation and benchmarking are critical to understanding the utility of the hybrid computational model. In this paper, we demonstrate a newly developed quantum circuit simulator based on tensor network theory that enables intermediate-scale verification and validation of hybrid quantum-classical computing frameworks and programming models. We present our tensor-network quantum virtual machine (TNQVM) simulator which stores a multi-qubit wavefunction in a compressed (factorized) form as a matrix product state, thus enabling single-node simulations of larger qubit registers, as compared to brute-force state-vector simulators. Our simulator is designed to be extensible in both the tensor network form and the classical hardware used to run the simulation (multicore, GPU, distributed). The extensibility of the TNQVM simulator with respect to the simulation hardware type is achieved via a pluggable interface for different numerical backends (e.g., ITensor and ExaTENSOR numerical libraries). We demonstrate the utility of our TNQVM quantum circuit simulator through the verification of randomized quantum circuits and the variational quantum eigensolver algorithm, both expressed within the eXtreme-scale ACCelerator (XACC) programming model.


Introduction
Quantum computing is a computational paradigm that relies on the principles of quantum mechanics in order to process information. Recent advances in both algorithmic research, which has found remarkable speed-ups for a growing number of applications [1][2][3], and hardware development [4,5] continue to progress the field of quantum information processing. PLOS  The near-term state of quantum computing is defined by the noisy intermediate-scale quantum (NISQ) paradigm which involves small-scale noisy quantum processors [6] being used in a hybrid quantum-classical framework. In this context, recent experimental demonstrations [7][8][9][10][11] of hybrid computations have reinforced the need for robust programming models and classical validation frameworks. The successful integration of quantum processors into conventional computational workloads is a complex task which depends on the programming and execution models that define how quantum resources interact with conventional computing systems [12,13]. Many different models have been proposed for programming quantum computers and a number of software development efforts have begun focusing on high-level hybrid programming mechanisms capable of integrating both conventional and quantum computing processors together [14][15][16][17][18][19][20][21]. For example, recent efforts have focused on Python-based programming frameworks enabling the high-level expression of quantum programs in a classical context, which may target numerical simulators or a variety of physical quantum processing units (QPUs) [22][23][24]. The eXtreme-scale ACCelerator programming model (XACC) is a recently-developed quantum-classical programming, compilation, and execution framework that enables programming across multiple languages and QPU targets, including both virtual and physical QPUs [25].
In all cases, the verification of quantum program correctness is a challenging and complex task due to the intrinsically noisy nature of near-term QPUs, and this is additionally complicated by remote hosting. As a remedy, numerical simulation techniques can greatly expedite the analysis of quantum-classical programming efforts by providing direct insight into the prepared quantum states, as well as serving to test a variety of quantum computing hardware models. Modeling and simulation is essential for designing effective program execution mechanisms because it provides a controlled environment for understanding how complex computational systems interact, subsequently generating feedback based on the state machine statistics. For example, the performance of existing QPUs is limited by the hardware connectivity [4] and numerical simulations can draw on a broad range of parameterized models to test new processor layouts and architectures.
In practice, exact brute-force simulations of quantum computing are notoriously inefficient in memory complexity due to the exponential growth in resources with respect to the system size. These brute-force methods explicitly solve the Schrodinger equation, or a mixed-state master equation, using a full representation of the quantum state in its underlying (exponentially large) Hilbert space. Limits on available memory place upper bounds on the size of the vectors or density operators that can physically be stored, severely restricting the size of the simulated quantum circuit. Even with the availability of current large-scale HPC systems, including the state-of-the-art supercomputing systems, recent records for quantum circuit simulations are limited to less than 50 qubits [26,27]. The performance of the brute-force quantum circuit simulators on current supercomputing architectures is also limited by the inherently low arithmetic intensity (Flop/Byte ratio) of the underlying vector operations (sparse matrix-vector multiplications) required for simulating a discrete sequence of one-and two-qubit gates.
The inherent inefficiency of the brute-force state-vector quantum circuit simulators has motivated a search for approximate numerical simulation techniques increasing the upper bound on the number of simulated qubits. As we are interested in general-purpose (universal) quantum circuit simulators, we will omit efficient specialized simulation algorithms that target certain subclasses of quantum circuits, for example, quantum circuits composed of only Clifford operations [28]. As a general solution, we advocate for the use of tensor network (TN) theory as a tool for constructing factorized approximations to the exact multi-qubit wave-function tensor. The two main advantages offered by the tensor-network based wave-function factorization are (1) the memory (space) and time complexity of the quantum circuit simulation reflect the level of entanglement in the quantum system, (2) the numerical action of quantum gates on the factorized wave-function representation results in numerical operations (tensor contractions) which become arithmetically intensive for entangled systems, thus potentially delivering close to the peak utilization of modern HPC platforms.

Quantum circuit simulation with tensor networks
Tensor network theory [3,29] provides a versatile and modular approach to dimensionality reduction in high-dimensional tensor spaces. For the following discussion, a tensor is a generalization of a vector that is defined in a linear space constructed as the direct (tensor) product of two or more primitive vector spaces. Consequently, the components of a tensor T i 1 ...i n are enumerated by a tuple of indices, instead of by a single index as is the case for vectors. From the numerical perspective, a tensor can be viewed as a multi-dimensional array of objects, which may be real or complex numbers. In this work, following the physics nomenclature, we shall refer to the number of indices in a tensor T i 1 ...i n as its rank, which is n in this case (in math nomenclature n is the tensor order). Each index represents a distinct vector space contributing to the composite space defined by the tensor product. The extent of the range of each index gives the dimension of the corresponding vector space. In their essence, tensor networks aim at decomposing a higher-rank tensor into a contracted product of lower-rank tensors (tensor factors) such that this factorized product of lower-rank tensors reconstructs the original tensor with sufficient accuracy (i.e. a variant of lossy compression in linear spaces). In principle, any tensor can be approximated by a tensor network with arbitrary precision [30], however the size of the constituent tensor factors may become prohibitively large in worst case examples, showing that the chosen tensor network delivers a poor compression. Tensor factorizations, which we also refer to as decompositions, are not unique in general and the problem of finding the optimal tensor decomposition is a difficult non-convex optimization problem [31].
In practice, tensor network factorization is typically specified by a graph in which the nodes are the tensor factors and the edges represent physical or auxiliary vector spaces which are associated with the indices of the corresponding tensor factors. In this representation, a graph node with n edges is a rank-n tensor with each of its n indices uniquely associated with a specific edge. A closed edge, that is, an edge connecting two nodes, represents a contracted index shared by two tensor factors over which a summation is to be performed. In a standalone tensor network, contracted indices are associated with auxiliary vector spaces. An open edge, that is, an edge connected to only one node, represents an uncontracted index of that corresponding tensor factor. Uncontracted indices in a standalone tensor network are typically associated with physical vector spaces. Different tensor network architectures differ by the topology of their representative graphs. Furthermore, one can define even more general tensor network architectures by replacing graphs with hypergraphs, in which case an edge may connect three or more tensors, thus representing a summation over the repeated index in three or more tensors, respectively. In the subsequent discussion, however, we will only deal with conventional graph topologies.
A quantum many-body wave-function, including a multi-qubit wave-function, is essentially a high-rank tensor (its rank is normally equal to the number of simulated quantum particles, quasi-particles, or quantum spins) [29]. A number of different tensor network architectures have been suggested for the purpose of factorizing quantum many-body wave-functions, including the matrix-product state (MPS) [32,33], the projected entangled pair state (PEPS) [34,35], the tree tensor network state (TTNS) [36][37][38], the multiscale entanglement renormalization ansatz (MERA) [39,40], as well as somewhat related non-conventional schemes like the complete-graph tensor network (CGTN) [41]. All of the above tensor network ansaetze differ in the factorization topology, that is, in how the tensor factors are contracted with each other to form the final quantum many-body wave-function tensor. In a good tensor network factorization, the graph topology is induced by the entanglement structure of the quantum many-body state under study. Many physical systems are described by many-body Hamiltonians with only local interactions-in many cases, nearest neighbor only-with correlation functions decaying exponentially for non-critical states. In such cases, the locality structure of the many-body Hamiltonian induces the necessary topology required to properly capture the quantum correlations present in the system of interest. The factorization topology also strongly affects the computational cost associated with the numerical evaluation/optimization of a specific tensor network architecture. Another important characteristic of a tensor network is its so-called maximal bond dimension, χ, that is, the maximal dimension of auxiliary vector spaces (auxiliary vector spaces are those contracted over). Provided that the maximal bond dimension is bounded, many tensor network factorizations can be efficiently evaluated due to a moderate polynomial computational cost (in the bond dimension) associated with the computed physical expectation values. In practice, the entanglement structure of the underlying quantum many-body state determines the maximal bond dimension in a given tensor network for a given error tolerance. A poorly chosen tensor network topology will necessarily lead to rapidly increasing (exponentially at worst) bond dimensions in order to keep the factorization error within the tolerable error threshold [30].
The entanglement structure in a multi-qubit wave-function is determined by the quantum circuit and may be unknown in general. Consequently, there is no well-defined choice of a tensor network architecture (topology) that could work equally well for all quantum circuits, unless it is some kind of an adaptive topology. In practice, the choice of a tensor network architecture for representing a multi-qubit wave-function is often dictated by numerical convenience and ease of implementation. For example, one of the simplest tensor network architectures, the MPS ansatz illustrated in Fig 1, was used to simulate Shor's algorithm for integer factorization [42]. Although the inherently one-dimensional chain topology of the MPS ansatz often results in severely growing bond dimensions, and this can be remedied by a more judicious tensor network form [38], its computational convenience and well understood theory makes the MPS factorization an appealing first candidate for our quantum virtual machine (quantum circuit simulator). In future, we plan on adding more advanced tensor network architectures, however.
In order to simulate a general quantum circuit over an N-qubit register with the tensor network machinery the following steps will be necessary (see Fig 2): 1. Specify the chosen tensor network graph that factorizes the rank-N wave-function tensor into a contracted product of lower-rank tensors (factors). For example, one may choose the MPS factorization as done in Fig 2. 2. Transform the quantum circuit into an equivalent quantum circuit augmented with SWAP gates in order to maximize the number of accelerated gate applications (see below). This is an optional step. 3. Group quantum gates into larger aggregates (super-gates) which will act as a whole on the relevant part of the multi-qubit wave-function. In the simplest case, all elementary quantum gates will be distinguished individually, with no aggregation. The purpose of the aggregation step is to reduce the total number of gates in the quantum circuit and to increase the compute intensity associated with their action on the wave-function tensor network. This is an optional step.
4. Apply aggregated super-gates (or individual gates when no aggregation occurred) to the relevant parts of the wave-function tensor network, thus evolving towards the output state.
Multiple super-gates can be applied simultaneously. In general, the application of a supergate will affect the qubits associated with that super-gate as well as possibly other qubits affected indirectly because of the specific form of the tensor network. For example, in the MPS factorization case the application of a 2-body super-gate to qubits i and j may need to also update the MPS tensors associated with the internal qubits located between the qubits i and j.
In the above general algorithm, the action of the super-gates (or just individual gates) on a multi-qubit wave-function tensor network consists of the following steps: 1. Append a given set of super-gates (or just individual gates) to the input wave-function tensor network TN inp , thus obtaining a larger tensor network TN mid . An n-body super-gate, represented by a rank-2n tensor, is appended to the tensor network graph by pairing its n edges with the corresponding edges of the tensor network that are associated with the qubits the super-gate acts on. This is illustrated in the bottom part of Fig 2 where one or more super-gates are appended to the input tensor network depicted as a graph of circles on the left (followed by a closure by the output tensor network depicted as a graph of circles on the right).
2. If there are 2-or higher-body super-gates present, check whether they are applied to the qubit pairs or triples, etc. that allow accelerated gate application (for example, in MPS factorization, these would be the adjacent qubit pairs, triples, and so on). If yes, evaluate their action in an accelerated fashion (see below). Otherwise, resort to the general algorithm in the next steps.
3. Instantiate a new tensor network TN out by cloning TN inp and complex conjugating all constituent tensors.
4. Close TN mid with TN out by pairing all edges between the two tensor networks, thus obtaining a closed tensor network TN opt which does not have open edges (see the bottom part of Fig 2). TN opt evaluates to a scalar since all tensor indices have been contracted. It represents the inner product between the state obtained by the action of the super-gate(s) on the input tensor network and the state parameterized by the output tensor network. Thus, we can approximate the state obtained by the action of the super-gate(s) on the input tensor network by maximizing the obtained inner product while keeping the output tensor network normalized.
5. In the obtained inner product, contract away some or, if possible, all tensors that will not undergo any changes in value, thus reducing the total number of tensors in TN opt (see Fig  3). As mentioned above, depending on the tensor network architecture, only a subset of output tensors may need to be updated for a given super-gate(s) application. Note that such a simplification of TN opt is not always practical as the contraction of the unaffected tensors may yield unmanageably large higher-rank tensors, in which case it should be abandoned, either partially or fully.
6. Optimize the affected tensors of TN out to maximize the TN opt scalar, subject to keeping TN out normalized. For example, one can follow a general algorithm suggested in Ref. [40] where the optimization procedure is based on the evaluation of the gradients of TN opt with respect to each optimized tensor of TN out and using SVD for keeping the updated constituent tensor factors isometric, which should preserve their normalization as a byproduct (isometric tensors are composed of orthonormal columns when they are flattened in a matrix over specific tensor modes). Alternatively, one can add the output tensor network normalization condition (as well as other necessary conditions on tensor factors) to the inner product TN opt to get a constrained optimization problem, as shown in Fig 4. In practice, the optimized output tensors are updated one or two (if adjacent) at a time, based on the corresponding constrained gradients, and the full optimization epoch includes a sweep over all optimized tensors. The advantage of optimizing a pair of connected tensor factors at a time is that one can perform an SVD on the combined tensor to get the updated tensor factors.
7. If the maximum TN opt value is not acceptable after some predefined number of iterations, increase dimensions of the auxiliary spaces in the affected tensors of TN out and repeat Step 6.
In cases where an accelerated gate application is possible (for example, a 2-body gate is applied to the adjacent qubits in the MPS-factorized wave-function), one can restrict the update procedure only to the tensor factors directly affected by the gate action. In case of MPS factorization, in order to apply a 2-body gate to two adjacent qubits one can contract the gate tensor with the two MPS tensors representing the affected qubits and then perform a singular value decomposition (SVD) on the tensor-result, thus immediately obtaining new (updated) MPS tensors as illustrated in Fig 5. The above general algorithm demonstrates the procedure the TNQVM simulator is designed to use for approximate simulation of quantum circuits based on the tensor network factorizations. For the sake of completeness, we should also mention quantum circuit simulators which use tensor representations for a brute-force simulation of quantum circuits with no approximations [27,43]. This is different from our approach which is based on the explicit factorization of the multi-qubit wave-function tensor. In these other tensor-based schemes the entire quantum circuit as a collection of gate tensors is considered as a tensor network which is subsequently contracted over in order to compute observables or evaluate output probability distributions. In Ref. [27], a clever tensor slicing technique was introduced that avoided the evaluation of the full wave-function tensor, thus reducing the memory footprint and bypassing the existing 45-qubit limit on large-scale HPC systems. Yet, despite enabling simulations of

Quantum virtual machines
In order to evaluate the correctness of a quantum program and its implementation via a decomposition into primitive gate operations, it is necessary to model both the conventional computing and quantum computing elements of the system architecture. In particular, it is necessary to expose the interface to the available instruction set architecture (ISA) and methods to support quantum program execution, scheduling, and layout. There are currently many different technologies available for testing and evaluating quantum processing units, and each of these technologies presents different ISAs and methods for program execution [44].
As shown in Fig 6, a quantum virtual machine (QVM) provides a portable abstraction of technology-specific details for a broad variety of heterogeneous quantum-classical computing architectures. The hardware abstraction layer (HAL) defines a portable interface by which the underlying quantum processor technology as well as other hardware components such as memory are exposed to system libraries, runtimes and drivers running on the host conventional computer. The implementation of the HAL provides an explicit translation of quantum program instructions into native, hardware-specific syntax, which may be subsequently executed by the underlying quantum processor. The HAL serves as a convenience to ensure portability of programs across different QPU platforms, while the QVM encapsulates the environment in which applications can be developed independently from explicit knowledge of QPU details. This environment is provided by the integration of the HAL with programming tools, libraries, and frameworks as well as the host operating system.
Application performance within a QVM depends strongly on the efficiency with which host programs are translated into hardware-specific instructions. This includes the communication overhead between the HAL and hardware layers as well as the overhead costs for managing these interactions by the host operating system. Both algorithmic and hardware designs impact this performance by deciding when and how to allocate computational burden to specific devices. Presently, there is an emphasis on the development and validation of hybrid programs, which loosely integrates quantum processing with conventional post-processing tasks. This algorithmic design introduces a requirement for transferring memory buffers between the host and QPU systems. Memory management therefore becomes an important task for application behavior. While current QPUs are often accessed remotely through network interfaces, long-term improvements in application performance will require fine grain control over memory management.

Tensor network quantum virtual machine
Our implementation of a QVM presented in this work is based on a previously developed hybrid quantum-classical programming framework, called XACC [25], combined with a quantum circuit simulator that uses tensor network theory for compressing the multi-qubit wavefunction. We provide an overview of the Tensor Network Quantum Virtual Machine (TNQVM) and its applications, including its software architecture and integration with the XACC programming framework. Since XACC integrates directly with TNQVM, compiled programs can in principle be verified instantaneously on any classical computer including workstations as well as HPC clusters and supercomputers. The support of different classical computer architectures (single-core, multi-core, GPU, distributed) for performing numerical simulations is achieved by interchangeability of the numerical backends in our TNQVM simulator. These backends are numerical tensor algebra libraries which perform all underlying tensor computations on a supported classical computer. In this work, we detail the HAL implementation of TNQVM using ITensor [45] for serial simulations, with some example applications demonstrating the utility of TNQVM. We also sketch some details on the upcoming ExaTENSOR backend that will enable large-scale quantum circuit simulations on distributed homo-and heterogeneous HPC systems. Independent verification of hybrid programs within TNQVM provides an increased confidence in the use of these codes to characterize and validate actual QPUs.

XACC
The eXtreme-scale ACCelerator programming model (XACC) has been specifically designed for enabling near-term quantum acceleration within existing classical high-performance Fig 6. A schematic design how a quantum virtual machine (QVM) manages access to an underlying QPU through the hardware abstraction layer. A program binary exists within an application framework that accesses system resources through libraries. Library calls are managed by the host operating system, which manages and schedules requests to access hardware devices including attached QPUs. The hardware abstraction layer (HAL) provides a portable interface by which these requests are made to the underlying QPU technology. computing applications and workflows [25,46]. This programming model and associated open-source reference implementation follow the traditional co-processor model, akin to OpenCL or CUDA for GPUs, but takes into account the subtleties and complexities arising from the interplay between classical and quantum hardware. XACC provides a high-level application programming interface (API) that enables classical applications to offload quantum programs (represented as quantum kernels, similar in structure to GPU kernels) to an attached quantum accelerator in a manner that is agnostic to both the quantum programming language and the quantum hardware. Hardware agnosticism enables quantum code portability and also aids in benchmarking, verification and validation, and performance studies for a wide array of virtual (simulators) and physical quantum platforms.
To achieve language and hardware interoperability, XACC defines three important abstractions: the quantum intermediate representation (IR), compilers, and accelerators. XACC compiler implementations map quantum source code to the IR-the in-memory object key to integrating a diverse set of languages to a diverse set of hardware. IR instances (and therefore compiled kernels) are executed by sub-types of the accelerator, which defines an interface for injecting physical or virtual quantum hardware. Accelerators take this IR as input and delegate execution to vendor-supplied APIs for the QPU, or an associated API for a simulator. This forms the hardware abstraction layer, or abstract device driver, necessary for a general quantum (virtual) machine.
The IR itself can be further decomposed into instruction and function abstractions, with instructions forming the foundation of the IR infrastructure and functions serving as compositions of instructions (see Fig 7). Each instruction exposes a unique name and the set of qubits that it operates on. Functions are a sub-type of the instruction abstraction that can contain further instructions. This setup, the familiar composite design pattern [47], forms an n-ary tree of Validating quantum-classical programming models with tensor network simulations instructions where function instances serve as nodes and concrete instruction instances serve as leaves.
Operating on this tree and executing program instructions is a simple pre-order traversal on the IR tree. In order to enhance this tree of instructions with additional functionality, XACC provides a dynamic double-dispatch mechanism, specifically an implementation of the familiar visitor pattern [48]. The visitor pattern provides a mechanism for adding virtual functions to a hierarchy of common data structures dynamically, at runtime, and without modifying the underlying type. This is accomplished via the introduction of a visitor type that exposes a public set of visit functions, each one taking a single argument that is a concrete sub-type of the hierarchical data structure composition (see Fig 7). For gate model quantum computing, XACC exposes a visitor class that exposes a visit method for all concrete gate instructions (X, H, RZ, CX, etc. . .). All instructions expose an accept method that takes as input a general visitor instance, and invokes the appropriate visit method on the visitor through double-dispatch. XACC instruction visitors thereby provide an extensible mechanism for dynamically operating on, analyzing, and transforming compiled IR instances at runtime.

Tensor network accelerator and instruction visitors
The integration of a tensor network quantum circuit simulator with XACC can be accomplished through extensions of appropriate XACC components. In essence, this is an extension of the quantum virtual machine hardware abstraction layer that enables existing high-level programs and libraries to target a new virtual hardware instance. Injecting new simulators into the XACC framework requires a new implementation of the accelerator. Enabling that simulator to be extensible in the type of tensor networks, algorithmic execution, and the library backend requires different mappings of the IR to appropriate simulation data structures. This can be accomplished through individual implementations of the instruction visitor.
Our open-source implementation of the Tensor Network Quantum Virtual Machine (TNQVM) library extends the XACC accelerator with a new derived class that simulates purestate quantum computation via tensor network theory [49]. This library provides the TNAccelerator (Tensor Network Accelerator) that exposes an execute method that takes as input the XACC IR function to be executed. Generality in the tensor network graph structure and the simulation algorithm is enabled through appropriate implementations of the instruction visitor. For example, an instruction visitor can be implemented to map the incoming XACC IR tree to tensor operations on a matrix product state (MPS) ansatz. Walking the IR tree via pre-order traversal and invoking the instruction visitor accept mechanism on each instruction triggers invocation of the appropriate visit function via double dispatch. The implementation of these visit methods provides an extensible mechanism for performing instructionspecific tensor operations on a specific tensor network graph structure.
Furthermore, this visitor extension mechanism can be leveraged to not only provide new tensor network structures and operations, but also provide the means to leverage different tensor algebra backend libraries, and therefore introduce a classical parallel execution context. Different visitor implementations may provide a strictly serial simulation approach, while others can enable a massively parallel or heterogeneous simulation approach (incorporating the Message Passing Interface, OpenMP, and/or GPU acceleration via CUDA or OpenCL).
To date we have implemented two instruction visitor backends for the TNQVM and the TNAccelerator. We have leveraged the ITensor library [45] to provide a serial matrix product state simulator, and the ExaTENSOR library from the Oak Ridge Leadership Computing Facility (OLCF) to provide a matrix product state simulator that leverages MPI, OpenMP and CUDA for distributed parallel execution on GPU-accelerated heterogeneous HPC platforms. However, the ExaTENSOR visitor is not fully available yet since the ExaTENSOR library is currently undergoing final testing before its public release and the implementation of the generic tensor optimization procedure is still in progress. Thus, it has not been utilized yet as a fully functional backend of TNQVM. Nevertheless, we will provide some details on the Exa-TENSOR backend below in order to highlight its design and our future plans.

ITensor MPS implementation.
The ITensor MPS instruction visitor implementation provides a mechanism for the simulation of an N-qubit wavefunction via a matrix product state tensor network decomposition. The MPS provides a way to restrict the entanglement entropy through SVD and associated truncation of Schmidt coefficients to reduce the overall Schmidt rank. With these MPS states, we need O(nχ 2 ) numbers to represent n qubits, where χ is the largest Schmidt rank we keep. As long as χ is not too large (grows polynomially with system size), the space complexity is feasible for classical simulation. For example, if the quantum register is used to store the gapped ground states of systems with local interactions, we can simulate larger number of qubits and still adequately approximate the wavefunction by keeping χ small enough.
Our ITensor MPS visitor implementation begins by initializing a matrix product state tensor network using the serial tensor data structures provided by the ITensor library [45]. Simulation of the compiled IR program is run through a pre-order tree traversal of the instruction tree. At each leaf of this tree (a concrete instruction), the accept method on the instruction is invoked (see Fig 7) which dispatches a call to the correct visit method of the instruction visitor.
At this point, the appropriate gate tensor is contracted into the MPS representation, which maps onto itself under local quantum gates. Updating the MPS according to two-body entanglers involves two-qubit gates which act on two rank-3 tensors, and the full contraction results in a rank-4 tensor (see Fig 5). We maintain the MPS structure by decomposing the rank-4 tensor into two rank-3 tensors and a diagonal matrix between them. Note that when the two qubits are not adjacent we apply SWAP gates on intermediary qubits to bring them together (see Fig 2). The gate is then applied and reverse SWAPs bring the qubits back to the original positions. Otherwise, applying a gate to non-adjacent qubits would require using the general tensor network optimization algorithm described in the previous section, which we do not do yet.
The SVD is used to return the resulting rank-4 tensor to the canonical MPS form (n rank-3 tensors and n − 1 diagonal matrices), with the singular values below a cutoff threshold � (e.g., default is � = 10 −4 ) being truncated. The truncation over subspaces supporting exponentially small components of the wave-function allows our MPS-based TNQVM to simulate large numbers of qubits, contingent on the level of entanglement in the system. Examples and discussion may be found in the demonstrations in Sec. 5.

ExaTENSOR MPS implementation.
The ExaTENSOR numerical tensor algebra backend will enable larger-scale TNQVM quantum circuit simulations on distributed, GPUenabled and other accelerated as well as conventional multicore HPC platforms. ExaTENSOR stores tensors in distributed memory (on multiple/many nodes) as a generally sparse collection of tensor slices in a hierarchical fashion, that is, a tensor is defined recursively as a collection of constituent tensors (the hierarchical storage is the key when dealing with heterogeneous HPC architectures). Distributed tensor storage lifts the memory limitations pertinent to a single node, thus extending the maximal number of simulated qubits. Although we currently target the (distributed) MPS implementation, ExaTENSOR already provides a generic tensor network builder that can be used for constructing an arbitrary tensor network in future. The Exa-TENSOR MPS visitor implementation will provide a constructor that creates the MPS representation of the simulated multi-qubit wave-function with all constituent MPS tensors being distributed now. Then the XACC IR tree traversal will invoke the ExaTENSOR MPS visit method for each traversed node (instruction). The visit method implements lazy visiting, namely it only caches the corresponding instruction (gate) in the instruction cache of the ExaTENSOR MPS visitor. At some point, once the instruction cache has enough work to perform, the evaluate method of the ExaTENSOR visitor will be invoked, implementing the generic gate action algorithm described in Section II. Specifically, it will allocate the output MPS tensor network, that is, the result of the action of the cached gates on the input MPS tensor network. Then it will create the closed (inner product) tensor network by joining the gate tensors to the input MPS tensor network, subsequently closing it with the output tensor network (see Fig 2). This closed tensor network is a scalar whose value needs to be maximized, subject to normalization condition on the output tensor network. The ExaTENSOR MPS visitor will utilize the standard gradient descent algorithm by evaluating the gradients with respect to each tensor constituting the output tensor network. Each of these gradients is itself an open tensor network that needs to be fully contracted into a single tensor. Importantly, the computational cost of this contraction of many tensors strongly depends on the order in which the pairwise tensor contractions are performed. Finding the optimal tensor contraction sequence is an NP-hard problem. Instead, ExaTENSOR implements a heuristic algorithm that delivers a pseudo-optimal sequence of pairwise tensor contractions in a reasonable amount of time (subseconds). Then this pseudo-optimal sequence of pairwise tensor contractions is cached for a subsequent reuse, if needed. Given a sequence of pairwise tensor contractions, the ExaTEN-SOR library numerically evaluates all of them and returns the gradients that will subsequently be used for updating the output tensor network tensors, until the optimized inner product scalar reaches the desired threshold. In case it does not reach the desired value, the tensors constituting the output tensor network will be reallocated with increased dimensions of the auxiliary spaces and the entire procedure is to be repeated. As of now, an early prototype implementation of the ExaTENSOR MPS visitor in TNQVM is based on the single-node version of the ExaTENSOR library and we are currently finishing the integration of TNQVM with the fully distributed, GPU-accelerated version of the ExaTENSOR library as well as performing the final testing of the ExaTENSOR library itself before its public release later this year. Also, we plan to have a full support for the generic tensor optimization procedure described in

Demonstration
Here we demonstrate the utility of TNQVM by describing the overall memory scaling of our matrix product state TNQVM based on the ITensor visitor for varying levels of entanglement and system size. Our demonstrations show how TNQVM can be leveraged for validating hybrid quantum-classical programming models. Specifically we focus on random circuit simulations and the variational quantum eigensolver (VQE) hybrid algorithm.

Profiling random circuit simulations with MPS
We demonstrate the improved resource cost of representing quantum states (O(nχ 2 ) vs O(2 n )) with TNQVM by using an MPS formulation and by profiling the memory usage of simulating randomly generated circuits. We vary the entanglement structure of our random circuits by constructing time slices defined as rounds. The first round begins with a layer of Hadamard operations on all qubits, followed by a layer of single qubit gates (Pauli gates and other general rotations), followed by a set of nearest-neighbor CNOT entangling operations. Multiple rounds constitute multiple iterations of generating these layers (excluding the Hadamards, which only appear in the first layer). Clearly, later rounds add layers of entangling CNOT operations and therefore generate states with a more complicated entanglement structure.
We generate these random circuits for 5 through 85 qubits in increments of 5, and for numbers of rounds ranging from 2 through 10 in increments of 2. For each (round, n − qubits) pair, we generate 10 random circuits, compute the heap memory usage, and compute the mean and standard deviation of the memory usage. The results are plotted in Fig 8. For lightly-entangled systems (i.e. those generated by a small number of random rounds) we see that the MPS structure is able to encode the wavefunction of the system efficiently with a small cost. For example, for only two rounds the maximum bond dimension is χ = 4, which is independent of system size. As we increase the entanglement in our random circuits, the computational cost of the MPS simulations increases exponentially. This is because the circuits we have sampled from are designed to exponentially increase the entanglement which saturates at χ max = 2 n/2 for an n-qubit system undergoing m > n random rounds [50].

Variational quantum eigensolver
Finally, we demonstrate the utility of our tensor network simulation XACC Accelerator backend (the TNQVM library) in validating quantum-classical algorithms. It is this rapid feedback mechanism that is critical to understanding intended algorithmic results, and enables confidence in the programming of larger systems. Here we demonstrate this programmability and its verification and validation through a simple simulation of diatomic hydrogen via the variational quantum eigensolver algorithm. The quantum-classical program leveraging the TNQVM library is shown in the listing in Fig 9. This code listing demonstrates the integration of XACC and our tensor network accelerator implementation. The code shows how to Validating quantum-classical programming models with tensor network simulations program, compile, and execute the VQE algorithm to compute expectation values for the simplified (symmetry-reduced), two qubit H 2 Hamiltonian (see [51]). We start off by defining the quantum source code as XACC quantum kernels (note-we have left out a few measurement kernels for brevity). Each of these kernels is parameterized by a single double representing the variational parameter for the problem ansatz circuit (the ansatz kernel in the h2_src string). Integration with the TNQVM simulation library is done through a public XACC API Validating quantum-classical programming models with tensor network simulations function (getAccelerator). This accelerator reference is used to compile the program and get reference to executable kernels that delegate work to the TN Accelerator. We then loop over all θ and compute the expectation values for each Hamiltonian measurement term. Notice that this execution mechanism is agnostic to the accelerator sub-type. This provides a way to quickly swap between validation and verification with TNQVM, and physical hardware execution on quantum computers from IBM, Rigetti, etc.

Conclusion
In this work we have discussed the concept of a general quantum virtual machine and introduced a concrete implementation of the QVM that enables quantum-classical programming with validation through an extensible tensor network quantum circuit simulator (TNQVM). We have discussed the applicability and scalability of a matrix product state backend implementation for TNQVM and discussed the role of TNQVM in benchmarking quantum algorithms and hybrid quantum-classical applications including random circuit sequences used in quantum supremacy [50] and the variational quantum eigensolver [7]. We have chosen a tensor network based quantum virtual machine due to the complexity reduction such a formalism provides for a broad range of problems. In general TNQVM enables large-scale simulation of quantum circuits which generate states characterized by short-range entanglement. Studying systems with long-range entanglement interactions will require further developments in implementing more advanced tensor network decomposition types. We plan to investigate the applicability of the tree tensor network and the multiscale entanglement renormalization ansatz in future work, in an effort to scale simulation capabilities to a larger number of qubits. We will also finish the deployment of the massively parallel ExaTENSOR visitor backend in order to exploit large-scale HPC platforms in the future.