Nonnegative/Binary matrix factorization with a D-Wave quantum annealer

D-Wave quantum annealers represent a novel computational architecture and have attracted significant interest. Much of this interest has focused on the quantum behavior of D-Wave machines, and there have been few practical algorithms that use the D-Wave. Machine learning has been identified as an area where quantum annealing may be useful. Here, we show that the D-Wave 2X can be effectively used as part of an unsupervised machine learning method. This method takes a matrix as input and produces two low-rank matrices as output—one containing latent features in the data and another matrix describing how the features can be combined to approximately reproduce the input matrix. Despite the limited number of bits in the D-Wave hardware, this method is capable of handling a large input matrix. The D-Wave only limits the rank of the two output matrices. We apply this method to learn the features from a set of facial images and compare the performance of the D-Wave to two classical tools. This method is able to learn facial features and accurately reproduce the set of facial images. The performance of the D-Wave shows some promise, but has some limitations. It outperforms the two classical codes in a benchmark when only a short amount of computational time is allowed (200-20,000 microseconds), but these results suggest heuristics that would likely outperform the D-Wave in this benchmark.


Introduction
Single-core computational performance relentlessly improved for decades, but recently that progress has begun to slow [7].As a result, alternative computational architectures have sprung up including multi-core processors [8], graphic processing units [9], neuromorphic computing [10], and application-specific integrated circuits to name a few.Here we explore the use of another new architecture: quantum annealing [11].In particular, we utilize the form of quantum annealing realized with D-Wave hardware [1,3].We focus on a machine learning problem based on matrix factorizations, and describe an algorithm for computing these matrix factorizations that leverages D-Wave hardware.We apply the algorithm to learn features in a set of facial images.
There is an ongoing back-and-forth regarding whether or not D-Wave's hardware provides performance benefits over classical single-core computing [12,13,14,15].Here, we benchmark the performance of two state-of-theart classical approaches against the performance of the D-Wave.Much of the debate about the D-Wave's performance has centered on problems that are custom-tailored to fit D-Wave's hardware.A component of the matrix factorization problems that we study here can be solved on the D-Wave, but is not customized for the D-Wave and represents a real rather than synthetic problem.The D-Wave outperforms the two classical approaches in a benchmark, but the performance of the two classical approaches suggests a heuristic that could be implemented on a classical computer to outperform the D-Wave in this benchmark.This mixed result on the performance of the D-Wave compared to classical tools is in agreement with the most recent results [15] showing that, even for custom-tailored problems, the D-Wave does not outperform the best classical heuristics.Despite this, our results show that the D-Wave can outperform very good classical tools when only a short amount of computational time is allotted to solve these real-world problems.We also provide a discussion of how future improvements to the algorithm presented here and D-Wave's hardware could improve performance for these matrix factorization problems.
The remainder of this manuscript is organized as follows.Section 2 describes the methods used to solve the matrix factorization problems and perform the benchmarking.Section 3 describes the results we obtained in solving a matrix factorization problem for a set of facial images and the benchmark results.Finally, section 4 discusses the results and indicates how future developments might improve the performance of the D-Wave for this problem.

Methods
We seek to represent an n × m matrix, V , as the product of two matrices, W and H, where W is an n × k matrix and H is a k × m matrix.That is, we wish to find W and H such that We impose constraints on W and H.In particular, the components of W must be nonnegative (i.e., W ij ≥ 0) and the components of H must be binary (i.e., H ij ∈ {0, 1}).Since W is a nonnegative matrix and H is a binary matrix, we describe this matrix factorization as Nonnegative/Binary Matrix Factorization (NBMF).This is in contrast to Nonnegative Matrix Factorization (NMF) [16] where H is allowed to take on any nonnegative value, not just 0 or 1.To satisfy equation 1, we utilize an alternating least squares algorithm [17] (see Algorithm 1).
where || • || F is the Frobenius norm.Note that equation 2 can be solved by solving a set of independent optimization problems (one for each column of H).This is because the variables in the i th column of H impact only the i th column of W X, and the variables outside the i th column of H do not impact the i th column of W X. That is, if we denote the i th columns of H and V by H i and V i , respectively, then for i = 1, 2, . . ., m.This means that we can solve for H by solving a series of linear least squares problems in binary variables.This type of problem can be readily solved on D-Wave hardware [18], as long as the number of binary variables is small.Equation 2 involves km binary variables, but equation 3 involves only k binary variables.Since the D-Wave 2X imposes severe limitations on the number of binary variables that can be dealt with, this reduction in the number of variables is crucial.In practice, this means that relatively large datasets can be analyzed (i.e., n and m can be large).Since the number of variables the D-Wave works with at any given time is determined only by k, the D-Wave imposes restrictions only on the number of features (i.e., k).
We also compare the performance of the D-Wave to solve equation 3 with two classical approaches.One, called qbsolv [19], utilizes is an efficient, opensource implementation of tabu search [20,21].The other utilizes the JuMP [22] modeling language and a state-of-the-art mathematical programming tool called Gurobi [23].The performance is compared using a cumulative time-totargets benchmark that is a variation of the time-to-target benchmark [24].In the course of executing algorithm 1, equation 3 must be solved many times for different values of i and W .Each time this equation is solved, the D-Wave is given a fixed number of annealing cycles to minimize ||V i − W q|| 2 (we look at examples where the number is fixed at 10, 10 2 , 10 3 , and 10 4 ).Each annealing cycle results in one approximate solution to equation 3, and we denote the best of these approximate solutions by H * i .The best solution, is used to generate a target value for the objective function, , that the classical approaches (qbsolv and Gurobi) must match.The cumulative time-totargets benchmark computes the cumulative amount of time it takes for qbsolv or Gurobi to find a solution that is at least as good (in terms of as the best solution found by the D-Wave for each instance of equation 3 that is encountered in executing algorithm 1.This cumulative time-to-targets is then compared with the amount of annealing time used by the D-Wave.If qbsolv or Gurobi take more than 10 minutes to reach an individual target, the time to reach that target is set to 10 minutes.The 10 minute limit is an expedient that enables the analysis to be run in a reasonable amount of time.Gurobi never reached the 10 minute limit, but qbsolv did in a number of cases.

Programming the D-Wave
D-Wave quantum annealers deal natively with quadratic, unconstrained, binary optimization (QUBO) problems [25].These problems are associated with objective functions that have the form where q = (q 1 , q 2 , . . ., q n ).One might think of a 0 th order approximation of the D-Wave's behavior as being that each anneal returns a vector, q, so that f (q) is minimized.A 1 st order approximation of the D-Wave's behavior is that each anneal returns a sample, q, from a Boltzmann distribution where f (q) is the energy.Both of these approximations are inexact, but highlight the basic behavior of the D-Wave: each annealing cycle returns a sample, q, which tends to make f (q) small.Equation 3 can be readily put into the form of a QUBO by setting [18] Having reformulated equation 3 in this way, the quadratic coefficients in the QUBO, b ij , are generally all nonzero.However, the D-Wave's hardware imposes sparsity constraints on the b ij .These constraints can be overcome via embedding [26,27], where multiple physical qubits are used to represent a single binary variable.Since the b ij 's in our problems are never exactly zero, a complete graph with the number of nodes equal to the number of binary variables must be embedded in the graph imposed by the D-Wave 2X chip.If a D-Wave 2X chip had no defects (i.e., if all the qubits and couplers were available for use), the maximum number of binary variables that could be used for these problems is 49 (if the matrix formed by the elements b ij had some natural sparsity to it, this number would increase).However, some of the qubits and couplers on these chips are not available for use, and as the number of binary variables increases, the number of physical qubits required to represent each variable increases.Since using a larger number of physical qubits to represent a single binary variable is associated with poor performance, we limit our study to 35 binary variables.In this case, we found an embedding where each binary variable is represented by at most 19 physical qubits.

Results
We analyzed the same set of 2,429 facial images that was previously analyzed to learn the parts of faces using nonnegative matrix factorization [16].Figure 1 shows the features that were learned using algorithm 1 with 10,000 anneals per solve of equation 3 and how those features are used to reconstruct the image of a face.Some of the features in figure 1 may appear to be all black, but they actually contain subtle features such as a bright spot in the lower-left corner or a shiny cheek/nose (see figure 2).Unlike NMF where the parts of faces are learned [16], the features learned by NBMF are holistic.One can view NMBF as being a method that is somewhere in between the NMF and vector quantization methods considered in [16].Like NMF, it imposes the nonnegativity constraints on W , but unlike NMF imposes binary constraints on H. Vector quantization imposes binary constraints on H, but adds an additional constraint that each column of H can only contain one nonzero entry.This additional constraint causes vector quantization to learn holistic, prototypical faces.NBMF appears to learn features that are holistic like the features that come out of vector quantization, but, unlike vector quantization, NBMF's features are not necessarily prototypical faces.Many of them appear ghostly and the ones that are mostly black are even more subtle.As in figure 1, these subtle and ghostly features can be combined to reproduce a face.The NBMF method employed here has some advantages and disadvantages compared to NMF.One advantage of NBMF is that H is more sparse when NBMF is used than when NMF is used.Analyzing this database of facial images using 35 features (k = 35), the H produced by NBMF is approximately 85% sparse (i.e., 85% of the elements of H are zero) whereas the one produced by NMF is approximately 13% sparse.Further, the storage requirements for each component of H are less for NBMF (1 bit) than NMF (e.g., 64 bit floating point numbers were used here).A disadvantage is that ||V − W H|| F is larger for NBMF than NMF.For this database of facial images, ||V − W H|| F using NMF was about 46% of this norm when using NBMF.In layman's terms, NMF had about half as much error as NBMF.NMF has the additional advantage that W is about 41% sparse, whereas W is dense for NBMF in the images analyzed here.
Figure 3 shows the results of the cumulative time-to-targets benchmark.The cumulative time-to-targets for qbsolv always exceeds the cumulative annealing time by a factor of 20-50 depending on the number of anneals.In the test with 10, 100, and 1,000 annealing cycles, Gurobi's cumulative time-to-targets exceeds the cumulative annealing time by factors of about 61, 7 and 1.2, respectively.In the test with 10,000 anneals, Gurobi's cumulative time-to-targets was less than the cumulative annealing time by a factor of about 6.4.
Gurobi and qbsolv show different performance trends in this benchmark that we explore in more detail with figure 4. Qbsolv's performance is characterized by frequently reaching the target before the annealing time for individual problems, but, when it fails to reach the target before the annealing time, it can take a comparatively long time to reach the target.These problems where qbsolv takes a very long time to reach the target make up a large portion of the cumulative time-to-targets.As the D-Wave takes more and more samples, the fraction of the problems where the time-to-target exceeds the annealing time increases, as can be seen from the increasing number of red dots above the orange line in figure 4. When 10 anneals are used, qbsolv's time-to-target exceeds the annealing time Figure 1: Face image reconstruction using features learned by NBMF.The fiveby-seven matrix of images on the right shows the features that were learned.The two images on the left show the original image (top) and the reconstruction (bottom).The reconstruction is obtained by summing the features that are boxed in green.Note that although some of the features appear to be all black, they actually contain facial features are small in magnitude (black corresponds to 0, white corresponds to 1).3: Cumulatve time-to-targets for qbsolv (red) and Gurobi (blue) as a function of the number of annealing cycles executed by the D-Wave.When the number of anneals is 10, 100, or 1,000, the cumulative time-to-targets for both qbsolv and Gurobi exceeds the cumulative annealing time (orange).When the number of anneals is 10,000, the cumulative time-to-targets for qbsolv exceeds the cumulative annealing time, but Gurobi's cumulative time-to-target is less than the cumulative annealing time.Note that in the 10, 100, and 1,000 anneal cases, 24,290 QUBOs were solved whereas only 19,432 QUBOs were solved in the 10,000 anneal cases.This was caused by earlier termination of the NBMF (algorithm 1).less than 1% of the time, whereas when 10,000 anneals are used, qbsolv's timeto-target exceeds the annealing time in more than 28% of the problems.Gurobi rarely has problems for which it takes a very long time to reach the target set by the D-Wave.In the more than 90,000 problems considered here, Gurobi took more than a second to reach the target set by D-Wave only 21 times with the maximum time being 13.6 seconds.By contrast, qbsolv took more than a second to reach the target 5,509 times and hit the 10 minute maximum in 24 cases.While Gurobi rarely takes a long time to reach the target set by the D-Wave, it also rarely solves the problems very quickly.Gurobi reached the target set by the D-Wave in under a millisecond in only 57 cases, whereas qbsolv did this in almost 80,000 cases.In short, Gurobi's performance is characterized by consistency and qbsolv's performance is characterized by solving many problems very quickly and some problems slowly.

Discussion
We have identified a performance regime (fast solutions that are good, but not necessarily optimal) where the D-Wave, at least by the benchmark used here, outperforms two state-of-the-art classical codes in a problem that is not custom-tailored to the D-Wave as other problems have been [2,28].This was demonstrated by the cases we studied with 10, 100, and 1,000 anneals where the cumulative time-to-targets for both Gurobi and qbsolv exceeded the annealing time used by the D-Wave.We emphasize that these results do not demonstrate any sort of quantum supremacy.In fact, they suggest a classical heuristic which would likely outperform the D-Wave: run qbsolv for a short period and if it fails to match the target, then switch to Gurobi.This would leverage qbsolv's ability to frequently outperform the D-Wave, and Gurobi's ability to never lose too badly to the D-Wave.Other benchmarks could also be used that might cast the D-Wave in a more negative light.For example, allowing the D-Wave to set the targets in the cumulative time-to-targets benchmark might be an advantage, and reversing the roles (i.e., allowing Gurobi or qbsolv to set the targets in a fixed amount of computational time) might produce different results.
Given the remarkable performance improvements over many generations of classical microprocessors [7] and the impressive algorithmic improvements in mixed-integer programming tools like Gurobi [29] over the past several decades, it is surprising that D-Wave's third generation hardware and our straightforward algorithm can be competitive at all.In the series of four chips that D-Wave has released, the number of qubits has approximately doubled from one generation to the next while the number of couplers per qubit has remained essentially unchanged.D-Wave's fifth generation chip is expected to at least double the number of couplers per qubit [30,3].If this comes to fruition, it would likely have a significant, positive impact on the performance of the D-Wave for the problems we consider here.If the number of binary variables were fixed at 35, it would result in fewer physical qubits being used to represent each binary variable.This means that, if the size of the problem is fixed, we would expect much better performance from D-Wave's fifth generation chip.Increasing the number of couplers per qubit would also enable the possibility of going beyond the 35 binary variables considered here without using an excessive number of physical qubits to represent each binary variable.That is, we would expect D-Wave's fifth generation chip to be capable of solving NBMF problems with more features than can be solved with the third generation chip that we have used here.In addition to the improvements that are anticipated in future D-Wave hardware, there are techniques that could be utilized to potentially improve our algorithm.Symmetries could be exploited (e.g., via spin reversal transformations or symmetry in the complete graph that we embed in D-Wave's hardware graph), the strength of chains arising from the embedding process could be optimized, hardware biases could be learned [31] and exploited, and the embeddings could be improved by setting quadratic coefficients in the QUBO that are approximately zero to exactly zero.Exploring and exploiting these techniques is beyond the scope of this manuscript, but we expect that some or all of them could provide significant algorithmic improvements.
The relatively short computational time (from 200µs up to 20, 000µs) performance regime where the D-Wave outperforms qbsolv and Gurobi in our benchmark could be important for big data problems.For example, when learning the features of a large set of images (much larger than the 2,429 images considered here), only a small amount of computational time may be available to solve each problem given in equation 3.However, in order for the D-Wave's performance advantages in this regime to be beneficially leveraged, there must be significant performance improvements in the time it takes to get problems into and solutions out of the D-Wave.At present, input and output is performed via the HTTPS internet protocol.The performance of this bottleneck can clearly be improved.Beyond this, there are other potential bottlenecks that could prevent the D-Wave from being more performant for these types of problems, such as the D-Wave's programming time ("qpu programming time", as reported by D-Wave's software) which was typically is about 15 milliseconds in the problems analyzed here.If this programming time remains constant as new D-Wave chips become available, then there would not be much advantage to solving problems where the total annealing time is much less than 15 milliseconds.This could hinder the D-Wave's performance in the short computational time regime we have identified where it outperforms Gurobi and qbsolv in the cumulative time-to-targets benchmark.
In summary, we have demonstrated that this NBMF algorithm can leverage the D-Wave 2X as a key component in an unsupervised machine learning analysis.Getting performance from the D-Wave 2X that is competitive with advanced classical tools on a real-world problem is a significant step forward on the journey towards practical quantum annealing.While there is still much work to be done to make these quantum annealers of practical use for this type of problem, our performance results give a glimmer of hope that this may someday be the case.

Figure 2 :
Figure 2: The same features as shown in figure 1 are shown.Here they are rescaled to maximize contrast so that the darkest pixel is black and the brightest pixel is white.

Figure
Figure3: Cumulatve time-to-targets for qbsolv (red) and Gurobi (blue) as a function of the number of annealing cycles executed by the D-Wave.When the number of anneals is 10, 100, or 1,000, the cumulative time-to-targets for both qbsolv and Gurobi exceeds the cumulative annealing time (orange).When the number of anneals is 10,000, the cumulative time-to-targets for qbsolv exceeds the cumulative annealing time, but Gurobi's cumulative time-to-target is less than the cumulative annealing time.Note that in the 10, 100, and 1,000 anneal cases, 24,290 QUBOs were solved whereas only 19,432 QUBOs were solved in the 10,000 anneal cases.This was caused by earlier termination of the NBMF (algorithm 1).

Figure 4 :
Figure 4: The time-to-target for each instance of equation 3 is shown for qbsolv (red dots), Gurobi (blue dots) in comparison to the annealing time used by the D-Wave (orange line) is shown.A different number of anneals was used in each execution of the NBMF (algorithm 1) ranging from 10 to 10,000.