Binary matrix factorization on special purpose hardware

Many fundamental problems in data mining can be reduced to one or more NP-hard combinatorial optimization problems. Recent advances in novel technologies such as quantum and quantum-inspired hardware promise a substantial speedup for solving these problems compared to when using general purpose computers but often require the problem to be modeled in a special form, such as an Ising or quadratic unconstrained binary optimization (QUBO) model, in order to take advantage of these devices. In this work, we focus on the important binary matrix factorization (BMF) problem which has many applications in data mining. We propose two QUBO formulations for BMF. We show how clustering constraints can easily be incorporated into these formulations. The special purpose hardware we consider is limited in the number of variables it can handle which presents a challenge when factorizing large matrices. We propose a sampling based approach to overcome this challenge, allowing us to factorize large rectangular matrices. In addition to these methods, we also propose a simple baseline algorithm which outperforms our more sophisticated methods in a few situations. We run experiments on the Fujitsu Digital Annealer, a quantum-inspired complementary metal-oxide-semiconductor (CMOS) annealer, on both synthetic and real data, including gene expression data. These experiments show that our approach is able to produce more accurate BMFs than competing methods.


Introduction
Many fundamental problems in data mining consist of discrete decision making and are combinatorial in nature.Examples include feature selection, data categorization, class assignment, identification of outlier instances, k-means clustering, combinatorial extensions of support vector machines, and consistent biclustering, to mention a few [1].In many cases, these underlying problems are NP-hard, and approaches to solving them therefore dependent on heuristics.Recently, researchers have been exploring different computing paradigms to tackle these NP-hard problems, including quantum computing and the development of dedicated special purpose hardware.The Ising and quadratic December 16, 2021 1/22 arXiv:2010.08693v2[cs.LG] 7 Jan 2022 unconstrained binary optimization (QUBO) models are now becoming unifying frameworks for the development of these novel types of hardware for combinatorial optimization problems.Binary matrix factorization is an NP-hard combinatorial problem that many computational tasks originating from a wide range of applications can be reformulated into.These applications include areas such as data clustering [2][3][4][5][6], pattern discovery [7,8], dictionary learning [9], collaborative filtering [10], association rule mining [11], dimensionality reduction [12], and image rendering [13].As such, any advances in solving the binary matrix factorization problem, can potentially lead to breakthroughs in various application domains.
In this paper we show how the aforementioned hardware technologies, via the QUBO framework, can be used for binary matrix factorization.As Moore's law comes to an end [14], investigating how such post-Moore's law technologies can be used is an important task.This is especially true for primitives like binary matrix factorization which are used in data mining tasks that continue to grow ever larger and more complex.We make the following contributions in this paper: • Provide two QUBO formulations for one variant of the binary matrix factorization problem.To the best of our knowledge, these are the first methods specifically designed for solving binary matrix factorization on quantum and quantum-inspired hardware to appear in the literature.We additionally propose a simple baseline method which outperforms our more sophisticated methods in a few situations.
• Show how constraints that are useful in clustering tasks can easily be incorporated into the QUBO formulations.
• Present a sampling heuristic for factorizing large rectangular matrices.
• Conduct experiments on both synthetic and real data on the Fujitsu Digital Annealer.These experiments suggest that our method is able to achieve higher accuracy than competing methods in the kind of binary matrix factorization we consider.

Binary matrix factorization
Let A ∈ {0, 1} m×n be a matrix with binary entries.For a positive integer r ≤ min(m, n), the rank-r binary matrix factorization (BMF) problem is min We discuss other definitions of BMF that appear in the literature in the section Related work.

Example 1 (Exact BMF). Define matrices
Then A = U V is an exact BMF of A.

The QUBO framework
Let Q ∈ R n×n be a matrix.A QUBO problem takes the form min x x Qx s.t.x ∈ {0, 1} n . ( December 16, 2021 The tutorial by Glover et al. [15] is a good introduction to this problem which also discusses some of its many applications.

The Digital Annealer
The Fujitsu Digital Annealer (DA) is a hardware accelerator for solving fully connected QUBO problems.Internally the hardware runs a modified version of the Metropolis-Hastings algorithm [16,17] for simulated annealing.The hardware utilizes massive parallelization and a novel sampling technique.The novel sampling technique speeds up the traditional Markov Chain Monte Carlo (MCMC) method by almost always moving to a new state instead of being stuck in a local minimum.As explained in [18], in the DA, each Monte Carlo step takes the same amount of time, regardless of accepting a flip or not.In addition, when accepting the flip, the computational complexity of updating the effective fields is constant regardless of the connectivity of the graph.The DA also supports Parallel Tempering (replica exchange MCMC sampling) [19] which improves dynamic properties of the Monte Carlo method.In our experiments, we use the DA coupled with software techniques as our main QUBO solver.

Notation
Bold upper case letters (e.g.A) denote matrices, bold lower case letters (e.g.x) denote vectors, and lower case regular and Greek letters (e.g.x, λ) denote scalars.Subscripts are used to indicate entries in matrices and vectors.For example, a ij is the entry on position (i, j) in A. A * in a subscript is used to denote all entries along a dimension.For example, a i * and a * j are the ith row and jth column of A, respectively.We use 1, 0 and I to denote a matrix of ones, a matrix of zeros, and the identity matrix, respectively.Subscripts are also used to indicate the size of these matrices.For example, 1 m×n is an m × n matrix of all ones, 1 n def = 1 n×n is an n × n matrix of all ones, and I n is the n × n identity matrix.These subscripts are omitted when the size is obvious.Superscripts in parentheses will be used to number matrices and vector (e.g.A (1) , A (2) ).The matrix Kronecker product is denoted by ⊗.The function vec(•) takes a matrix and turns it into a vector by stacking all its columns into one long column vector.The function diag(•) takes a vector input and returns a diagonal matrix with that vector along the diagonal.Semicolon is used as in Matlab to denote vertical concatenation of vectors.For example, if u ∈ R m and v ∈ R n are column vectors, then [u; v] is a column vector of length m + n.The Frobenius norm of a matrix A is defined as For positive integers n, we use the notation [n] def = {1, 2, . . ., n}.

Related work
The most popular methods for BMF are the two by Zhang et al. [2,3].Their first approach alternates between updating U and V until some convergence criteria is met.It incorporates a penalty which encourages the entries of U and V to be near 0 or 1.At the end of the algorithm, the entries of U and V are rounded to ensure they are exactly 0 or 1.Their second approach initializes U and V using nonnegative matrix factorization.For each factor matrix, a threshold is then identified, and values in the matrix below and above the threshold are rounded to 0 and 1, respectively.
December 16, 2021 3/22 Koyutürk and Grama [11] develop a framework called PROXIMUS, which decomposes binary matrices by recursively using rank-1 approximations, which results in a hierarchical representation.Shen et al. [7] provide a linear program formulation for the rank-1 BMF problem and provide approximation guarantees.Ramírez [9] presents methods for BMF applied to binary dictionary learning.Kumar et al. [13] provide faster approximation algorithms for BMF as well as a variant of BMF for which inner products are computed over the finite field of two elements (GF(2)).Diop et al. [20] propose a variant of BMF for binary matrices which takes the form A ≈ Φ(U V ), where U and V are binary, and Φ is a nonlinear sigmoid function.They use a variant of the penalty approach by [2,3] to compute the decomposition.
Boolean matrix decomposition, which is also referred to as binary matrix decomposition by some authors, is similar to the BMF in (1), but an element (U V ) ij is computed via instead of the standard inner product, where denotes disjuction.Some works that consider Boolean matrix decomposition include [4-6, 8, 10, 12, 21, 22].For a theoretical comparison between BMF, Boolean matrix factorization and a variant of BMF computed over GF (2), we refer the reader to the recent paper by DeSantis et al. [23].
There are previous works that use special purpose hardware to solve linear algebra problems.O'Malley and Vesselinov [24] discuss how linear least squares can be solved via QUBO formulations on D-Wave quantum annealing machines.They consider both the case when the solution vector is restricted to being binary and when it is real valued.The real valued case is handled by representing entries in the solution vector using a fixed number of bits.O'Malley et al. [25] consider a nonnegative/binary factorization of a real valued matrix of the form A ≈ W H, where W has nonnegative entries and H is binary.To compute this factorization, they use an alternating least squares approach by iteratively alternating between solving for W and H.When solving for H, they use the QUBO formulation from [24] for the corresponding binary least squares problem and do the computation on a D-Wave quantum annealer.Drawing inspiration from [24], Ottaviani and Amendola [26] propose a QUBO formulation for low-rank nonnegative matrix factorization and also implement it on a D-Wave machine.They too use an alternating least squares approach combined with real number representations similar to those in [24].Borle et al. [27] show how the Quantum Approximate Optimization Algorithm (QAOA) framework can be used for solving binary linear least squares.Their paper includes experiments run on an IBM Q machine.Unlike our paper, none of the works [24][25][26][27] consider binary matrix factorization.Additionally, an important difference between our work and the decomposition techniques developed in [25,26] is that those papers update the factor matrices in an alternating fashion.Our two QUBO formulations, by contrast, solve for both factor matrices at the same time, which may help avoid the issue of getting stuck in local minima that alternating algorithms are susceptible to.However, we do incorporate alternating optimization as a post-processing step in our experiments since this can sometimes further improve the quality of the solutions that come from solving the QUBO formulations.See the sections Handling large rectangular matrices and Experiments for details.
There has also been a large body of research on utilizing special purpose hardware for data clustering problems, for example [28][29][30][31][32] to name a few.A recent paper by S ¸eker et al. [33] performs a comprehensive computational study comparing the DA to multiple state-of-the-art solvers on multiple different combinatorial optimization problems.They find that the DA performs favorably compared to the other solvers, particularly on large problem instances.

QUBO formulations for BMF
Writing out the objective in (1), we get where the summations are over i ∈ [m], j ∈ [n], and k, k ∈ [r].Our goal is to reformulate this into the QUBO form in (2).The fourth order term in (3) stops us from directly writing (3) on the quadratic form in (2).We can get around this by introducing appropriate auxiliary variables and penalties.

Formulation 1
For our first formulation, we introduce auxiliary variables We arrange these variables into m × n matrices ( To incorporate the constraints ( 4) in the QUBO model, we express them as a penalty instead.A standard technique for this [15] is to use a penalty function f : Notice that f (a, b, c) = 0 if a = bc and f (a, b, c) ≥ 1 otherwise.Letting λ be a positive constant, a penalty variant of ( 5) is min if and only if it minimizes (7).
Proof.For brevity, we denote the objectives in ( 5) and ( 7) by OBJ 1 and OBJ 2 , respectively.Setting all entries in the matrices U , V , W (1) , . . ., W (r) to zero would yield an objective value of OBJ F and therefore could not be a minimizer of (7).Any minimizer of (7) must therefore satisfy (4).
We now state the QUBO formulation of (7).Define , and let where x is a column vector of length (m + n + mn)r.Furthermore, define the QUBO matrix as where Proposition 3.With x and Q defined as in (8) and (9), respectively, the problem (7) can be written as The proof is a straightforward but somewhat tedious calculation and is omitted.Although removing the constant A 2 F does not affect the minimizer(s) of (10), it can serve as a useful target: If x Qx = − A 2 F , then we know that we have found a global minimum, provided that the condition on λ in Proposition 2 is satisfied.Such a target value can be supplied to QUBO solvers like D-Wave's QBSolv to allow for early termination when the target is reached.

Formulation 2
For our second formulation, we again consider (3) and introduce auxiliary variables We treat U = ( u i(kk ) ) and V = ( v j(kk ) ) as matrices of size m × r 2 and n × r 2 , respectively, with u * (kk ) being the (k + k r)th column of U , with similar column ordering for V .An equivalent formulation to (1) is then min December 16, 2021 6/22 We use the function f defined in (6) to incorporate the constraints (11) in the objective.A penalty variant of ( 12) is min (12) if and only if it minimizes (13).
The proof is similar to that for Proposition 2 and is omitted.Since (1) and ( 12) are equivalent, Proposition 4 implies that the matrices U and V we get from minimizing (13) are minimizers of (1) when λ is sufficiently large.
We now state the QUBO formulation of (13).Define u and v as before.Furthermore, define where y is a column vector of length (m + n)(r + r 2 ).Furthermore, define the QUBO matrix as where Proposition 5.With y and P defined as in (14) and (15), respectively, the problem (13) can be written as The proof is a straightforward and is omitted.

Useful constraints for data analysis
In this section we show how certain constraints that are helpful for data mining tasks easily can be incorporated into the QUBO formulations.One approach to clustering of the rows and/or columns of a binary matrix A is to compute a BMF A ≈ U V and then use the information in U and V to build the clusters.This idea is used e.g. by [2,3] for gene expression sample clustering and document clustering.For gene expression data, the rows of A represent genes and the columns represent samples, e.g. from different people.An unsupervised data mining task on such a dataset could be to identify and cluster people based on if they have cancer or not.One way to do this is to compute a rank-2 BMF of A and assign sample j to cluster k ∈ {1, 2} if v jk = 1.In December 16, 2021 7/22 many cases, it is reasonable to require that each column belongs to precisely one cluster.For example, when clustering people based on if they have cancer or not, we want to assign every person to precisely one of two clusters.Such a requirement can be incorporated by enforcing that A penalty variant of this constraint is where λ > 0. Since V is binary, the penalty is zero when ( 17) is satisfied, and at least λ otherwise.This penalty can simply be added to the objectives in ( 7) and (13).As before, we can ensure that the penalized and constrained formulations have the same minimizers by choosing λ > 2r A 2 F .The penalty ( 18) is straightforward to incorporate into either QUBO formulation.Define an (m + n)r × (m + n)r matrix .
The QUBO formulations in (10) and ( 16) are easily modified to incorporate (18) by defining modified QUBO matrices 3) , 3) , where the submatrices Q (i) and P (i) are defined as before.

Handling large rectangular matrices
In this section, we present a strategy for handling large rectangular matrices.We consider the case when A is m × n with m n and n is of moderate size.These ideas also apply when n m and m is of moderate size.Random sampling of rows and columns is a popular technique in numerical linear algebra for compressing large matrices.These compressed matrices are then used instead of the full matrices in computations.For an introduction to this topic we recommend the survey by Mahoney [34].
A popular sampling approach is to sample according to the leverage scores of the matrix.Suppose A ∈ R m×n is a nonzero matrix, and let B ∈ R m×rank(A) be an orthonormal matrix whose columns form a basis for range(A).The leverage scores of A are defined as i (A) B can be computed via, e.g., the singular value decomposition (SVD).The cost O(mn 2 ) of computing the SVD of A is small compared to the cost of solving the BMF.If this cost proves to be too expensive, then there are techniques for estimating the leverage scores that only cost O(mn log m) [35].When sampling rows of A according to the leverage scores, we sample the ith row of A with probability p i these in a new matrix A (s) ∈ R s×n .We then compute a rank-r BMF A (s) ≈ U (s) V .
To get a BMF for the original matrix A, we then solve the binary least squares (BLS) problem where V comes from the factorization of A (s) .By expanding the objective, the problem in ( 19) can be written as m independent BLS problems involving r binary variables.These BLS problems can be solved via a QUBO formulation.As discussed in [24,25], such a formulation is easy to derive by noting that the BLS objective can be written as Setting gives us a QUBO objective as in (2).Alternatively, when r is small, the optimal solution to each BLS problem can be found by simply testing all 2 r possible solutions.As an optional step after computing U , a few additional alternating BLS steps can be added.This is done by minimizing the objective in (19) in an alternating fashion, first solving for V and treating U as fixed, and then solving for U and treating V as fixed.We found that Formulation 1 yields a lower decomposition error for a given number of iterations than Formulation 2 on the Fujitsu DA.We therefore use the former in our December 16, 2021 9/22 experiments and refer to it as "DA BMF" or just "DA" in the tables.Additionally, we try adding a few extra alternating BLS steps (as discussed in the section Handling large rectangular matrices) to the solutions we get from the DA.We do at most 20 alternating BLS solves, and whenever no improvement occurs after two consecutive BLS solves, we terminate.Since r ≤ 5 in our experiments, we solve the BLS problems exactly by checking all possible solutions.We refer to this method as "DA+ALS BMF" or just "DA+ALS" in the tables.For some of the real datasets, we incorporate the constraint in the section Useful constraints for data analysis.For cases when A is large and rectangular, we use the sampling technique in the section Handling large rectangular matrices.We will point out when the sampling and/or additional constraints are used.We run our proposed method on the Fujitsu DA for a fixed number of 1e+9 iterations.Here, an iteration refers to one iteration of the for loop on line 5 in Algorithm 2 of [18].We do not try to find an optimal number of iterations.We take this approach to avoid cherry picking a number of iterations that works great for each individual problem.By choosing a relative large number of iterations, we are also hoping to push the hardware to see how good solutions it can find.

Experiments Table 1. Mean relative error for synthetic
As discussed by Glover et al. [15], although the penalty λ needs to be sufficiently large to ensure that the constrained and penalized versions of our optimization problems have the same minimizers, setting λ to a smaller value may improve the solution produced by a QUBO solver in practice.An intuitive explanation for this phenomenon is that a large λ value gives a steeper optimization landscape which can make it difficult for a solver to escape local minima.We find this to be true when running our methods on the Fujitsu DA as well.We use λ = 1 in all our experiments since we found this to December 16, 2021 improve the solution quality, while at the same time avoiding constraint violations.As mentioned in the section Related work, the two methods by Zhang et al. [2,3] are the most popular for the variant of BMF we consider.We therefore use these methods for comparison in our experiments.We refer to them as "Penalized" and "Thresholded," respectively.For the penalized version, we use the Bmf method in the Nimfa Python library [36] available at http://nimfa.biolab.si.We leave all parameters to their default values, except the maximum number of iterations (max_iter) and the frequency of the convergence test (test_conv) which we both set to 1000 since we find that this improves performance substantially over the defaults in our experiments.We wrote our own implementation of the thresholded method since we could not find an existing implementation; see the section Implementation of thresholding method for BMF in Supplementary material for details.
We also include a simple baseline method.The idea behind it is simple: If we seek a rank-r BMF of A, we can find one by simply choosing the densest r rows/columns in A. Alternatively, when A has high density, we can approximate it by a rank-1 BMF with all entries equal to 1.This is clearly a very crude method, but it serves as a useful baseline and sanity check for the more sophisticated methods.See the section Details on baseline method in Supplementary material for further details.
All experiment results are evaluated in terms of the following relative error measure: The norm is squared since this is more natural for binary data: When U and V are binary, A − U V 2 F is the number of entries that are incorrect in the decomposition and A 2 F is the number of nonzero entries in A.

Real data
In the first experiment on real data we consider the MNIST handwritten digits dataset [37] (available at http://yann.lecun.com/exdb/mnist/).We consider ten instances of each digit 0-9.The digits are 28 × 28 grayscale images with pixel values in the range [0, 255], where 0 represents white and 255 represents black.We make these binary by setting values less than 50 to 0, and values greater than or equal to 50 to 1.We apply BMF to the digits with target ranks r ∈ [5].The QUBO problem in DA BMF and DA+ALS BMF has (28 + 28 + 28 2 )r = 840r binary variables.Table 5 presents the mean relative error for each method across all instances of all digits.Fig 1 shows an example of the binary digit 3 and the low-rank approximations given by our DA+ALS BMF method.In the second experiment on real data, we consider two gene expression datasets for two types of cancer: leukemia and malignant melanoma.The first dataset (available at https://www.pnas.org/content/101/12/4164)contains 38 gene samples for two kinds of leukemia, one of which can be further split into two subtypes [38].The second dataset (available at https://schlieplab.org/Static/Supplements/CompCancer/CDNA/bittner-2000/) contains 38 gene samples, 31 of which are melanomas and 7 of which are controls [39].We make these datasets binary by using the same thresholding approach as [3] which we describe here briefly.Let 0 ≤ c 1 < c 2 be two real numbers.For a matrix A ∈ R m×n with nonnegative entries, let computed by setting its entries Optionally, the columns of A can be normalized so that they have unit Euclidean norm prior to discretization.As an additional step, we remove any rows from A that contain only zeros.Following [2,3], we use c 1 = 1/7 and c 2 = 5 without column normalization on the leukemia dataset.The melanoma dataset contains negative entries.We therefore first add a constant − min ij a ij to all entries in A. Then, we normalize the columns of the resulting matrix and discretize using c 1 = 0.96 and c 2 = 1.04.After thresholding and removing any rows that are all zero, the two datasets are matrices of size 4806 × 38 and 2201 × 38, respectively.We use the sampling technique in the section Handling large rectangular matrices for both datasets with a sample size of s = 30.Furthermore, we include the constraints on the V matrix in the QUBO formulations as discussed in the section Useful constraints for data analysis.Although we expect that this will increase the decomposition error somewhat, including such constraints can be helpful e.g. when clustering the samples.The QUBO problem in our methods has (30 + 38 + 30 • 38)r = 1208r binary variables.Table 6 reports the mean relative error across 10 trials for each dataset.

Discussion
The accuracy improvement of DA+ALS BMF over DA BMF is typically very small, but more substantial in a few cases.Our DA+ALS BMF has the same or better accuracy as either of the methods by [2,3] in 72 of the 75 experiments reported in Tables 1-6 in this paper, and a strictly better accuracy in 57 of the experiments.
For a fixed rank, the number of binary variables for the QUBO problem in DA BMF is similar across all experiments.The anneal time, which is the time spent by the DA looking for a solution, is about 40 seconds for the largest problems.The additional ALS steps used for DA+ALS take on average much less than a second when m = 30, and less than about 3 seconds when m = 2000.For m = 50000, the additional time for ALS can be more substantial, adding on average as much as 66 seconds for rank 5 experiments.Two advantages of the methods by [2,3] are that they typically are quite fast and that they can run on a standard computer.For all experiments except the synthetic ones with m = 50000, their penalized and thresholded algorithms run in less than 2 and 20 seconds on average, respectively.The very large experiments with m = 50000 can take longer, up to 39 and 289 seconds on average for the penalized and thresholded algorithms, respectively, when r = 5.
Based on these observations, DA+ALS BMF seems like the superior method when accuracy is crucial.The methods by [2,3] may be more suitable when speed and accessibility are more important.With that said, we believe that the DA could be run with many fewer iterations with little or no degradation in performance in most of our experiments.The trade-off between accuracy and speed for the DA, and how to choose the number of iterations to strike a good balance, are interesting directions for future research.
Certain matrices, like those of size 2000 × 30 and expected density 0.2 considered in Table 2, seem inherently difficult to handle for any of the sophisticated methods.Indeed, it is surprising that the simple baseline method substantially outperforms all other methods.

Conclusion
BMF has many applications in data mining.We have presented two ways to formulate BMF as QUBO problems.These formulations can be used to do BMF on special purpose hardware, such as the D-Wave quantum annealer and the Fujitsu DA.We also discussed how clustering constraints can easily be incorporated into our QUBO formulations.Moreover, we showed how sampling and alternating binary least squares can be used to handle large rectangular matrices.Our experiments, which we run on the Fujitsu DA, are encouraging and show that our proposed methods typically give December 16, 2021 15/22 more accurate solutions than competing methods.The special purpose hardware technologies discussed in this paper are still in an early phase of development.As these technologies mature, we believe that they will emerge as powerful tools for solving problems in data mining and other areas.

Supplementary material
Implementation of thresholding method for BMF Zhang et al. [2,3] present two algorithms for BMF.The penalty based version (Algorithm 1 in [3]) is available in existing Python implementations; see e.g.[40] and [41].We have not been able to find an implementation of their thresholding algorithm (Algorithm 2 in [3]).For their thresholding algorithm, they give two alternative solution approaches: Discretization and gradient descent.They only use the gradient descent based variant in their experiments, since they say discretization is too computationally expensive.However, as we show below, if the discretized variant is implemented carefully, it can find the optimal solution to the thresholding approach in O(mnr 2 min(m, n)) time, which is fast enough for our experiments.
For a matrix W ∈ R m×r , we define g to be the function that outputs a vector v = g(W ) ∈ R mr such that v contains all the elements in W , ordered in descending order, i.e., such that v i ≥ v j whenever i < j.When two entries in W have the same value, which of them comes first in v does not matter, as long as this is decided in some consistent fashion.We also assume that the functions f r and f c take an entry v k as input and return the row and column position, respectively, of this entry in W .For example, if the number v k corresponds to an entry in position (i, j) in W , then f r (v k ) = i and f c (v k ) = j.The thresholding function θ : R → {0, 1} is defined as The number η is any positive constant.Algorithm 1 provides our implementation of the thresholding method by [3].We use a variant of it coded in Python in our experiments.The reason this algorithm works is that there are only mr + 1 ways to threshold W and only nr + 1 ways to threshold H.The nested for loops search through all the thresholdings that result in nonzero values of X.Moreover, since each entry of X is nondecreasing during the iterations of the inner for loop, if X is nonbinary for some p and some q = k i , it will be nonbinary for that same p and all subsequent q = k j where j > i.And this is why it is fine to break out of the inner for loop, since none of the following iterates are going to lead to valid factorizations.Algorithm 1 is presented at a higher level to make it easier to follow.We now discuss some implementation details that make the algorithm faster and allow us to achieve O(mnr 2 min(m, n)) run time.
• Line 7: Setting W = 0 m×r before the outer for loop, it is sufficient to make the update w fr(p)fc(p) = 1 on this line.
• Line 9: Setting H = 0 r×n before the inner for loop, it is sufficient to make the update h fr(q)fc(q) = 1 on this line.
• Line 10: Setting X = 0 m×n before the inner for loop, it is sufficient to make the update x * fc(q) = x * fc(q) + w * fr(q) on this line.
• Line 11: Assuming that the column x * fc(q) was saved prior to making the update on line 10 in a vector t, we can now compute December  • Lines 16 and 17: Instead of updating all entries in the two matrices, it is sufficient to just keep track of the pair (i, j) corresponding to the p = v i and q = k j that resulted in the new best decomposition.The matrices W and H to be returned can then be computed right before the return statement.
Assume m ≤ n.The complexity of each inner loop iteration in the algorithm is O(m).This is then repeated O(mnr 2 ) times over all inner and outer loop iterations.The total cost is therefore O(mnr 2 • m).If n ≤ m, we can transpose the input matrix before applying the algorithm, resulting in a complexity of O(mnr 2 • n).The complexity is therefore O(mnr 2 min(m, n)), as claimed.

Details on baseline method
Algorithm 2 describes our baseline method in detail.On lines 1 and 2, U and V are initialized to zero matrices.On line 4, the current residual E is computed.On line 5, the least index i is computed such that the sum of the i th row of E is equal to the max row sum of E. A similar index is computed for the columns of the residual on line 6.The update on lines 8 and 9 eliminate the densest column in the residual.Similarly, the update on lines 11 and 12 eliminate the densest row in the residual.The first if statement determines if a row or column should be eliminated in order to maximizing the reduction in the residual.Finally, the second if statement checks if a simple rank-1 approximation consisting of only ones yields a better approximation than the row/column elimination procedure.

Algorithm for generating binary matrices
In this section, we present the algorithm we use for generating A with an exact rank-r decomposition in the synthetic experiments.It is given in Algorithm 3. The factor matrices U and V are randomly initialized with entries drawn independently from Bernoulli distributions with success probabilities p U and p V , respectively.We use p U = p V = 0.7 in our experiments.At this point, if r > 1, it may be the case that the product A = U V is not binary.The purpose of the while loop is to eliminate nonzero entries in U and V until U V is binary.ρ is a function which returns the average of the entries in a matrix, e.g., ρ(U ) = ik u ik /(nr).For each iteration of the while loop, a nonzero entry is eliminated in the factor matrix with the highest entry average.Eventually, A = U V will be binary, at which point the while loop terminates and A is returned.
def = i (A)/ rank(A) for i ∈ [m].This definition guarantees that i p i = 1.We use leverage score sampling as a heuristic for compressing A by sampling s m of the rows of A with replacement according to the distribution (p i ) and putting December 16, 2021 8/22 A with an exact decomposition.The * symbol indicates methods we propose.Best results are underlined.Target ranks r (m = 30)

Fig 2 .
Fig 2. The thresholded leukemia data.Black entries are 1 and white entries are 0.

F 12 Set 15 else
flag to true if X ∈ {0, 1} m×n , false otherwise 13 if flag is false then 14 break // Since X won't be binary for subsequent qs and H • Line 12: The value of the flag can be computed by evaluating the truth of [x * fc(q) ∈ {0, 1} m ].

Algorithm 2 : 6 7 if j e i j < i e ij then 8 9 Set v j k = 1 10 else 11 Set u i k = 1 12 2 Fthen 16 Set
Baseline method Data: Matrix A, target rank r Result:Binary matrices U , V such that A ≈ U V is a binary matrix factorization 1 Set U = 0 m×r 2 Set V = 0 n×r 3 for k ∈ [r] do 4 Set E = A − U V // Compute residual 5 Let i = min{i ∈ [m] : j e ij = max j e j } Let j = min{j ∈ [n] : i e ij = max i e i } /* Eliminate row/column in residual with most nonzeros */ Set u * k = e * j Set v * k = e i * if trivial rank-1 approximation of all 1's is better */ 15 if A − 1 m×n < A − U V 2 F U = [1 m×1 , 0 m×(r−1) ]17Set V = [1 n×1 , 0 n×(r−1) ]18 end19 return U , V

Table 2 .
Mean relative error for synthetic A for which a ij ∼ Bernoulli(0.2).The * symbol indicates methods we propose.Best results are underlined.

Table 3 .
Mean relative error for synthetic A for which a ij ∼ Bernoulli(0.5).The * symbol indicates methods we propose.Best results are underlined.

Table 4 .
Mean relative error for synthetic A for which a ij ∼ Bernoulli(0.8).The * symbol indicates methods we propose.Best results are underlined.

Table 5 .
Mean relative error for MNIST experiments.The * symbol indicates methods we propose.Best results are underlined.

Table 6 .
Mean relative error for gene expression data.The * symbol indicates methods we propose.Best results are underlined.
Efficient implementation of Algorithm 2 in [2] Data: Matrices A, W , H such that A ≈ W H is a nonnegative matrix factorization Result: Binary matrices W , H such that A ≈ W H is a binary matrix factorization /* Initialization */ 1