A loop-counting method for covariate-corrected low-rank biclustering of gene-expression and genome-wide association study data

A common goal in data-analysis is to sift through a large data-matrix and detect any significant submatrices (i.e., biclusters) that have a low numerical rank. We present a simple algorithm for tackling this biclustering problem. Our algorithm accumulates information about 2-by-2 submatrices (i.e., ‘loops’) within the data-matrix, and focuses on rows and columns of the data-matrix that participate in an abundance of low-rank loops. We demonstrate, through analysis and numerical-experiments, that this loop-counting method performs well in a variety of scenarios, outperforming simple spectral methods in many situations of interest. Another important feature of our method is that it can easily be modified to account for aspects of experimental design which commonly arise in practice. For example, our algorithm can be modified to correct for controls, categorical- and continuous-covariates, as well as sparsity within the data. We demonstrate these practical features with two examples; the first drawn from gene-expression analysis and the second drawn from a much larger genome-wide-association-study (GWAS).

Given N T we choose κ 2 = κ 2 (N T ) (e.g., if there were 2 continuous-covariate-components per patient we would choose κ 2 ≈ 0.34). This then implies that the sub-scores for j ∈ J B or k ∈ K B will be: Consequently, when the covariates of B are drawn from µ (T |Q), we expect that on average with respect to µ (Q) the signal produced by B will be.
[Z ROW ] 2 j ∼ 0 and [Z COL ] k ∼ 0. To summarize so far: we've introduced a method for calculating row-and column-scores which corrects for multidimensional continuous-covariates in the following sense. A bicluster B with its covariates drawn isotropically from the covariate-space R NT will produce a signal comparable to what that bicluster would have produced in our original algorithm (i.e., a row-signal of (m/M ) (n/N ) 2 (2g − 1) and a column-signal of (m/M ) 2 (n/N ) (2g − 1)). If the covariates associated with B are not isotropically distributed in R NT , but are instead concentrated with respect to angle in R NT , then the signal produced by B will be demoted. A bicluster B with its covariates drawn from only a half-space of R NT will, on average, produce a signal of 0. A bicluster B with its covariates drawn from an even more narrow cone of R NT will produce a negative signal. And finally, in the extreme scenario where each covariate of B is identical, the row-signal of B will be − 1/κ 2 − 1 (m/M ) (n/N ) 2 (2g − 1) and the column-signal − 1/κ 2 − 1 (m/M ) 2 (n/N ) (2g − 1). In order for this kind of covariate-correction to be useful, the number of patients m B in any planted bicluster B must be sufficiently large. Specifically, m B must be large enough that the averages Z Thus, if m B is insufficiently large, even the signals produced by covariate-balanced biclusters will be demoted, and our algorithm will not perform well. Based on this observation, we see that the m B required for covariate-estimation grows exponentially with N T , but is not too large for small N T ; for N T = 2 the required m B is roughly a few dozen or so (see Figs 48, 49 and 50).
Before we move on, we mention another kind of continuous-covariate correction; one that we'll refer to later as a '1-sided' correction. These numerical experiments illustrate the performance of our algorithm on a simple planted-bicluster problem involving continuous covariates with dimension N T = 2. For each numerical experiment we let D be a large M × M random matrix with entries chosen independently from a distribution with median 0. We let T be an M × 2 matrix of covariates with each row chosen randomly from an isotropic distribution on R 2 . For illustration, the rows of T are represented as unit-vectors in R 2 . Once we have chosen T , we select a randomly-oriented half-space H of R 2 and then determine the subset J H of rows of D which have covariates-vectors lying in H. In our illustration we depict one such randomly-oriented H with a white semicircle, and the complement H c with a grey semicircle. We've organized the rows of D so that the rows in J H are on top and the rows of J H c are on the bottom. In Panel-A we implant within D a small m × m bicluster B1 that is rank-1 with error-ε. To implant B1 we choose the m rows and columns at random. Note that B1 will typically be balanced with respect to the covariates; roughly half the rows of B1 will come from J H , and half will come from J H c . We also implant within D a second rank-1 error-ε 'red-herring' bicluster B2 of size m × m. We choose m columns at random from D, and m rows from J H to implant B2. Note that bicluster B2 is not balanced with respect to the covariates. Note that our illustration is slightly misleading; B1 and B2 will not necessarily be disjoint. Rather, because they are randomly chosen, the row-and column-subsets of B1 and B2 will typically have a small overlap. For the purposes of this simple numerical experiment the entries in this overlap are randomly assigned to either B1 or B2. In Panel-B we implant B1 and B2 as before, except that B2 is of size 2m × 2m. We also implant a rank-1 error-ε B3 of size 4m × 4m with rows restricted to J H c . For each numerical experiment we calculate the average auc A 1 for the bicluster B1, as well as the average auc A 2 for bicluster B2 (as well as A 3 for B3 if it exists). Our goal is to find the first, despite the masking effect of the latter. As shown in Fig 49, an m × m B2 does not compromise the effectiveness of our algorithm all that much. However, as shown in Fig 50, the presence of a larger B2 and B3 interfere with our algorithm's ability to successfully detect B1 when M is insufficiently large.  Fig 48A). In panel-A we show the trial-averaged value of A 1 as a function of ε √ m and log M (m). In panel-B we show the trial-averaged value of A 2 . The presence of B2 slightly impacts our algorithm's performance (compare with Fig 26). Note that, when M is insufficiently large (in this case 64), our algorithm focuses on B2 rather than B1.  Fig 48B). In panel-A we show the trial-averaged value of A 1 as a function of ε √ m and log M (m). In panel-B we show the trial-averaged value of max (A 2 , A 3 ) which is usually A 3 . The presence of B2 and B3 make it difficult for our bicluster to detect the smaller B1 when the M is low. Note that, when M 1024, our algorithm focuses on B2 or B3 rather than B1. Z [t] 1s COL k = j ′ ,j∈D, j ′ =j; k fixed, k ′ =k D jk D jk ′ D j ′ k ′ T j ′ t D j ′ k , for each t ∈ {1, . . . .N T } . This 1-sided correction only counts the continuous-covariates associated with the j ′ row-index, ignoring the contribution from T j,t . The associated row-and column-scores are: The 1-sided correction is essentially the same as the 2-sided correction with regards to the row-subscores. However, the 1-sided correction is rather different for the column-subscores. Under the 1-sided correction the column-subscores only capture the norm (rather than the norm-squared) of the averaged continuous-covariate vector accumulated across biclusters that include column-k. This difference won't affect the D-only situation very much, but will have ramifications when we include the X-matrix of controls in section 12.

Correcting for Sparsity
Up to this point we've been assuming that the entries of the original data-matrix take on a range of values, both positive and negative. Our notion of low-rank was predicated on this assumption: Indeed, we've spent our time analyzing how our loop-scores function when the underlying data-matrix was drawn from an isotropic gaussian distribution, assuming that any planted-biclusters were drawn from an eccentric (nonisotropic) gaussian distribution. Because binarization is built into our algorithm, our analysis immediately carries forward to the situation where the original data is binary rather than continuous, provided that we assume the underlying data-matrix was drawn from a binarized version of an isotropic gaussian, and that any planted-biclusters were drawn from binarized versions of eccentric gaussians.
All of this analysis depends on one thing: our algorithm assumes that the fraction of negative entries within the datamatrix is comparable to the fraction of positive entries. If the original data-matrix has a surplus of, say, negative entries, then the binarized version of the data-matrix will be 'sparse' (i.e., the fraction of positive entries will be much less than 0.5, and the fraction of negative entries will be much greater than 0.5). Such a sparse matrix is likely to contain large submatrices consisting of mostly negative entries, simply due to their abundance. As we originally described above, our algorithm will typically focus on these large submatrices, even though they are not statistically significant. The reason for this behavior is that the abundance of negative entries gives rise to an abundance of 'all-negative loops' (i.e., rank-1 loops involving only negative entries). The loop-score we describe in section 2 treats all rank-1 loops the same way -each adds one to the total score -regardless of whether or not those loops consist of positive or negative entries (or a mixture of the two). Therefore, the large number of all-negative loops drowns out the other (more significant) rank-1 loops, obscuring the presence of any significant biclusters.
To rectify this behavior we can correct for sparsity. The main idea is to rescore the loops so that rank-1 loops consisting of entries that are otherwise abundant do not add much to the score, whereas rank-1 loops containing entries that are otherwise rare add more to the score. To describe this correction let's assume that our M × N data-matrix D is binary, with sparsity p (i.e., a fraction p of entries equal to +1, and a fraction q = 1 − p of the entries equal to −1). We set α = p − q to be the column mean of each column of D, and 1/δ = 4pq to be the inverse of the variance of each column. Using this notation we define: These scores 'normalize' D by subtracting the mean from each entry and then dividing by the standard-deviation. This normalization ensures that, when considering a large random matrix in the large M, N limit, the mean row-and column-score will remain close to 0 and the variance of the row-and column-scores will remain close to 2/ M N 2 and 2/ M 2 N , respectively (regardless of the fixed sparsity-coefficient p). Any rank-1 loop composed of common entries (e.g., the abundant negative entries in our earlier example) will only add a small amount to the total score, whereas a rank-1 loop composed of rare entries (e.g., positive entries) will add a larger amount to the total score. These numerical experiments illustrate the performance of our algorithm on a simple planted-bicluster problem involving a sparse matrix with sparsitycoefficient p = 0.0625. For each numerical experiment we let D be a large M × M random matrix with entries chosen independently and randomly to be +1 with probability p = 0.0625, and −1 with probability q = 1 − p. Within D we implant an ε-error m × m bicluster B1 with sparsity-coefficient 2p, and another ε-error 2m × 2m bicluster B2 with sparsity-coefficient p. These biclusters are both approximately rank-1, and are generated using the method described in section 5.2, with m, ε and the sparsity-coefficient given as parameters. Both biclusters are implanted randomly. Note that our illustration is slightly misleading; B1 and B2 will not necessarily be disjoint. Rather, because they are randomly chosen, the row-and column-subsets of B1 and B2 will typically have a small overlap. For the purposes of this simple numerical experiment the entries in this overlap are randomly assigned to either B1 or B2. For each numerical experiment we calculate the average auc A 1 for the bicluster B1, as well as the average auc A 2 for bicluster B2. Our goal is to find both B1 and B2, even in the presence of the abundant negative entries within D. Because of the tradeoff between size, noise and sparsity, there will be combinations of m, ε and p for which B2 is prioritized over B1, and vice-versa.
To be more specific, of the 16 different possible loops within a binary-matrix, 8 will be rank-1 and 8 will be rank-2 (recall Fig 20). The 8 rank-1 loops fall into three categories: (a) four negative entries, (b) two negative and two positive entries, and (c) four positive entries. These categories contribute, respectively, p 2 /q 2 , 1, and q 2 /p 2 to the summands above. The 8 rank-2 loops fall into two categories: (d) three negative entries and one positive entry, and (e) three positive entries and one negative entry. These categories contribute, respectively, −p/q and −q/p to the summands above.
To relate this to the planted-bicluster problem, let's consider an M ×N data-matrix D with sparsity-coefficient p < 0.5. Let's implant within D a rank-1 error-ε m×n bicluster B with a potentially different sparsity-coefficientp with p <p < 0.5. We'll generate B using the algorithm described in section 5.2. If ε √ m is not too large, the bicluster B will produce a row-sigal of approximately (m/M ) (n/N ) 2 η and a column-signal of approximately (m/M ) 2 (n/N ) η, where η is given by: ξ 4,0 = p 4 +q 4 , ξ 2,2 = 4 p 3q +p 2q2 +pq 3 , and ξ 0,4 = 2p 2q2 , andp andq = 1 −p given by the construction in section 5.2 (i.e., using sparsity-coefficientp). The terms ξ 4,0 , ξ 2,2 and ξ 0,4 represent the distribution of loops from categories (a), (b) and (c), respectively, from within the outer product uv ⊺ used in the construction of B. In the limit ε √ m → 0 and p ≪ 0.5, the signal-factor η can be approximated by: The overall signal produced by B depends on the size of B and the noise (as before), but also on the sparsity coefficientp of B, which influencesp and η. The signal-factor η typically grows withp and, ultimately, there is a tradeoff: a bicluster B1 which is small but has a higherp may produce a signal that is just as great as another bicluster B2 that is large but has a lowerp. For example, the bicluster B1 may have more rank-1 loops containing rare positive-entries; their presence causes the loops of B1 to count for more than the typical loops of B2. An example along these lines is provided in Figs 51 and 52. Sparsity arises in practice when attempting to analyze genotyped data; single-nucleotide-polymorphisms (SNPs) can involve minor-allele-frequencies (MAFs) that are quite small (e.g., 0. A B Figure 52: These numerical experiments illustrate the performance of our algorithm on a simple planted-bicluster problem involving a sparse matrix (see Fig 51). In panel-A we show the trial-averaged value of A 1 as a function of ε √ m and log M (m). In panel-B we show the trial-averaged value of A 2 . Note that our algorithm typically finds both B1 and B2, despite the abundance of negative entries in D. Depending on the parameters, either B1 or B2 can be prioritized. are very far from 0.5 (e.g., ≤ 0.01 or ≥ 0.99). Moreover, the different columns k of the data-matrix often correspond to different SNPs with different MAFs and hence different sparsity-coefficients p k . We can generalize our sparsity-correction to account for this scenario by defining: where α ∈ R N is the N × 1 vector of means α k = p k − q k , ∆ ∈ R N ×N is the diagonal matrix with entries ∆ −1 kk = 4p k q k , and 1 ∈ R M is the M × 1 vector of all ones. Both these row-and column-scores can be computed using two binary matrixmatrix multiplications as well as O (1) matrix-vector multiplications and O (max(M, N )) vector-vector dot-products (see section 12 for details).

Putting it all together
So far we've discussed modifications to our algorithm that account for cases-vs-controls, categorical-and continuouscovariates, as well as sparsity. One of our goals was to ensure that these modifications were not mutually exclusive; they can all be combined into one algorithm which accounts for all of the above. When combining these modifications we end up calculating a variety of sub-scores which contribute in various ways to our final row-and column-scores. Within the context of our methodology, we have chosen to perform the following steps in order: 1. Correct for sparsity.

Correct for continuous covariates.
3. Given I req , correct for categorical covariates.

Correct for cases versus controls.
Based on our experience, we believe that the sparsity correction should be performed first, because the sparsity correction alters the weight of each loop. After that, the correction for continuous-covariates should come before the correction for categorical-covariates; this ensures that each category refers to its own distribution of continuous-covariates. Finally, the correction for categorical-covariates should come before the correction for cases-vs-controls; this ensures that biclusters which are balanced across case-categories are not unfairly penalized by an imbalance across control-categories. Two examples which have motivated our choices are given in Fig 53. The main user-specified input-parameter is I req , which is used in Step-3 to dictate the number of covariate-categories a bicluster must extend across in order to be considered by the algorithm. Technically speaking, Step-2 also involves a parameter κ 2 , which dictates how broadly a bicluster must be distributed in continuous-covariate-space in order to be considered. While κ 2 could in principle be modified by the user, we typically set κ 2 so that, on average, biclusters produce no signal when they are drawn from only half of covariate-space (see Fig 47).
Before we describe our general procedure, we briefly review our notation. We'll consider a situation where the patients are segregated into I cat different covariate-categories (e.g., study, batch-number, recording location, etc). We'll use the post-script i to track the category-number. We'll denote by M Di and M Xi the number of case-and controlpatients in category-i. Each patient will also be associated with N different genetic-measurements. Each of these geneticmeasurements will, in general, give rise to its own sparsity-coefficient p k (for k = 1, . . . , N ) calculated across the entire patient population. The genetic-data for the case-and control-patients in category-i is contained in the data-matrices Di and Xi, of size M Di × N and M Xi × N , respectively. The matrix-entry Di (j, k) contains the k th -genetic-measurement for the j th -case-patient from category-i. The matrix-entry Xi (j, k) contains the same for the j th -control-patient. In addition to the genetic-data, each patient will also be associated with an N T -dimensional vector of continuous-covariates. These continuous-covariates are contained in covariate-matrices T Di and T Xi , of size M Di × N T and M Xi × N T , respectively. The row-vector T Di (j, t) for t = 1, . . . , N T contains the continuous-covariates for the j th case-patient in category-i. The row-vector T Xi (j, t) for t = 1, . . . , N T contains the same for the j th -control-patient. In the expressions below we'll use the dummy-variable 'Q' below to denote either D or X; e.g., M Qi could be either M Di or M Xi . We'll also use the notation 1 Qi to refer to the M Qi × 1 vector of all ones.
After binarizing all the matrices involved we calculate the sparsity coefficients: In Panel-A we show a situation with both continuous-and categorical-covariates. The continuous-covariates are drawn from an N T = 2 dimensional space (right side). There are two covariate-categories, male and female. The bicluster B shown extends across many males as well as many females. While the females in B have continuous-covariates drawn isotropically from R NT , the males in B all have continuous-covariates drawn from the same half-space H ∈ R NT (indicated by the white semicircle). If we were to correct for categorical-covariates before correcting for continous-covariates, the signal produced by B would be high, which is incorrect in this case. If, instead, we correct for continuous-covariates before correcting for categorical-covariates, the signal produced by B will be correctly reduced to zero. In Panel-B we show a situation with both categorical-covariates as well as cases-vs-controls. The bicluster B extends across both male-cases and female-cases, but only across male-controls. If we were to correct for cases-vs-controls before correcting for categorical-covariates, the score for B would be reduced to zero, even though B is balanced across the cases in a manner not exhibited across the controls. If, instead, we correct for categorical-covariates before correcting for cases-vs-controls, the signal produced by B will be high, as it should be.
Using these sparsity coefficients we define the probabilities q k = 1 − p k , the means α k = p k − q k , and the variances δ −1 k = 4p k q k . These means and variances allow us to define the N × 1 vector α and the N × N diagonal matrix ∆ with diagonal-entries δ k . We then compute the base sub-scores for each combination of Q = D and Q ′ ∈ {D, X}, for each pair of categories i, i ′ : as well as the covariate-dependent sub-scores for each t = 1, . . . , N T : Combining these sub-scores allows us to correct for continuous-covariates within each combination of Q, Q ′ , i, i ′ : where κ 2 = κ 2 (N T ) as defined in section 10. We then define sub-scores for each Q, Q ′ , i, i ′ .
We correct for categorical covariates by choosing the I req -largest element across i ′ for each row-sub-score, and the I 2 reqlargest element across i, i ′ for each column-sub-score (see section 9): Finally, we correct for cases-vs-controls by subtracting the control-sub-scores from the case-sub-scores (see section 8): An example illustrating this combination of corrections is shown in Figs 54 and 55. This example illustrates the performance of our algorithm on a planted-bicluster problem involving both an M × N case-matrix D as well as a 2M × N control-matrix X, where N = 3M . Within D and X, the case-and control-patients (i.e., rows) are randomly assigned to one of I cat = 3 categories, with M Di = M/3 and M Xi = 2M/3 respectively. In addition, we endow each patient with a randomly generated continuous-covariate vector of dimension N T = 2. The genes (i.e., columns) have varying degrees of sparsity; one-third of the columns have a sparsity-coefficient p = 0.125, one-third have p = 0.25, and one-third have p = 0.5.
Within these case-and control-matrices we implant several rank-1 ε-error biclusters, each with sparsity p = 0.25. The size of these implanted biclusters will depend on m and n, where n is related to m by ensuring that log Case-specific and balanced: The first bicluster we implant is an m × n bicluster B0 restricted to the cases. The bicluster B0 is implanted within D by choosing m rows and columns at random; consequently, B0 is a case-specific bicluster that is typically balanced across all 3 covariate-categories and continuous-covariates.
Case-specific and somewhat balanced: The next three biclusters we implant, B1 B2 and B3, are each of size 3m/2 × 3n/2 and are restricted to the cases. Each of these bicluster has its rows and columns chosen at random from D, with the restriction that B1 excludes category-1, B2 excludes category-2, and B3 excludes category-3; each of these three case-specific biclusters is balanced with respect to the continuous-covariates, but only extends across 2 of the 3 covariate-categories.
Case-nonspecific: The next bicluster we implant, B4, is of size 9m/2 × 3n/2, and extends across both the cases and controls. This bicluster has 1/3 of its rows drawn randomly from D and 2/3 drawn from X; B4 will be balanced with respect to the continuous-and categorical-covariates, but will not be case-specific (extending into the controls).
Case-specific but imbalanced: The final two biclusters we implant, B5 and B6, are each of size 3m/2 × 3n/2, and are restricted to the cases. Each of these biclusters has its rows and columns chosen at random from D, with the restriction that B5 be restricted to rows that have continuous-covariates drawn from a randomly oriented halfspace H ∈ R NT , and B6 be restricted to rows that have continuous-covariates drawn from the complement H c . Consequently, each of these two case-specific biclusters will be balanced with respect to the categorical-covariates, but not the continuous-covariates.
All of our biclusters have columns drawn randomly; each will typically overlap all of the different sparsity-domains in D and/or X. Note that the illustration in Fig 54 is misleading; the biclusters will not necessarily have disjoint columns (nor strongly overlapping rows). Rather, because they are randomly chosen, the row-and column-subsets of each bicluster will typically have a small overlap. For the purposes of this numerical experiment we implant each bicluster in the order described above: any overlapping entries will be overwritten by the next bicluster in the list (with the imbalanced biclusters taking precedence).
For each numerical experiment we calculate the average auc A 0 for the bicluster B0, as well as the average auc A 1 for bicluster B1 and so forth. In our first numerical experiment shown on the left of Fig 55 we set I req = 3; We are looking for biclusters that span all of the covariate-categories, and we are not interested in finding B1 B2 or B3. With this choice of I req = 3, our goal is to find B0 despite the masking effects of the other biclusters. In our second numerical experiment shown on the right of Fig 55 we set I req = 2; We are looking for biclusters that span at least 2 of the 3 covariate-categories. With I req = 2 we are satisfied with either B0 B1 B2 or B3, but we still don't want to find B4 B5 or B6, since these are either case-nonspecific or imbalanced with respect to the continuous-covariates.
2-sided versus 1-Sided continuous-covariate correction: As we alluded to earlier, it is possible to correct for continuous-covariates in step-2 by using a '1-sided'-correction, rather than the 2-sided correction described above. These 1-sided corrections involve altering the covariate-dependent sub-scores: Note that these covariant-dependent subscores depend on [T Q ′ i ′ ] j ′ ,t , but not [T Qi ] j,t . These 1-sided-subscores are then combined:  In panel-B we show the trial-averaged value of A 1 , A 2 and A 3 . In panel-C we show the trial-averaged value of A 4 , A 5 and A 6 . The left side of each panel displays results with I req = 3, the right side with I req = 2. Note that, when I req = 3 the goal was to find B0, as this was the only case-specific bicluster that was balanced across all the continuous-covariates and covariate-categories. On the other hand, when I req = 2 the goal was to find the most dominant of B0 B1 B2 and B3, as each of these case-specific biclusters were balanced across all the continuous-covariates while extending across at least 2 covariate-categories. Note that our algorithm typically achieves these goal, despite the presence of the other biclusters B4 B5 and B6. and contribute to the sub-scores for each Q, Q ′ , i, i ′ as follows: As mentioned earlier in section 10, these 1-sided corrections don't change the row-scores much, but do alter the columnscores. Under the 1-sided correction the column-subscores Z 1s COL now capture the norm (rather than the normsquared) of the averaged continuous-covarate-vector accumulated across any biclusters intersecting column-k.
With regards to case-specific biclusters, both the 2-sided and 1-sided continuous-covariate corrections accomplish the same general goal. That is to say, both correction schemes reduce the score of case-specific biclusters that are imbalanced with respect to the space of continuous-covariates, while leaving unaffected the scores of case-specific biclusters that are balanced in covariate-space (at least to to first order). However, these two correction-schemes differ in how they treat certain kinds of biclusters that straddle both cases and controls. The 2-sided correction penalizes biclusters that include a comparable number of cases and controls, regardless of how those cases and controls are distributed in covariate-space. The 1-sided correction also penalizes biclusters that include a comparable number of cases and controls, but behaves differently in one important situation: The 1-sided correction does not penalize biclusters that include both cases and controls as long as (i) the cases are balanced in covariate-space, and (ii) the controls are imbalanced in covariate-space. An illustration of this difference is given in Fig 56. These two different strategies for continuous-covariate correction are useful in different applications. For example, the 1-sided correction is useful if we want to discover a covariate-independent case-specific signature from amongst a heterogenous group of controls. This kind of correction may be appropriate if we expect many different regions in covariate-space to each give rise to their own biclusters (which we should then need to correct for). On the other hand, the 2-sided correction is more conservative, and is more useful for highlighting case-specific biclusters that are robust to changes in the control population. This kind of correction may be appropriate if we want to isolate biclusters that remain significant when compared with arbitrary groups of controls (rather than merely covariate-balanced groups of controls); this kind of goal can be relevant for developing robust risk-prediction models.
Role of binarization: In the prescription above we choose to binarize T . This binarization allows for a more efficient calculation (see next section), but somewhat compromises the accuracy of the covariate-correction. More specifically, after binarization, the covariate-dependent subscores [Z ] k ) only approximate the magnitude (resp. magnitude-squared) of the averaged-covariate-vector associated with any biclusters including row-j (resp. columnk). Once we binarize T , then this averaged-covariate-vector will depend not only on whether or not the relevant biclusters are imbalanced with respect to any half-space H, but also on the orientation of that half-space. For example, assuming the existence of a bicluster B that is fully imbalanced with respect to H, it is possible for H to be oriented in such a way that the covariate-correction does not fully demote the scores associated with B (e.g., the normal vector defining H can lie along a coordinate axis in covariate-space).
To avoid this scenario, and to ensure that all imbalanced biclusters are appropriately penalized, we could have included multiple versions of T , each randomly rotated (in covariate-space) before binarization. With this redundancy, we could calculate the covariate-dependent sub-scores (referred to in the previous paragraph) for each version of T , taking e.g., the highest magnitude across the versions. One issue with this approach is that there is a trade-off between speed and accuracy; the more versions of T we include, the more accurate an estimate we can produce, but the longer it takes to produce this estimate. At some point it becomes more efficient to simply skip the T -binarization step altogether, and just include the full covariate-vector for T .
For our GWAS example at the beginning (see section 1.4), as well as for our numerical-examples throughout, we've seen that these more sophisticated correction techniques are unnecessary; the simplest covariate-correction has sufficed. That is to say, we've binarized T , and only used a single version (with no additional random rotations).

Notes regarding computation
As we touched on earlier, the binarization step in our algorithm sets the stage for the efficient storage and manipulation of matrices and vectors. Many of the laborious computations involved in calculating the row-and column-scores can be accomplished without the use of floating-point operations. As an example, consider the row-sub-score Z Q,Q ′ ; i,i ′ ; [t] ROW j , with i = i ′ , and Q = Q ′ = D. In the following equations we'll refer to the M Di × 1 vector 1 Di as simply 1, and suppress the post-scripts i and i ′ to ease the notation. We'll use matlab conventions for referring to indices: D j,: and D :,k refer to the j th -row and k th -column of D, respectively. We'll also denote the entry-wise product as '. * ', and the entry-wise power as '.ˆ'. I.e., [Q. In each situation we illustrate a data-matrix containing cases D and controls X. We assume that this data-matrix contains a bicluster B which includes both a case-specific and control-specific component of comparable size. We'll focus on the differences that result when these components of B occupy different regions in covariate-space. For ease of illustration, we've divided covariate-space into a half-space H (white), and its complement H c (grey). Each of the four panels illustrates a different arrangement; the case-and control-specific components of B are either balanced or imbalanced with regards to the continuous-covariate. Panel 1 (Case Balanced): When the case-component is balanced but the control-component is imbalanced we see the greatest difference between the 2-sided and 1-sided scheme; the former fully demotes the columnscore of B (reducing it to 0), whereas the latter does not penalize B at all. Panel 2 (Both Balanced): When both components of B are balanced, both the 2-sided and the 1-sided schemes fully demote the column-score (reducing it to 0). Panel 3 (Both Imbalanced): When both components of B are imbalanced, both the 2-sided and the 1-sided schemes fully demote the column-score (reducing it to 0). Panel 4 (Control Imbalanced): When the control-component is balanced but the case-component is imbalanced, both the 2-sided and the 1-sided schemes greatly lower the column-score, reducing it to a negative value.
The sub-score Z can be calculated by first computing the M D × M D correlation-matrix 'C', defined via: and forming the combination: All the terms in these expressions can be rearranged to only involve matrix-vector products, with the exception of D∆D ⊺ in the calculation of C.
The matrix D is binary, and therefore a matrix-matrix product such as DD ⊺ can be calculated using a binary matrixmatrix product. This binary matrix-matrix product does take O M 2 N operations, but can be computed rapidly by appealing to bitwise operations rather than floating-point operations. These bitwise operations typically allow for multiple matrix entries to be processed simultaneously. For example, if u = D j,: and v = D j ′ ,: are two binary row-vectors from D, then the dot-product u ⊺ · v can be calculated by first breaking up u and v into disjoint chunks, each comprising R adjacent bits: and then using bitwise operations on each pair of chunks u r , v r : In the above calculation we assume that R is the number of bits that can be conveniently loaded at once (e.g., R = 128 * (16 − 1) = 1920), and that N is a multiple of R (or that u and v have been extended in length and appropriately masked or padded). The xor operation refers to a bitwise 'exclusive-or', and the popcount operation sums the number of nonzero bits. Our situation is slightly more complicated than the simple example above because the diagonal matrix ∆ is not binary; rather than simply forming u ⊺ v, we have to form u ⊺ ∆ v, which cannot be easily computed using only bitwise operations. Fortunately, in practice the size of ∆ is usually very large (e.g., N 10 5 ), and the variances δ −1 k = 4p k q k are bounded above by 1 and usually bounded below by a constant greater than 0. We take advantage of this observation by sorting the columns of D so that the diagonal entries of ∆ are monotonically ordered. We then approximate each block of R entries of ∆ by a single constant equal to their average: and then calculate: where N r is the number of column-indices in chunk r. This procedure allows us to generate D∆D ⊺ , and ultimately C, by using O M 2 N/R bitwise-operations, rather than O M 2 N floating-point multiplications.
The column-sub-score Z has a similar structure, and can be calculated by first computing and then forming the combination: As in the case with the row-sub-score, all the expressions within this column-sub-score involve matrix-vector products, with the exception of D∆D ⊺ in C, which requires a matrix-matrix product.
Re-Binarization of C: We pause to remark on the calculation of In practice N is typically much larger than any M Qi ; i.e., for genome-wide-association-studies (GWAS) the total number of genetic-measurements (i.e., SNPs) can be on the order of 10 5−6 or more, while the total number of patients is only on the order of 10 3−4 . This implies that, even though calculating D ⊺ k,: diag(T ) Cdiag(T :,t ) D :,k for each k is technically O (N M ) floating-point operations, this calculation can often be the bottleneck for our entire algorithm. In these cases it is advantageous to take advantage of the fact that both D and T are binary-matrices. Although C is not a binarymatrix itself, we can represent C using a collection of b binary-matrices (one such binary-matrix for each significant bit of C we wish to preserve). This kind of representation allows us to calculate Z temp; [t] COL using b binary matrix-matrix calculations, each requiring O N M 2 /R bitwise-operations. We find that this procedure is often more efficient than the direct calculation of Z

temp; [t] COL
, and is sufficiently precise so long as we preserve sufficiently many bits of C (i.e., if b is sufficiently high). When the sparsity-coefficients p k are all close to 0.5, such as in gene-expression-analysis, then 8 bits of C are typically sufficient (i.e., we choose b 8). In cases where the sparsity coefficients p k extend from close to 0 up to 0.5, such as in GWAS, more bits of C are necessary, and we typically choose b 24. This re-binarization can be applied not only to the matrix C that we introduced at the beginning of section 13, but also to the other correlation-matrices involved in the calculation of Z which have a similar structure to C.
More rows than columns: Some applications involve data-matrices D and X with more rows than columns (e.g., M Q > N ). In these situations the calculations above are not efficient; they require O(M 2 Q N ) work, when O(M Q N 2 ) work would be preferable. To arrange the calculation so that only O(M Q N 2 ) work is required, we can first calculate the N × N matrix E: and then compute Z Parallelization and Timing: The full algorithm described in section 12 involves both case-and control-matrices, as well as I cat covariate categories and N T continuous-covariate coefficients. There are many different ways to organize this calculation, with different considerations pertaining to different computing architectures. Our current implementation at the time of this writing uses different nodes to process the different trials (e.g., the original data as well as label-shuffledtrials from H0x). For each trial running on a given node, we use different processors within the node to calculate the scores associated with different combinations of Q ′ , i, i ′ , and t. This implementation is certainly not always optimal (e.g., in the D-only case), but is reasonably efficient for many applications. For example, the GWAS data-set shown in Example-3 was processed over a weekend; the computation used 65 nodes (one for the original-data, and 64 others for the label-shuffled-trials from H0x), each with 16 processors (Intel Xeon Processor E5-2650 v2). The total computation-time per-node was 12 hours, with γ set to 0.05 and b = 32 bits retained for each correlation-matrix. If, instead, we had only retained b = 8 bits (which is sufficient for many problems) and not corrected for the N T = 2 continuous covariates, then the total computation-time per-node would have been reduced by a factor of ∼ 12. If necessary, a further reduction in computation-time could be obtained by increasing γ. row-trace for original-data (red) and H0 (blue) 95th %ile median 5th %ile Figure 57: Row-traces for the bicluster shown in Example-1b from section 1.2. This bicluster was found by running our loop-counting algorithm on the data shown in Fig 5. For this example we corrected for cases-vs-controls; we compare our original-data to the distribution we obtain under the null-hypothesis H0. On the left we show the row-trace as a function of iteration for the original-data (red) as well as each of the 256 random shuffles (blue). On the right we replot this same trace data, showing the 5th, 50th and 95th percentile (across iterations) of the H0 distribution. Because we are not correcting for any covariates, the column-traces are identical to the row-traces.

Notes regarding application
Our biclustering algorithm produces three lists as output. The first lists the rows and columns of the data-matrix in the order in which they were eliminated. The second lists the average row-score taken across the remaining data-matrix at each iteration. The third lists the average column-score taken across the remaining data-matrix at each iteration. We refer to these average row-and column-scores as row-and column-traces; they are both equal to the trace of DD ⊺ DD ⊺ in the D-only situation, but will in general be different when correcting for cases-vs-controls and covariates. In practice it is often critical to determine • How to interpret the algorithm's output, • How significant this output is, • Whether or not a bicluster was found, and what the boundaries of the bicluster are.
• How to find the second, third, and subsequent biclusters.
The answers to these questions are, in general, application dependent. We comment on some of our experiences below.

Interpreting the output:
We have run our algorithm in the context of gene-expression analysis and genome-wide-association-studies (i.e., GWAS). In both cases we were comparing case-patients to control-patients, at times correcting for various covariates. In the context of gene-expression-analysis we had the luxury of normalizing the data so that the sparsity-coefficient for each column was p k = 0.5. This choice of normalization meant that we did not need to correct for sparsity; each α k = 0 and δ k = 1. Because each loop contributed either a +1 or a −1 to our total score, the row-and column-traces produced by loops within D were bounded between −1 and +1. Similarly, the traces produced by loops entering X were also bounded between ±1. The full trace produced by our algorithm was thus bounded by ±2, and was usually in the interval [0, 1]. We could interpret the output in this situation relatively easily: If, after sufficiently many iterations, the row-or column-traces grew large (i.e., came close to +1), then the remaining rows and columns should form a highly correlated low-rank bicluster. An illustration along these lines is shown in Figs 57 and 58, which corresponds to our Example-1b for gene-expression-analysis. A similar illustration is given in Figs 59 and 60, which corresponds to our Example-2 for gene-expression-analysis (which involves categorial covariates). Similarly, Figs 61 and 62 correspond to Example-1a.
For GWAS the same interpretation is not valid; the GWAS data is itself binary because each allele is either dominant or recessive. This means that we do not have the luxury of normalizing the data to alter the sparsity-coefficients; typically the sparsity-coefficients for genotyped-data range from close to 0 all the way up to 0.5. This range of sparsity-coefficients makes interpreting the output of our algorithm more difficult. It is now possible for any loop to contribute a large magnitude to the total scores. The row-and column-traces are no longer bounded by a small number like ±1 or ±2, and often rise to be much higher towards the end of the iterations when the few remaining columns have extreme sparsity-coefficients. In The original-data is indicated with a '⊗', and each of the random shuffles with a colored '•'. The p-value for any point w in this plane is equal to the fraction of label-shuffled-traces that have either an x-position larger than x w or a y-position larger than y w , where x w and y w are the x-and y-percentiles associated with the most extreme coordinate of w (details given in section 14.2). Each random shuffle is colored by its p-value determined by the label-shuffled-distribution. By comparing the original-trace with the shuffled-distribution we can read off a p-value for the original-data of 0.008. For this example we corrected for cases-vs-controls as well as the categoricalcovariates of gender and ethnicity. Because we corrected for categorical-covariates, we compare our original-data to the distribution we obtain under the null-hypothesis H0x. On the left we show the row-trace as a function of iteration for the original-data (red) as well as each of the 2048 random shuffles (blue). Note that the different traces terminate at different iteration-numbers. This is due to the fact that different trials eliminate the various covariate-categories at different iteration-numbers; for each trace we only include iterations for which all covariate-categories remain. On the right we replot this same trace data, showing the 5th, 50th and 95th percentile (across iterations) of the H0x distribution. While not exactly identical to the row-traces, the column-traces are very similar (not shown).  . This bicluster was found by running our algorithm on the GWAS data-set referenced in 1.4. For this example we corrected for cases-vs-controls, as well as continuous-covariates and sparsity. Because we corrected for continuous-covariates, we compare our original-data to the distribution obtained under H0x. Because the data-set had many columns with extreme sparsity-coefficients, the magnitude of our row-and column-traces is not particularly meaningful by itself. A more meaningful measurement is the z-score of the row-and column-traces (calculated with respect to the label-shuffled distribution). On top we plot this z-score as a function of the number of rows remaining. The original-trace is shown in red, and the 64 random shuffles from H0x are shown in blue. On the bottom we show a scatterplot of the maximum z-score (horizontal) and average z-score (vertical) taken across the iterations. The format for this scatterplot is similar to that shown in Fig 58, except that the original-trace lies far to the right of the distribution drawn from H0x (see arrowhead). By comparing the original-trace to the shuffled-distribution, we can read off a p-value of 1/64 (i.e., bounded above by the reciprocal of the number of samples drawn from H0x).
this situation the most natural way to determine whether a low-rank structure has been found is to compare the traces for the original data to traces obtained after label-shuffling the data (see the next subsection).

Determining significance:
Our typical hypothesis (say, H1) is that there exists some disease-related structure in the case-patients that is not exhibited by the control-patients. This is in contrast to the null-hypothesis (H0) in which the case-and control-labels are actually arranged randomly and have no disease-related structure. Under this null-hypothesis the traces we observe after running our algorithm on the original data should be similar to the traces we would find if we were to shuffle the case-control labels randomly (across patients). To draw a sample from this null-hypothesis H0, we shuffle the case-control labels of the patients indiscriminantly while retaining the same number of cases and controls. If we are correcting for covariates, then we restrict our null-hypothesis slightly (H0x) so that the random-shuffles respect the covariates. For example, if we were correcting for gender as a categorical-covariate, we would shuffle the labels of case-males only with control-males, and shuffle the labels of case-females only with control-females. If we were correcting for a continuous-covariate, we would shuffle the case-control labels of patients in each orthant of covariate-space. For each label-shuffled trial we rerun our algorithm, and collect the output. Each trial does not depend on any of the other trials, and they can all be processed in parallel. This library of label-shuffled traces produces a distribution associated with H0 (or H0x). We use this label-shuffled distribution to calculate a p-value for the traces produced by the original data.

Example-1 and Example-2:
The row-and column-traces each have one value per iteration. Two of the most straightforward measurements for each trace are (i) the maximum value, and (ii) the average value, with the latter computed using a measure where each trace-value contributes a weight proportional to the number of rows or columns eliminated during that iteration. When considering gene-expression-data these two measurements are useful and have a natural interpretation. On one hand, (i) traces that reach a high maximum during a late iteration correspond to small but tightly-correlated biclusters. On the other hand, (ii) traces that may not have an exceptionally high maximum value but are consistently above the median for many consecutive iterations correspond to larger but more loosely-correlated biclusters. The biclusters shown in Example-1a, Example-1b and Example-2 illustrate the first phenomenon (i), as shown in Figs 61, 57 and 59.
We can compare our original-data with the label-shuffled-distribution to obtain a p-value for our original-trace. We have done this using the 2-dimensional scatterplots shown in Figs 62, 58 and 60. For illustration, consider Fig 60. Each of the L = 2048 label-shuffled-traces has an x-and a y-position on this scatterplot. We'll denote the position of trace-l by (x l , y l ). Let's also denote by F (x) the fraction of label-shuffled traces with an x l < x, and by G (y) the fraction of label-shuffled traces with a y l < y. Let's further denote by x (F ) the x-position of the F th -percentile of the x l samples (i.e., x (F ) is the inverse of F (x)). Let's also denote by y (G) the y-position of the G th -percentile of the y l samples (i.e., y (G) is the inverse of G (y)). Now we can calculate the p-value of any location w = (w x , w y ) in the x, y-plane as follows.
1. Calculate F (w x ) and G (w y ), 2. Let H = max (F (w x ) , G (w y )) be the more extreme of the two. 5. Set the p-value for w to be the fraction of label-shuffled-traces in Ω: i.e., p-value = p (w x , w y ) = |Ω| /L.
We remark that this method of calculating p-values is 'self-consistent'; the distribution of p-values across the labelshuffled traces will be uniform by construction. Specifically, the collection of p (x l , w l ) taken across l will be uniformly distributed in [0, 1] and the empirical cdf of these p-values will be a diagonal line with slope 1. The more standard bonferroni-correction p = max (F (w x ) , G (w y )) is an overestimate and will not produce a uniform distribution of p-values across the label-shuffled traces when F and G are correlated.
Note that the label-shuffled null-hypothesis H0x is 'nested-within' (and more restrictive than) the null-hypothesis H0. Specifically, the collection of label-permutations allowed under H0x is a subset of the label-permutations allowed under H0. The label-shuffles which mix patients in different covariate categories (allowed under H0) are typically less likely to exhibit the same structures as the label-shuffles allowed under H0x. As a result, a smaller fraction of the trials drawn from H0 are likely to 'look like the real data', as compared with H0x. Consequently, the p-value obtained from the H0-distribution will typically be smaller (i.e., indicating a more significant result) than the more conservative p-value obtained using the more reasonable H0x-distribution.

Example-3:
When considering GWAS data the maximum and average of the trace are not useful measurements because of the varying sparsity-coefficients in the data. Because of these varying sparsity-coefficients, the trace can take on values far outside the interval [−1, 1], as is typical when there are only a few iterations remaining. Standard measurements such as the max-and average-trace are strongly influenced by the values of the trace during the final few iterations. To account for this bias, we transform each trace by replacing its value at each iteration with its z-score calculated with respect to the distribution of the label-shuffled-traces at that iteration. The max and average of the transformed-traces have an analogous interpretation to the original measures (i) and (ii) described earlier, except that the correlations revealed should be interpreted relative to the label-shuffled distribution, rather than in an absolute sense. Traces that reach a high z-score during a late iteration correspond to biclusters which, although small, are more tightly-correlated than the typical small bicluster seen at that iteration. Traces that have a high average z-score correspond to biclusters that are larger, but somewhat less correlated. The bicluster shown in Example-3 illustrates this latter phenomenon, as shown in Fig 63. 14.3 Delineating a bicluster: Example-1 and Example-2: One way to visualize the structure highlighted by the biclustering algorithm is to calculate a 'shape-matrix' S. To create S we first order the rows and columns of the original data-matrix in terms of their importance (i.e., the longest retained rows and columns come first). With this ordering we define S jk as the average trace of the 'corner-submatrix' obtained by considering the j most important rows and the k most important columns. This calculation can be performed for each j and k, revealing the subsets of rows and columns for which the trace is high. The decay of S jk as j and k increase will indicate the shape of the structure found. We llustrate some shape-matrices in Figs 64 and Figs 65, which correspond to our Example-1a and Example-1b. The shape-matrix in Fig 66 corresponds to our Example-2, while the shape-matrices in Fig 67 correspond to some other biclusters we have found in other gene-expression data-sets.
We note that, towards the end of our iteration, there are not always sufficiently many patients remaining to represent all the covariate-categories (e.g., in Fig 66 I cat = 4). When the number of remaining covariate-categories drops (e.g., when the last noncaucasian female is eliminated), our algorithm reduces the number of covariate-categories I cat involved in the score-calculation, and reduces I req as necessary. This abrupt change in I cat and I req results in the discontinuities of S in the j-direction seen in Fig 66B. When defining p-values and delineating the bicluster we only consider iterations where all the initial covariate-categories are represented.
To actually delineate a bicluster we invoke a user-specified threshold τ ∈ (0, 1); in this case τ = 0.7. Given τ , We find the j, k for which S jk is equal to the fraction τ of its maximum, and jk is as large as possible. In other words, we find the largest corner-submatrix with average-trace τ S max . As an example, the biclusters shown in Figs 64, 65 and 66 were delineated using this technique.
This procedure (i.e., picking out a corner-submatrix) does indeed do a reasonable job of delineating rectangular biclusters when they exist, and is not very sensitive to τ when the rectangular biclusters are sharply defined. Nevertheless, this kind of delineation does not always tell the whole story; corner-submatrices are not always the best indicator of the various sorts of structure that can be revealed by S in practice. For example, as seen in Figs Figs 64 and 67, our algorithm can often reveal patterns within gene-expression data that are not quite rectangular. These shape-matrices may indicate some of the more complicated underlying structures discussed in section 7.3. Thus, in summary, it is important to note that bicluster-delineation will always be somewhat arbitrary; in practice there will rarely be a clear-cut boundary to any bicluster. Because of this arbitrariness, we have structured our methodology so that the details regarding bicluster-delineation do not affect the bicluster's p-value (which only depends on the traces, and not the threshold τ ).

Example-3:
The S-matrix described above is useful when the goal is to pick out a small sharply-correlated bicluster -i.e., when the traces output by our algorithm are significantly high towards the end of the iterations. However, if the output-traces are not high towards the end of the iterations, then this S-matrix may not be particularly meaningful. This situation arises in Example-3, where the output-traces are significantly higher than the H0x-distribution for the first several iterations, but not towards the end (see Fig 63). If we were to try and use the S-matrix to to delineate this bicluster, then we would need to compare the S jk produced by the original-data with the S jk produced by the random-shuffles from H0x; the former would only be significantly greater than the latter when j is in the thousands and k is in the hundreds-of-thousandsclearly not a sharp delineation of any kind. Thus, for Example-3, we choose to delineate the bicluster by using the last peak of the output-trace relative to the H0x-distribution (i.e., the last peak of the z-score). This final peak occurs with 115 case-patients and 706 alleles remaining (i.e., the submatrix shown in Fig 11).

Finding secondary biclusters:
So far we have been concerned with finding a single bicluster hidden within a large data-matrix. In practice of course, many large data-sets are home to multiple biclusters, some of which are distinct and some of which overlap to varying degrees. . . , k}. The entry of S for j = 65 and k = 3984 corresponds to the submatrix within the cyan box, and is indicated with a cyan '×'. This shape-matrix reveals a rather smeared-out bicluster; while there are many rows and columns that clearly participate, the shape of the bicluster is not obviously rectangular. Consequently, the delineation of a 'corner-submatrix' will depend somewhat sensitively on the methodology used (e.g., the particular value of τ we use to choose the corner). Various shape matrices for other biclusters Figure 67: Illustration of some other shape-matrices we obtained when running our algorithm on various gene-expression data-sets. Some of these shape-matrices suggest rectangular biclusters, but others suggest more 'smeared-out' structures, some similar to the kinds of structures discussed in section 7.3.
What does our algorithm do when the data-set has multiple biclusters, and how do we find these multiple biclusters? As we've discussed above in section 4, our algorithm usually finds the most 'dominant' bicluster within the data; that is, the bicluster that produces the greatest signal (i.e., the largest and/or most tightly correlated bicluster). This first most dominant bicluster will often be highlighted by the corners of the shape matrix 'S' described in section 14.3. In some cases secondary biclusters will also be highlighted by the shape matrix, trailing the first (see Fig 33).
Situations as clean as Fig 33 don't usually arise in practice, however, and we've found it useful to search for secondary biclusters using repeated passes of our algorithm. More specifically, after running our algorithm for the first time (and delineating the most dominant bicluster), we then rerun our algorithm from scratch, scrambling the data within the first detected bicluster to destroy any signal it might create. This second pass of our algorithm will then locate the second most dominant bicluster. After this we can perform a third pass, scrambling the first two biclusters and so forth.
Note that when we scramble the data within the biclusters we have found so far, we should respect the constraints compatible with our null hypothesis (e.g., H0, H0x, etc.). For example, if we were considering categorical-covariates we should scramble the data within a bicluster by randomly interchanging/replacing matrix-entries within (but not across) covariate-categories. This procedure ensures that we destroy (most of) the signal associated with the previously discovered biclusters without introducing any large new spurious signals.
We remark that secondary passes of our algorithm only rarely produce traces that are as high as those produced by the first pass (after all, the secondary biclusters typically have a weaker signal than the first). Nevertheless, we currently believe that it is best to assess the p-value of secondary biclusters by comparing their traces with the original shuffleddistribution of traces used to assess the most dominant bicluster (rather than with a new shuffled-distribution associated with second passes of our algorithm). We adopt this methodology because it is conservative: we use the same standard to assess both secondary and primary biclusters, even though this standard may underestimate the significance of secondary biclusters.
The results of this approach for Example-2 are shown in Fig 68. Note that the first trace is the most pronounced. This phenomenon is typical for the gene-expression data sets we have analyzed; while certainly not impossible, it is rare for a secondary bicluster to be more significant than a previously found bicluster.

Additional Applications
We have focused so far on the detection of low-rank biclusters. The simplest situation involves locating a 'rank-1' bicluster B hidden within an M ×N data-matrix D. Because the bicluster B can be approximated with a single outer-product u·v ⊺ , we have chosen to consider loops (i.e., 2 × 2 submatrices) within D. The low-rank structure of B ensures that loops within B are likely to be rank-1 (rather than rank-2). Moreover, the rows and columns of D that intersect B will contribute to many more rank-1 loops than the rows and columns that don't. Using these observations, our algorithms accumulates the loops within D; exposing B in the process. We've also discussed how to search for low-rank biclusters within a data-matrix involving cases and controls, considering along the way how to account for sparsity as well as categorical-and continuous-covariates. Our methods can be easily extended to tackle many related problem. For example, we can treat 'genetic controls', look for 'rank-0' biclusters (i.e., differentially-expressed biclusters), and even look for 'triclusters'. We discuss these topics briefly below.

Accounting for genetic-controls:
In some applications the goal is to find biclusters that are not merely specific to a subset of case-rows, but also to a subset of case-columns. Such a situation may arise in gene-expression analysis when the goal is to find biclusters restricted to one set of genes that do not extend to encompass the gene-expression measurements taken from another set of genes. As an example, let's return to the Example-2 in section 1.3. The particular bicluster we already found in section 1.3 involved many genes and pathways that are already known to affect colon-cancer; these genes are involved in many pathways related to DNA-replication, DNA-repair, mitosis, etc. Instead of focusing on these well-known pathways, we might want to deliberately ignore them and search for signals within other genes; hopefully discovering genes that participate in less well understood pathways.
One way to go about this is to classify some of the genes in our data-set as 'control-genes'; i.e., genes that we deliberately want to avoid when searching for low-rank structure. For this example we'll use as genetic-controls a list of genes that are already well-recognized to affect colon-cancer (see the discussion towards the end of sec 1.3). To recap: we generate our set of control-genes by first choosing three of the most well documented genes which influence colon-cancer: MLH1, MSH2 and MSH6. Our second step involves using 'Seek' (see [M42]) to generate a list of genes that are commonly co-expressed alongside these three cancer-related genes. We then define our control-genes to be all the genes in this list which have a combined co-expression value (with MLH1, MSH2 and MSH6) of at least 0.75. The control-gene set generated this way comprises 1659 of our original 17941 genes. As we alluded to above, this control-gene set is strongly enriched for DNA-replication, DNA-repair, mitosis, etc. (e.g., p-values on the order of 1e-20 or below); these are the pathways we want to ignore. We'll classify the remaining 16282 genes as 'case-genes', and search for a bicluster within these case-genes that does not exhibit a low-rank structure that is shared by a significant fraction of the control-genes.
We can divide our measurements into the following data-matrices: • the M D × N D case-matrix D: The entry D j,k is the k th case-gene measurement from the j th -case patient.
• the M X × N D control-matrix X: The entry X j,k is the k th case-gene measurement from the j th -control patient.
• the M D × N Y control-matrix Y : The entry Y j,k is the k th control-gene measurement from the j th -case patient.
• the M X × N Y control-matrix W : The entry W j,k is the k th control-gene measurement from the j th -control patient.
Our goal will be to find co-expressed subsets of genes within the case-population; we would like to find a low-rank bicluster within D. As before, we are most interested in biclusters which are case-patient-specific and don't extend to include many control-patients from X. In addition, we would also like to focus on biclusters which are case-gene-specific, and don't extend to include many gene-expression measurements from the control-genes that are commonly co-expressed with MLH1, MSH2 and MSH6 (i.e,. Y ). Put another way, we are interested in finding co-expressed subsets of genes that are (as best as possible) isolated to the case-patient-population and restricted to the 'undiscovered' subset of case-genes.
In order to accomplish this goal, we can extend the techniques used above, counting loops as before. This time, however, we'll also count loops that go through D and Y , as well as loops that go through all four data-matrices D, X, W and Y . Similar to the situation with both cases and controls (see section 8), we'll flip the sign on loops passing through X, and also flip the sign of loops passing through Y . Consequently, we'll end up positively counting loops passing through W . One way of understanding this choice of signs is that (i) we want to reward low-rank loops that are fully contained in D, and (ii) penalize low-rank loops that extend from D to X or from D to Y . The reason we want to reward low-rank loops that pass through all four data-matrices (i.e., through W ) is that we want to ensure that biclusters are not unfairly penalized when the data-set has ambient correlations. Put another way, we would like the average row-and column-scores of random data (with no planted biclusters) to be 0. We choose the sign associated with the W -loops so that this average score remains 0 even when the random data is drawn from a non-isotropic gaussian (rather than an isotropic gaussian).
The row-scores built using this intuition take the following form (see Fig 69): and are combined as follows: The column-scores, on the other hand, take the following form: and are combined as follows: A simple example illustrating this scenario is shown in Figs 70 and 71, where we consider a variation on the plantedbicluster problem involving both a planted case-specific bicluster and various planted non-specific biclusters involving both the patient-and genetic-controls.
When we actually carry out this method on the data from section 1.3 we do indeed find a significant bicluster, shown in Figs 72-75. Not only is this bicluster significant with respect to the label-shuffled hypothesis, but it is also enriched strongly for many different pathways. The most strongly enriched pathways include substrate-specific channel-activity (p=3e-8), passive-transmembrane-transporter-activity (p=1e-7) and cation-transport (p=1.5e-6); all of which are distinct from the most prominent pathways affiliated with the control-genes (as desired).  Figure 69: Illustration of the different kinds of loops that contribute to the row-score in the presence of genetic-controls.
In the left panel we show some loops which contribute to the row-score; on the right we show some loops which contribute to the column-score. In each panel we illustrate an M D × N D case-matrix D (upper left) and an M X × N D control-matrix X (bottom left). In addition, we illustrate an M D × N Y matrix of genetic-controls Y (upper right) and an M X × N Y matrix of genetic-controls for the control-patients W (bottom right). Some of the loops are contained entirely within D, some extend from D to X, some extend from D to Y , and some extend to include D, X, W and Y . We count the first and last type of loop positively if they are rank-1, and negatively if they are rank-2. We count the second and third type of loop positively if they are rank-2, and negatively if they are rank-1.

Searching for 'rank-0' (i.e., differentially-expressed) biclusters:
As mentioned above, our algorithms accumulate the loops within D. While conceptually straightforward, the accumulation of these loops requires a matrix-matrix multiplication, limiting the efficiency of our approach (i.e., requiring at least O(max(M, N )M N ) work). If, instead of attempting to find generic rank-1 biclusters, we restrict our search to a particular kind of rank-1 bicluster, we can overcome this limitation and make our algorithms even more efficient (e.g., requiring only O(M N ) work). Specifically, we can greatly reduce the run-time of our algorithm if we decide to search only for special rank-1 biclusters B ∼ 1 · v ⊺ ; i.e., where the column-vector 'u' has been replaced by a vector of all-ones. These biclusters are characterized as follows: each column of the bicluster is either high (i.e., mostly above the median for that column) or low (i.e., mostly below the median for that column). This means that, after binarization, the 'high' columns of the bicluster are mostly '+1', and the 'low' columns are mostly '−1'. Put another way, the rows/columns of the bicluster clump together near a single point in high-dimensional space; the bicluster is effectively rank-0. These biclusters can also be thought of as 'differentially-expressed': the genes in the biclusters are each differentially-expressed when comparing the patients in the bicluster to the rest of the population.
To detect these rank-0 biclusters we don't actually need to consider full loops; simpler half-loops will suffice (see Fig  76). The accumulation of half-loops is substantially easier than the accumulation of full loops, and does not require a matrix-matrix multiplication. Consequently, our methodology can be applied to the case of rank-0 biclusters with a reduced computational cost of only O(M N log(max(M, N ))) work.
All of our corrections involving sparsity, categorical-and continuous-covariates work just as well here. The formulae in section 12 can be carried forward, replacing all the loops with the associated half-loops. Specifically, we redefine our base-scores as follows: These numerical experiments illustrate the performance of our algorithm on a simple planted-bicluster problem involving both case-and control-patients, along with case-and control-genes. The case-genes associated with the case-and control-patients are stored in matrices D and X, respectively. The control-genes associated with the case-and control-patients are stored in matrices Y and W , respectively. For each numerical experiment we let D be a large M × M random matrix with entries chosen independently from a distribution with median 0. We construct X, Y , and W in a similar fashion, except that X, Y and W are 2M × M , M × M and 2M × M , respectively. For each experiment we implant within the case-matrix D a small m × m bicluster B1 that is rank-1 with error ε. The rows and columns of B1 are chosen at random. We also implant a second rank-1 error-ε 'red-herring' bicluster B2 within both the case-matrix D as well as the control-matrix X. To implant B2 we first randomly choose 3m/2 case-genes K and 3m/2 case-patients J D from D, as well as 3m control-patients J X within X. We then construct a 3 · 3m/2 × 3m/2 bicluster B2 that is rank-1 with error ε, and implant the top third of B2 into the chosen case-patients J D and the bottom two-thirds of B2 into the chosen controls J X (in each case using the chosen case-genes K). We implant another bicluster B3 in a similar fashion, except that B3 extends across both D and Y , and the total size of B3 is 3m/2 × 2 · 3m/2. To implant B3 we first randomly choose 3m/2 case-genes K D from D and another 3m/2 control-genes K Y from Y , as well as 3m/2 case-patients J from D. After constructing the rank-1 error ε matrix B3, we implant the left half of B3 into the chosen case-patients J and case-genes K D , while implanting the right half of B3 into the chosen case-patients J and control-genes K Y . Finally, we implant a fourth bicluster B4 which extends across D, X, W and Y . The bicluster B4 is of total size 3 · 3m/2 × 2 · 3m/2, and we create and implant it in a manner analogous to biclusters B2 and B3 above. We remark that our illustration is somewhat misleading, as the biclusters B1, B2, B3, and B4 are not typically disjoint. Because the row-and column-subsets for these biclusters are chosen randomly, there is typically a small overlap between them. For the purposes of this numerical experiment we implant each bicluster in the order described above: any overlapping entries will be overwritten by the next bicluster in the list (with biclusters B2, B3, B4 taking precedence). For each numerical experiment we calculate the average auc A 1 for B1, A 2 for B2, and so forth. Our goal is to find B1, despite the masking effect of the larger biclusters. case-patients and M X = 120 control-patients; colorectal cancer was determined to be the significant cause of death for the former, but not the latter. We also divide the 17941 genes into N D = 16286 case-genes and N Y = 1659 control-genes; the control-genes are those that are significantly co-expressed alongside genes MLH1 MSH2 and MSH6, whereas the casegenes are the remainder. We use these patient-and gene-classifications to delineate the case-matrix D, as well as the control-matrices X, Y and W . The covariates (gender and ethnicity) for each patient are shown to the far right, (i.e., grey vs black). Several such half-loops are highlighted via the thin blue rectangles (the top and bottom of each rectangle pick out a 2 × 1 submatrix). Generally speaking, the entries of a half-loop are equally likely be matched or mismatched. Some half-loops, such as the half-loop shown in red, are entirely contained within B. These half-loops are more likely to be matched than mismatched. In Panel-B we show some examples of mismatched and matched half-loops. Given a half-loop with row-indices j, j ′ and column-index k, the parity of the half-loop is determined by the sign of D jk D ⊺ kj ′ . Our algorithm accumulates a 'half-loop-score' for each row j and each column k. The half-loop-score for a particular row j is given by The half-loop-score for a column k is given by In Panel-C we show the distribution of half-loop-scores we might expect from the rows within D. The blue-curve corresponds to the distribution of scores expected from the rows/cols of D that are not in B, whereas the red-curve corresponds to the distribution of scores expected from the rows/cols of B. and redefine the (2-sided) covariate-dependent sub-scores for each t = 1, . . . , N T as follows: After substituting these new definitions, the subsequent combination of sub-scores remains unchanged. We pause to remark that we can also redefine our 1-sided covariate-dependent sub-scores as follows: After this redefinition, we use the same subsequent combination of sub-scores as discussed in section 12 (i.e., the combination appropriate for a 1-sided covariate-correction). One advantage to the half-loop algorithms described above is that they are very easy to analyze. In fact, much of our previous analysis remains unchanged; the main difference being that we replace the probability g l,ε,m with the new probabilityĝ ε . Recall that we previously defined the former to be the probability that a loop drawn from a low-rank submatrix is rank-l, whereas we now define the latter to be the probability that a half-loop drawn from a rank-0 submatrix is of matched sign. While g l,ε,m depends on both the noise-level ε as well as the submatrix size m, the probabilityĝ ε only depends on the level of noise, and not on the size m. To generate a rank-0 submatrix with noise level ε, we draw each entry independently and randomly to be either ±1 with a probability chosen so that g 1,ε,m equalsĝ ε .
A second advantage to the half-loop algorithms is that they can be used to search for differentially-expressed biclusters where each gene is either high across the cases and low across the controls, or vice-versa. These structures are -by definition -rank-1 structures that extend across both the case-and control-populations; they were deliberately ignored by the control-correction described in section 8, but can easily be found using the half-loop algorithms above.
A simple example illustrating this approach is shown in Figs 77 and 78. A more complicated example is shown in Figs 79 and 80. These two examples are structured almost identically to the numerical experiments described in Figs 24 and 54, the only difference being that we implant rank-0 biclusters, rather than merely low-rank biclusters. As these examples illustrate, our half-loop algorithm is not quite as good as our full loop-counting algorithm. Nevertheless, it is an order of magnitude faster, a feature that may be of interest when analyzing large data-sets.

Searching for triclusters:
There are often situations where the data-paradigm we've assumed -involving N measurements taken across M patientsdoesn't suffice. For example, in a clinical study there may be M patients, each of which are subjected to P different kinds of treatments (e.g., therapy regimes, medications, etc). For each of these P treatments, N variables may be measured for each patient. Within this paradigm, the data isn't best represented as a 2-dimensional array (e.g., a matrix), but rather as a 3-dimensional array (i.e., a box or a cube) comprising both rows and columns, as well as 'layers'. In formal terms, we  Fig 24, with the exception that the planted bicluster is not merely low-rank, but rather rank-0. A rank-0 bicluster is characterized by its size, as well as by the probabilityĝ ε that a half-loop drawn from the bicluster will be of matched sign. We generate our rank-0 biclusters by choosing each entry independently and randomly so that the probabilityĝ ε is equal to the g 1,ε,m we would associate with a rank-1 biclster.   In panel-C we show the trial-averaged value of A 4 , A 5 , A 6 . The left side of each panel displays results with I req = 3, the right side with I req = 2. Note that, when I req = 3 the goal was to find B0, as this was the only case-specific bicluster that was balanced across all the continuous-covariates and covariate-categories. On the other hand, when I req = 2 the goal was to find the most dominant of B0 B1 B2 and B3, as each of these case-specific biclusters were balanced across all the continuouscovariates while extending across at least 2 covariate-categories. Note that our half-loop algorithm does a modest job of achieving these goal, even in the presence of the other biclusters B4 B5 and B6. Note also that this half-loop algorithm is not quite as good as our original loop-counting algorithm (compare with Fig 55).  imagine our data arranged into an array D of dimension M × N × P , where D j,k,l corresponds to the k th measurement of the j th -patient as they undergo therapy l.
Within this 3-dimensional array, it is often prudent to search for subsets of patients that exhibit some kind of simple structure across a subset of measurements as well as a subset of treatments. Such a 'tricluster' would correspond to a 'sub-cube' of the data, rather than simply a submatrix. The techniques we discussed earlier can readily be extended to search for these kinds of objects as well.
For exposition, let's consider the 'planted-tricluster' problem. We'll assume that the data-array D contains a hidden m × n × p sub-array B involving patient-subset J B , measurement-subset K B , and treatment-subset L B (i.e., J B , K B and L B are unknown). The sub-array B will exhibit structure not exhibited by the rest of D, and our goal will be to locate the tricluster B (i.e,. to find the row-column-and layer-subsets J B , K B and L B ).
We'll assume that each entry of D that is not in B is drawn independently and randomly from, say, a median-0 distribution, but that the entries of B are drawn from a more structured distribution. For simplicity let's assume B is formed from a collection of outer-products; one for each pair of array-dimensions: where u 1 , u 2 ∈ R m , and v 1 , v 3 ∈ R n , and w 2 , w 3 ∈ R p . are each randomly chosen (but unknown) vectors that describe the low-rank structure within B.
As before, we'll binarize D, sending each entry to either +1 or −1, depending on its sign. Once we binarize D, we can calculate the following scores: corresponding to the row-, column-and layer-scores, respectively. Once we've calculated these scores, we can remove the rows, columns and layers with low scores, and repeat the entire process. As before, this process will focus on the rows, columns and layers of B with high probability as long as m √ M , n √ N and p √ P . The reason this process works is that -as before -the scores accumulate the ranks of the various loops within D. However, unlike the simpler situation discussed in section 2, D is a 3-dimensional array (and not merely a matrix).  Fig 14, and shows the distribution of continuous-covariate components across the remaining case-patients as our algorithm proceeds. As before, our algorithm ensures that the covariate-distribution remains relatively well-balanced.
Number of rows remaining early iterations late iterations fraction of cases remaining sorted by study Relative abundance of each study as algorithm proceeds: continuous-covariate correction 9752 5947 3626 2211 1349 823 306 502 Figure 85: This figure has the same format as Fig 18, and shows the relative abundance of each study as our algorithm proceeds. As before, our algorithm is able to ensure that the remaining case-patients are drawn from a reasonable mixture of the studies.
Consequently, there are 3 different kinds of loops within D, each traversing a different pair of array-dimensions (see Fig  86). Any loop within D that does not lie entirely within B is just as likely to be rank-1 as it is to be rank-2. On the other hand, loops within D that are entirely contained within B are more likely to be rank-1 than rank-2. This probability is not 100%, but it is still significantly greater than 50%, with the exact value dependent on how the u 1 , u 2 , . . . , w 3 are drawn. Moreover, this probability is still significantly greater than 50% even in the presence of a moderate amount of noise (e.g., if B were not exactly a sum of outer products). A numerical experiment corroborating this intuition is shown in Fig 87. Techniques along these lines have been used to find triclusters in clinical data involving several patients, measurements and therapies. See [M44] for some preliminary results.

Various properties of multivariate gaussians
This section serves as somewhat of an appendix to section 2, collecting various observations about gaussian distributions.
Note that the gaussian distribution ρ σ satisfies: Consequently, the first few moments of ρ σ are given by: implying that, if x is drawn from ρ σ (x), then the mean of x 2 will be σ 2 and the variance of x 2 will be 3σ 4 − σ 4 = 2σ 4 . One can also show that the 1-dimensional fourier-transform F of a 1-d gaussian is another 1-d gaussian:

M -rows
N -columns P -l a y e r s iso layer iso column iso row Figure 86: Illustration of the loops within a 3-dimensional array. We sketch the structure of a 3-dimensional data-array D, with J rows, K columns and P 'layers'. Each entry D j,k,l will lie in the cube shown. The loops within D can be divided into 3-categories: (a) iso-layer loops that stretch across 2 rows and 2 columns, (b) iso-column loops that stretch across 2 rows and 2 layers, and (c) iso-row loops that stretch across 2 columns and 2 layers. The row-score [Z ROW ] j aggregates all the iso-column and iso-layer loops associated with row-j. The column-score [Z COL ] k aggregates all the iso-row and iso-layer loops associated with column-k. The layer-score [Z LYR ] l aggregates all the iso-row and iso-column loops associated with layer-l. Each entry of D is chosen independently and randomly from a distribution with median 0. The tricluster B is constructed by first forming a sum of three outer products: B j,k,l = u 1 j v 1 k + u 2 j w 2 l + v 3 k w 3 l , and then flipping the sign of a randomly-chosen fraction f of the entries of B. We choose the sign-flipping probability f as described in section 5.1 (i.e., f depends on ε √ m). Our loop-counting algorithm produces a list of row-, column-and layer-indices of D in the order in which they are eliminated; those indices retained the longest are expected to be members of B. For each numerical experiment we calculate the auc A R (i.e., area under the receiver operator characteristic curve) associated with the row-indices of B with respect to the output list from our algorithm. We calculate the auc A C and A L for the columns and layers similarly. Finally, we use A = (A R + A C + A L )/3 as a metric of success; values of A near 1 mean that the rows, columns and layers of B were retained the longest by our algorithm. Values of A near 0.5 mean that our algorithm failed to detect B. On the right-hand side we show the trial-averaged value of A as a function of ε √ m and log M (m) for a few values of M . Note that our loop-counting algorithm is generally successful when ε √ m is sufficiently small and m is sufficiently large.

convolution of two one-dimensional gaussians:
It is well known that the convolution of two one-dimensional gaussians is yet another gaussian: This can be understood by noting the fourier-transform above and appealing to the convolution theorem (i.e., . This can also be understood by considering a random vector x N with N independently chosen entries, each either −1 or +1. Obviously, any particular entry of x N has mean 0 and variance 1. Therefore the sum S N = n=N n=1 x N n is a random variable with mean 0 and variance N ; when N is large S N will be drawn from ρ √ N . Now, considering a situation where α 2 + β 2 = γ 2 , we immediately see that S γ 2 must be drawn from ρ γ , whereas S α 2 and S β 2 are drawn from ρ α and ρ β respectively. Given that there is a one-to-one mapping between vectors x γ 2 and vectors x α 2 ⊕ x β 2 , the random variables S γ 2 and S α 2 + S β 2 must have the same distribution. Consequently, ρ γ must be equal to ρ α ⋆ ρ β .

two-dimensional gaussian:
We also use ρ a,b,θ ( x) to refer to the mean 0 anisotropic multivariate gaussian distribution in 2-dimensions with variances a 2 and b 2 , and first principal-component oriented at angle θ. That is to say: where R θ = cos θ − sin θ + sin θ cos θ .
In terms of visualization, the contours of ρ a,b,θ are ellipses with one principal axis pointing in the θ-direction, and the other principal axis perpendicular to the θ-direction; these elliptical contours will have principal radii proportional to a and b, respectively. Using our notation, we see that ρ a,b,0 ( x) is separable: . Thus, we can think of ρ a,b,θ ( x) as: These observations can be easily used to show that the 2-dimensional fourier-transform F of a 2-d gaussian is yet another 2-d gaussian: Note that ρ 1/a,1/b,θ has the same orientation as ρ a,b,θ , but with inverted principal-values.

restriction of two-dimensional gaussian to quadrants:
In general, given a gaussian ρ a,b,θ , the fraction p of this gaussian restricted to the first and third quadrants is This can be determined by first observing that p is the integral of ρ α,b,θ over the domain Ω with boundaries given by the x and y axes, subtending an angle of π: With this definition of Ω, we can write: Given this formulation, we can first rotate space (and Ω) by −θ: and then dilate space (and R ⊺ θ Ω) by Σ = diag(1/a, 1/b): Within this final formulation the integrand is a uniform isotropic gaussian ρ 1,1,0 , and so p (a, b, θ) is simply 1/2π times the angle subtended by the transformed domain Ω ′ = ΣR ⊺ θ Ω. This angle in turn can be determined by applying ΣR ⊺ θ to the boundaries of Ω (i.e., to the x and y axes), yielding the formula above.

approximation of orthonormalû,v:
If we randomly draw the numbers u 1 , . . . , u m and v 1 , . . . , v m independently from ρ √ 1/m , then each u 2 j and v 2 j will be drawn from a distribution with mean (1/m) and variance 2/m 2 . Thus, when m is large, the random variables |u| 2 and |v| 2 will be drawn from a gaussian-distribution with mean 1 and variance 2/m. Each term u j v j , on the other hand, will be drawn from a gaussian distribution with mean 0 and variance 1/m 2 , implying that the random variable u · v is drawn from ρ 1/m . Thus, the unit vectorû = u/ |u| is likely within of v, and will be orthogonal toû. Similarly, the unit-vectorv =ṽ/ |ṽ| will also be orthogonal toû and within O (1/ √ m) of v. Given this construction, the unit vectorsû andv will be orthonormal and uniformly distributed (conditioned on the fact thatû ·v = 0). In other words, the original vectors u and v were close to orthonormal to begin with.

convolution of two two-dimensional gaussians:
Given random variables x and y drawn from ρ a,b,θ and ρ c,d,φ , the sum z = x + y will be drawn from ρ a,b,θ ⋆ ρ c,d,φ (i.e., the distribution obtained by convolving ρ a,b,θ and ρ c,d,θ ) This distribution will also be a gaussian, of the form ρ f,g,ω , for some f, g, ω. This latter distribution can be understood via the fourier-transform. That is to say: This distribution ρ f,g,ω can be determined by observing: Thus, the distribution ρ f,g,ω can be determined by finding the f, g, ω that satisfy the following equations (drawn from the exponents in the previous equalities): These equalities can be rearranged into the following: (2) a 2 − b 2 sin 2θ + c 2 − d 2 sin 2φ = f 2 − g 2 sin 2ω Thus, given 2θ, a 2 − b 2 , 2φ and c 2 − d 2 , we can use the first two equalities to find 2ω and f 2 − g 2 (see, e.g., Fig 88).
Once this is accomplished we can use the third equality to find f 2 + g 2 .
a Because ε = 0, we construct the matrix A as follows: where U ∈ R m×l and V ∈ R n×l , and A = U V ⊺ is a singular-value-decomposition. Defining u j = U j: and v k = V k: to be the j th -row of U and the k th -row of V , respectively, we can write A as: This allows us to write B as: and [B ⊺ B] as: For any particular [B ⊺ B] kk ′ the row-vectors v k and v k ′ are fixed. Any particular summand (corresponding to a particular j) will be +1 if that row-vector u j makes an acute angle with both v k and v k ′ . Similarly, a summand will be +1 if u j makes an obtuse angle with both v k and v k ′ . On the other hand, a summand will be −1 if and only if u j makes an acute angle with one of the v k , v k ′ , and an obtuse angle with the other. If m is large, we can assume that each of the u j are independent and randomly oriented. Thus, if we let θ ∈ [0, π] be the angle between v k and v k ′ , then any particular u j has probability θ/π of contributing a summand of −1 to [B ⊺ B] kk ′ . Conversely, any particular u j has probability 1 − θ/π of contributing a summand of +1. Thus, given a fixed set of v k and v k ′ , the expected value of [B ⊺ B] kk ′ will be m (1 − 2θ/π).
In the cases where l ≥ 2, the normalization factor Z l is equal to: Note that the first few Y q are given by: Using this construction we can calculate the expected value of magnitude of [B ⊺ B] kk ′ , assuming that k = k ′ : Noting that: π/2 0 sin q (θ) dθ = Y q , and π/2 0 θ · sin q (θ) dθ = X q , with we see that: Evaluating this expression for the first few l gives: While the following is not an exact formula, when l ≥ 2 this expected value is quite close to: 19 Information-theoretic phase-transitions for rank-1 planted-bicluster problem In this section we discuss in more detail the information-theoretic size-and noise-transitions for the rank-1 plantedbicluster problem. This problem involves sifting through an M × M binary data-matrix D, and distinguishing between the following two hypotheses.
H0: The null hypothesis. In this hypothesis the data-matrix D has no planted bicluster within it; each entry of D is chosen randomly and independently from {−1, +1}.

H1:
The planted-bicluster hypothesis: In this hypothesis the data-matrix D is altered by planting within it a smaller m × m submatrix which corresponds to the binarization of a randomly-oriented rank-1 error-ε bicluster. The entries of D outside of this m × m submatrix will be chosen randomly and independently as before, but the entries of D within this m × m submatrix will depend on one another. One step towards understanding the planted-bicluster problem is to determine when H1 is distinguishable from H0; i.e., how large m needs to be (and how small ε must be) before the planted-bicluster impacts the distribution of the elements of D. To address this question we assume that D is drawn from H0, and ask what kinds of structures might naturally exist within D, even without any deliberate attempt to plant any specific biclusters. As we'll see below, even under H0 we still expect to find within D a vast number of biclusters. These biclusters include (a) small noiseless biclusters of size m 2 log 2 M and (b) much larger and noisier biclusters of size m ∼ √ M and ε [π log M ] −1/2 , as well as many structured elements in between these two extremes. Nevertheless, under H0 we are exponentially unlikely to find biclusters that are both large and low-noise.
Size phase-transition: Let's assume the noise ε = 0 for the moment, and consider the limit M ≫ m ≫ 1. Let's take a particular m × m submatrix B within D. The requirement that B be rank-1 places constraints on the signs of (m − 1) 2 elements of B. That is, B will be rank-1 with probability p 0 = 2ˆ − (m − 1) 2 , or approximately -log p 0 ≈ log 2 · m 2 . The total number of such m × m submatrices within D is Z = M m 2 , which can be approximated as Z ≈ exp 2m log (M/m) + 2m + 2m 2 /M , or approximately log Z ≈ 2m log M . The product P 0 = 1−(1 − p 0 ) Z ≈ Zp 0 provides an upper bound on the probability that D contains at least one m×m rank-1 submatrix (this is technically an overestimate because the ranks of each of the Z submatrices are not independent of one another). Obviously, this upper bound P 0 tends to 0 when − log p 0 ≫ log Z, which occurs when m is larger than [2/ log 2] log M = 2 log 2 M . This means that, for M ≫ m ≫ 1, a random binary-matrix D is exponentially unlikely to contain any m × m rank-1 matrix, as long as m 2 log 2 M . One consequence of these observations is that, if we were to embed within D a submatrix B of size m in the range 2 log 2 (M ) < m < √ M , say m ∼ 3 √ M , then B would be very significant but still undetectable to our algorithm.
Noise phase-transition: Now let's consider the second phase-transition, this time assuming M ≫ m ≫ log M . As before, we'll assume that D is an M × M binary-matrix with each entry chosen independently from {−1, +1}, and we'll consider m × m submatrices B within D. This time we are interested in the probability p ε √ m that a particular B is rank-1 with noise at most ε √ m. As we pointed out above, when ε √ m = 0 we impose constraints on the signs of ≈ m 2 elements of B; any particular B will be exactly rank-1 with probability p 0 ≈ 2ˆ −m 2 . If instead we require B to be rank-1 with noise ε √ m, we actually only impose constraints on ≈ 1 − 2f ε √ m m 2 randomly chosen elements of B (where f ε √ m is the 'flip-probability' discussed in section 5.1). Assuming that ε √ m 10, we can approximate f ε √ m with An asymptotic expansion in terms of ε √ m reveals that, within the large-ε √ m regime: Thus, we see that: Comparing this with log Z = 2m log M − 2m (log m − 1), we see that If we assume that m ∼ √ M , then P ε √ m will vanish when Consequently, when ε √ m 10 and m ∼ √ M , we can expect that P ε √ m will be essentially 0 whenever
By contrast, the angle ω of a 1 × 2 vector drawn from outside B is uniformly distributed (i.e., ρ original D (ω) = 1/2π). These two distributions can be used to calculate the relative-entropy D original

Analysis of loop-scores
In this section we analyze our loop-scores and show that they are, in a certain sense, close to optimal within the context of our methodology. In this section we'll use the notation l [J; K] to refer to a c × c-submatrix involving rows J = {j 1 , . . . , j c } and columns K = {k 1 , . . . , k c }. To ease the discussion, we'll refer to these c × c submatrices as 'nets'. Our analysis begins by noting that our original loop-score was just one of a large class of 'c-scores' which are constructed using the following general template: L j χ is the total number of nets intersecting row j that contain exactly χ positive entries. In other words, the final c-score for row j takes the form [Z ROW ] j = µ · L j , where µ ∈ R 1+c 2 is the vector formed from the various µ (χ), and L j ∈ N 1+c 2 is the vector formed from the L j χ . The difference between the c-score for some row j and the c-score for another row j ′ depends only on the difference between L j and L j ′ .
To understand the structure of L j more clearly, let's restrict our attention to a fixed column-subset K = {k 1 , . . . , k c }. Given that column-subset K is fixed, we can consider various row-subsetsJ = {j 1 , . . . , j c−1 }, each chosen such that j / ∈J andJ ∩J B = ∅. For each choice ofJ, the (c − 1)× c sub-net l J ; K will have some number of positive entries χ l J ; K . This collection of values for χ (taken across the variousJ) can be encoded as a vector P K ∈ N 1+(c−1)c (i.e., P K χ is the number of row-subsetsJ such that χ l J ; K = χ). Now the (c − 1) × c sub-net l J ; K can be can be expanded by adding the row l [j; K] to form l j ∪J; K . Obviously, χ l j ∪J; K is equal to χ (l [j; K]) + χ l J ; K , implying that the collection of values for χ l j ∪J; K (taken across the variousJ) can be determined simply by embedding the vector P K in N 1+c 2 and shifting it forward by χ (l [j; K]) indices. That is to say, the collection of values for χ l j ∪J; K is simply P K ⋆ e χ(l[j;K]) , where '⋆' denotes a vector-convolution and e χ is a vector with entry χ equal to 1 and the other entries equal to 0.
With this notation the vector L j can be written as: where the vector P K is calculated by summing over all row-subsetsJ that exclude j. At this point we are ready to compare the row j, which we'll assume lies in J B , with another row j ′ taken from outside B. We can immediately see that: where the vector P K is calculated by summing over all row-subsetsJ that exclude both j and j ′ (i.e., nets which include both j and j ′ will contribute to the same χ-index within L j and L j ′ ). Recalling our discussion above, we need only consider nets l [J; K] that have an overlap x = |J ∩ J B |, and y = |K ∩ K B | of at most 1. This implies that we can calculate the vector P K by summing over only those row sub-setsJ that exclude J B as well as j ′ . With this assumption each l J ; K is disjoint from B, and each χ l J ; K will be drawn from the same binomial distribution Thus the expected value of K P K is merely the binomial distribution above multiplied by the number of possibleJ: Based on the linearity of expectation and the independence of P K and χ (l [j, K]) and χ (l [j ′ ; K]), we have that: Thus, the ability for Z ROW to discriminate between j and j ′ is limited by the difference between the distributions E K j and E K j ′ . A calculation of their relative-entropy shows that these two distributions E K j and E K j ′ are maximally differentiable when c = 1. When c = 1 the distribution E K P K χ is trivial and the choice of µ is irrelevant (so long as µ (0) = µ (1)). In this case the best c-score for each row is simply the number of positive entries in that row. This is equivalent to simply using the 'degree' of each row (treating the matrix D as the adjacency matrix of a bigraph). A similar statement holds for the columns: The best c-score for each column is simply the number of positive entries in that column. This simple score -i.e., counting degree -is only informative when m √ M .
Putting all this together, we see that -for the easier problem when B is a matrix of all '+1's -there is no advantage to be gained by considering larger c × c-submatrices, or by considering a more general measurement µ (l). The choice of c = 1 is asymptotically optimal and, regardless of the choice of µ, the scores will only be indicative of J B and K B when m √ M (and won't be indicative of B when m = o( √ M ). In other words; algorithms built using the template above will always be subject to the same size-threshold of m √ M . Therefore, we conclude that our original problem also suffers from the same size-threshold limitation; regardless of our choice of c or µ, the template above won't be able to construct scores which indicate the rows and columns of B when m = o( √ M ).
To summarize: we have argued that the size-transition exhibited by our loop-scores will also be exhibited by a rather large class of similar scores. To be more specific, we conclude that -for fixed ε √ m -all scores constructed using the template above will be informative only when m √ M , and will not be informative when m = o( √ M ). This then implies that -when ε √ m = 0 -our loop-scores are asymptotically optimal (within the constraints of the template above). Therefore, one can imagine that the rows and columns which our algorithm focuses on will be those that are most highly correlated with v and u, respectively. These rows and columns should thus correspond to high-magnitude components within the vectors u and v, respectively. Given these observations, it is natural to consider the following simple spectral-biclustering method:
Step 1 Calculate the singular-value-decomposition D = U ΣV ⊺ . Set u and v to be the first columns of U and V .
Step 2 Set [Z ROW ] j = u 2 j and [Z COL ] k = v 2 k to be the entrywise-squares of the singular-vectors u and v.
Step 3 Use Z ROW and Z COL to produce a ranked list of rows and columns of D.
Another way to think about the connection between our loop-counting algorithm and this spectral-biclustering method is to first consider the bigraph corresponding to the binarized version of D. Within this context, we compute the rowloop-score by considering each (1 + 1) × 2 = 4-step path (j, k) → (j ′ , k) → (j ′ , k ′ ) → (j, k ′ ) through the bigraph associated with D (see Fig 89). Let us denote such a path by p [J; K], indexed by ordered row-subsets J = {j, j ′ } and column-subsets K = {k, k ′ } (here j is fixed, but j ′ , k and k ′ are arbitrary). Note that this path begins and ends at the fixed row j (i.e., this path is a loop) . For each p [J; K] we calculate the product of the matrix entries along that path: where j [s] corresponds to the j-index with s 'primes' atop it, adopting the convention that j [2] = j ′′ = j. As expected from our loop-scores, these simple (1 + 1) × 2-step paths produce a product of the form: Note that the path p [J; K] steps on each row and each column an even number of times. Consequently, if p [J; K] were fully contained within a bicluster that was exactly rank-1, then π would always equal +1. More generally, within the context of the planted-bicluster problem, the product π (p [J; K]) is more likely to be +1 if the path p [J; K] is fully contained within a planted bicluster. On the other hand, we expect that π (p) will be equally likely to be +1 or −1 when the path p is not fully contained within a planted bicluster. Once we have the product π (p) for each path p [J; K], we aggregate the results to form the score for row-j, which we'll denote by Z 1 ROW j := [Z ROW ] j : [  Figure 89: Illustration of paths through a bigraph. Our original row-scores Z 1 ROW j are computed by considering all (1 + 1) × 2 = 4-step paths through the bigraph of D which both start and end at row-j (see top-left). If such a 4-step path lies entirely within a rank-1 bicluster it will satisfy the following property: the product of the matrix-entries along the path will be equal to +1. The accumulation of such a product over all possible paths is given by We may also consider other paths through the bigraph which start and end at row-j. On the top-right we show one such family of paths: namely (2 + 1) × 2 = 6-step paths. One can easily see that these 6-step paths also have the same property as the 4-step paths above: namely, the product of the matrix-entries along the path will be equal to +1 if the path lies entirely within a rank-1 bicluster. The accumulation of these products is given by (DD ⊺ ) 2+1 jj . On the bottom we generalize this notion to (d + 1) × 2-step paths. Again, if such a path is contained within a rank-1 bicluster, the product of its entries will be +1. The accumulation of these products is given by (DD ⊺ ) d+1 We can generalize this approach by considering longer (d + 1)×2-step paths, rather than simple (1 + 1)×2-step paths. Some examples are illustrated in Fig 89. These paths involve larger row-subsets J = j, j ′ , . . . , j [d] and K = k, k ′ , . . . , k [d] .
The product π (p) is given by: with the convention that j [d+1] = j. As before, within the context of the planted-bicluster problem, π (p) is more likely to be +1 whenever p is fully contained within a planted bicluster; if p is not contained within a planted bicluster then π(p) is more likely to be drawn from the uniform distribution on {−1, +1} (we'll quantify this more precisely below). The accumulation of such products, which we'll denote by Z d ROW j is now given by: As d increases, the d-score Z d ROW converges in direction to the entrywise-square of the dominant left-singular-vector: i.e., u .2 . Similarly, the d-score Z d COL converges in direction to the entrywise-square of the dominant right-singular vector: i.e., v .2 . These scores are identical to the scores used in Step-2 of the simple spectral algorithm.
In the next two sections we argue that the size-threshold and noise-threshold associated with this simple spectral method can't be much better than the detection-thresholds associated with our loop-counting algorithm. Later, in section 22.3, we refine this argument to examine the behavior of the spectral method near the detection-threshold, explaining some of the results shown in Figs 29,30 and 31.

Size-threshold:
At this point we can use a simple counting argument to estimate how effective these scores Z d ROW might be when it comes to detecting a planted bicluster. This counting argument is the first stage of the commonly-used 'moment-method' for analyzing the distribution of eigenvalues in a random matrix [15]. While we strongly encourage the interested reader to pursue this reference, we provide the basics of this argument here.
For now let's focus on the noiseless limit, and say that the m × m planted bicluster B is perfectly rank-1, and involves the row-subset J B and column-subset K B . Let's refer to the matrix-entries of D that are not in B as B ∁ . We'll also assume that d is fixed and that M and m are very large, with M ≫ m ≫ d; we'll pay attention to the change in behavior that occurs when m ∼ √ M . When we refer to quantity such as O M 2d+1 we'll ignore terms that do not grow like a power of m or M (e.g., we'll ignore terms of order d, e d , and so forth). To ease the notation, let's define δ = (d − 1) /2; we'll later use δ to describe paths outside of B that contribute a π (p) of +1.
Any (d + 1) × 2-step path p [J; K] will step on up to 2 (d + 1) different matrix-entries within D. For a given path p, some of its matrix-entries may be traversed only once, but others may be stepped on multiple times (i.e., they may be traversed with multiplicity greater than 1). Paths p fall into one of several categories, depending on the unique matrixentries in p and their multiplicities. The essential idea behind our argument will be that, in order for a path p to contribute positively to the row-score, its matrix-entries from B ∁ must be multiplicity-2, but its matrix-entries from B need only be multiplicity-1.
Category-D0: This category includes paths p that start at some / ∈ J B and remain outside of B, while also having some matrix-entries with odd multiplicity. These paths will have a product π (p [J; K]) that is drawn from the uniform distribution on {−1, +1}. Given a fixed / ∈ J B , there are O M d · M d+1 such paths. We arrive at this estimate by noting that there are ∼ M choices for each of the d rows j [1] , . . . , j [d] , and ∼ M choices for each of the (d + 1) columns k [0] , k [1] , . . . , k [d] . These Category-D0 paths each contribute a term of type π (p) = ±1 to the score Z d ROW  . When averaging over all possible configurations of D, the total contribution of these Category-D0 paths will be 0. If these paths were all independent from one another, their total contribution would have a variance close to M d · M d+1 . Because the category-D0 paths are correlated, the variance of their total contribution is even larger, but is still is proportional to M d · M d+1 multiplied by a prefactor of the form [constant] d . The exact value of this prefactor is not as relevant as its magnitude, which grows with d but does not depend on M for fixed d. Thus, with respect to m and M , the variance of the total contribution of the Category-D0 paths is O M d · M d+1 .
Category-D: This category includes paths p that start at some / ∈ J B and remain outside of B, while only having matrix-entries with even multiplicity. These paths will have a product π (p) = +1. For a given / ∈ J B there will be O M δ · M δ+1 paths available. We arrive at this estimate by noting that each matrix-entry must be traversed an even number of times (e.g., not once, but 0, 2, 4, . . . times): there can be at most δ unique rows amongst j [1] , . . . , j [d] , and at most δ + 1 unique columns amongst k [0] , k [1] , . . . , k [d] . A path p with the maximum number of unique rows and columns thus has d + 1 = 2δ + 2 unique matrix-entries, starting out at , k [0] , including j [s+1] , k [s] and j [s+1] , k [s+1] for s ∈ 0, . . . , δ, and potentially terminating at , k [δ] . Such a path has multiplicity-2 for each of its matrix-entries. When averaging over all possible configurations of D, the total contribution of these Category-D paths will be O M δ · M δ+1 ; because δ ≈ d/2, this is bounded above in magnitude by the standard-deviation associated with the Category-D0 paths. Note that in this estimate we need not worry about other Category-D paths that involve fewer than δ unique rows and δ + 1 unique columns, because the quantity of these other paths is an order of magnitude lower than the multiplicity-2-paths just discussed.
Category-B: This category includes paths p that start at some j ∈ J B and remain inside B. These paths will necessarily step on each row and each column an even number of times (by construction), and will have a product π (p) = +1. Given a fixed j ∈ J B , there will be O m d · m d+1 such paths. We arrive at this estimate by noting that there are ∼ m choices for each of the d rows j [1] , . . . , j [d] , and ∼ m choices for each of the (d + 1) columns k [0] , k [1] , . . . , k [d] .
The total contribution of these Category-B paths will be O m d · m d+1 .
Category-C0: This category includes paths p that include some matrix-entries in B, as well as some matrix entries outside of B (i.e., in B ∁ ). Additionally, the paths p must each have at least one of the following properties: either (i) some of the matrix-entries outside of B have an odd multiplicity, and/or (ii) the collection of matrix-entries inside of B does not include each row in J B an even number of times, and/or (iii) the collection of matrix-entries inside of B does not include each column of K B an even number of times. These paths produce a product π (p) which is drawn uniformly from {−1, +1}. There are many such paths, but their total number is bounded by O m · M 2d if the path p starts at some given j ∈ J B and by O m 2 · M 2d−1 if the path starts at some given / ∈ J B , both of which are bounded by the number of Category-D0 paths. The reason there are so few of these paths is that at least one of the matrix-entries j [s] , k [s] or j [s+1] , k [s] must have a row drawn from J B and a column drawn from K B . Consequently, after averaging over all possible configurations of D, the total contribution of these paths to (i.e., d = 1), increases as a function of d up until a point, and then decreases as d gets larger and larger. The behavior of the trial-averaged Auc as a function of d is shown in Fig 90B (thick black line).
If we were to calculate our scores only once and use a linear threshold to distinguish rows inside B from rows outside B, then the Auc we just described would dictate the success of our methodology. However, our algorithm is iterative; we typically remove a small fraction of the rows and columns of D with each iteration. As mentioned in Fig 32, this fraction is usually 5%. Our hope is that -by removing the rows and columns associated with the lowest scores -we preferentially remove rows and columns from outside B, leaving the rows and columns of B mostly untouched. This strategy will be successful only if the lowest scores are more likely to come from outside B than from inside B. Thus, with regards to our iterative framework, a more revealing measure of the quality of the various d-score-distributions is the Auc obtained by comparing the bottom 5 th -percentile of row-scores in B with the bottom 5 th -percentile of row-scores outside B 8 . This trial-averaged metric -referred to as 'A05' -is shown in Fig 90C (thick black line).
Note that both the Auc and the A05 are markedly higher when d is small than when d is very large. This fact is a direct result of the lopsided nature of ζ d B for large d, and is the reason why our loop-counting algorithm outperforms the simple spectral method in this example. We now turn to a discussion of this lopsidedness.
Recall that the square-root-scores [z ∞ ROW ] associated with the simple spectral method are drawn from ζ ∞ B and ζ ∞ B ∁ , which describe the distribution of the coefficients of the vector | u|. The distribution of these coefficients can in turn be determined analytically by using the techniques of [M32].
Before we engage in this analysis, we make the following observations: 1. The singular-vector u is equal to the dominant eigenvector of the covariance-matrix DD ⊺ /M .
2. The covariance-matrix DD ⊺ /M can be approximated by the sum of (a) a random covariance matrix W and (b) a symmetric matrix C with the following block structure: C is mostly 0, except for an m × m block corresponding to J B with off-diagional entries roughly proportional to ±m/(2M max(1/2, l − 1)). See section 18 for a derivation.
In this particular example, because l = 2, the nonzero off-diagonal entries of C are close to ±m/(2M ).
3. The block matrix C is rank-1 with a dominant eigenvalue roughly equal to ∼ m 2 /(2M max(1/2, l − 1)). At this point we can adopt the approach of [M32], which observes that the dominant eigenvector u of DD ⊺ /M obeys an implicit equation involving the entries of ψ and u itself, as well as the empirical spectral distribution of W (which is, in this case, given by a Marchenko-Pastur law with aspect-ratio 1.0). Following this train of logic, one can show thatdepending on m -the vector u may or may not point in the direction of ψ. More specifically, if we define ω to be the angle between u and ψ, one can show that cos(ω) 2 can be approximated by: with this approximation holding with high probability in the limit as m, M → ∞ (with m/ √ M fixed). Note also that, aside from the restriction that u · ψ = cos(ω), the vector u is drawn from a uniform distribution on the sphere in R M .
We illustrate this phenomenon with a numerical experiment in Fig 90D. For this experiment we fix M = 768, l = 2 and ε = 0. We then vary m; for each m we construct multiple trials and measure ω. The trial-averaged cos(ω) 2 is plotted in black, while the analytical approximation to cos(ω) 2 from Eq. 3 is shown in grey.
The value of ω can be used to determine the distribution of u on R M : u is drawn uniformly from the set of vectors that are of unit norm and have ψ-component equal to cos(ω) (i.e., u is drawn from a slightly smaller sphere of dimension M − 1 centered at cos(ω) · ψ and normal to ψ). This distribution on u in turn determines the two distributions ζ ∞ B and ζ ∞ B ∁ .
In general, the distribution ζ ∞ B ∁ will be a half-gaussian; these square-root-scores are associated with the component of u that is randomly oriented. On the other hand, the distribution ζ ∞ B may or may not look like ζ ∞ B ∁ . If m is too low, then cos(ω) will be essentially 0, and the distribution ζ ∞ B will look essentially like ζ ∞ B ∁ . But, if m is large then cos(ω) will be large, and the distribution ζ ∞ B will be very different from ζ ∞ B ∁ . Coming back to our example, we have m = 39 and m/ √ M ≈ 1.4; the trial-averaged cos(ω) is approximately 0.19. If we use this value of ω to determine ζ ∞ B and ζ ∞ B ∁ as described above (i.e., drawing u from the appropriate sphere), we obtain the grey lines shown in the background of the spectral square-root-scores in Fig 90A. Note that this approximation is very close to the trial-averaged distributions shown in blue and red.
Taking a step back from our example we can immediately see that, if we were to increase m, then cos(ω) would increase as well, and u would align more closely with ψ. As u and ψ become more aligned, the distribution ζ ∞ B moves to the right, 8    as shown in Fig 91. When cos(ω) is sufficiently large, then ζ ∞ B is very well distinguished from ζ ∞ B ∁ , and the simple spectral method performs well.
At this point we can see that the value of cos(ω), and hence the performance of the spectral method, is a function of the bicluster size m, as well as the bicluster rank l and error ε. For many choices of parameters, such as those shown in the example above, cos(ω) will be small and the loop-scores will be more useful than the spectral-scores (at least within the context of our methodology). This phenomenon is particularly pronounced when the rank l of the planted-bicluster is greater than 1; in this situation cos(ω) increases only gradually as a function of m/ √ M . On the other hand, when l = 1 then cos(ω) increases rather quickly as a function of m/ √ M , and this phenomenon is not as pronounced. Indeed, for some choices of parameters (e.g., if l = 1 and m and ε are sufficiently large) then it is possible for the spectral-scores to be more useful than our loop-scores (see, e.g., label 'E' near the detection boundary in Fig 29C).
Overall, we believe that the loop-scores are a better choice than the spectral-scores. Not only are the loop-scores typically more useful with regards to the planted-bicluster problem (as discussed above), but the loop-scores are also less sensitive to certain kinds of spectral noise.
We close by highlighting two more planted-bicluster problems identical to the example discussed above, but with different values for m and l. The first involves m = 28, l = 1, and the second involves m = 64, l = 4. The behavior of the distributions ζ ∞ B and ζ ∞ B ∁ for these problems is qualitatively similar to the behavior shown in Fig 90A, but the separation between the two distributions depends on the details. The trial-averaged Auc and A05 for these two problems is shown in Figs 90B,C using magenta and green lines in the background. Note that the general trend observed in the m = 39, l = 2 case still holds: the spectral scores are less useful than the loop-scores. Also note that the peak values for the Auc and the A05 occur at different values of d, depending on the problem parameters.

A remark regarding message-passing algorithms
As readily observed in the examples above, the difference between the two distributions ζ d B and ζ d B ∁ depends on d. We've quantified this difference using the metrics Auc and A05 (which are easy to measure in a numerical experiment), but one could imagine a more comprehensive measurement of the information D KL (i.e., relative-entropy) between these two distributions. Given what we've seen so far, it seems likely that the 'best' d-score for any particular planted-bicluster problem is neither the loop-score (i.e., d = 1) nor the spectral-score (i.e., d = ∞), but rather somewhere in the middle.
Taking this reasoning a step further, it is tempting to try and construct some nonlinear function of the 'best' d-score which allows for the 'optimal' classification of J B versus J B ∁ . Going even further, one might imagine a score that is itself constructed nonlinearly; instead of building the d-score Z d ROW by applying power-iteration to the covariance-matrix DD ⊺ , one might try and build an even better score by applying a nonlinearity in between each stage of the power-iteration.
The message-passing algorithms of [M33], [M19], and [M41] proceed along these lines. Generally speaking, these algorithms choose an appropriate nonlinearity to apply between stages of a 'message-passing procedure' similar to poweriteration. By choosing this nonlinearity carefully, these methods can significantly reduce their detection-thresholds for a variety of problems very similar to our planted-bicluster problem (such as, e.g., the planted-clique problem). Given the success of the message-passing algorithms of [M33], [M19], and [M41], it seems certain that there exists a message-passing algorithm that outperforms our loop-counting algorithm when applied to the planted-bicluster problem; we fully intend to pursue this line of research in the future.
We remark that, when developing a message-passing algorithm, it is critical to ensure that the algorithm performs robustly even outside the framework of the planted-bicluster problem. That is to say, it is important to ensure that the message-passing procedure (and the choice of nonlinearities) is not too specific to the assumptions of the planted-bicluster problem, and that the algorithm generalizes to the kinds of scenarios seen in real data sets.
For example, we've already seen that the 'best' d-score for the planted-bicluster problem depends strongly on the problem parameters. More specifically, the 'best' d depends not only on the size and spectrum of the planted bicluster clear-cut. For example, we might imagine that our data includes N measurements taken across M total patients, each of