
Fig 1.

A highly idealized cartoon of different kinds of biclusters.

In each panel we show a heat-map of an M × N matrix ‘D’, which contains a large embedded bicluster (highlighted in pink) with a special structure. In this cartoon, light and dark pixels correspond to high and low values, respectively, for the corresponding matrix-entry. Many approaches to biclustering search for structures containing mostly ‘large’ or ‘small’ values, as shown in Panels A and B. Such a bicluster can be thought of as delineating a subset of columns which are ‘differentially-expressed’ with respect to the remaining rows of D. Our algorithm generalizes this notion, searching for biclusters that are ‘low-rank’. Examples of low-rank biclusters include those shown in Panels A and B, as well as ‘rank-1’ biclusters which can exhibit co-expression without necessarily exhibiting differential-expression (see Panel-C and Fig 5 later on). Also encompassed are ‘rank-2’ and higher biclusters which exhibit higher-order correlations that are not necessarily obvious to the eye (see Panel-D and Fig 7 later on). Note that, while the biclusters shown in this cartoon are very large and essentially noiseless, our algorithm can readily discover biclusters that are much smaller and noisier (see S1 Text).
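The four panel-types sketched above can be generated directly. The following is a minimal illustration in Python with NumPy (matrix sizes and variable names are ours, not from the paper), building one matrix of each kind and checking its rank:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 50

up    = np.ones((m, n))                                     # Panel A: uniformly large values
down  = -np.ones((m, n))                                    # Panel B: uniformly small values
rank1 = np.outer(rng.normal(size=m), rng.normal(size=n))    # Panel C: co-expression only
rank2 = (np.outer(rng.normal(size=m), rng.normal(size=n))
         + np.outer(rng.normal(size=m), rng.normal(size=n)))  # Panel D: higher-order correlation

print([int(np.linalg.matrix_rank(B)) for B in (up, down, rank1, rank2)])   # [1, 1, 1, 2]
```

Note that the uniformly large and uniformly small blocks of Panels A and B are themselves rank-1; the low-rank criterion strictly generalizes the differential-expression criterion.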


Fig 2.

Illustration of the algorithm operating on a case-matrix alone (i.e., D only).

In Panel-A we show a large M × N binarized matrix D (black and white pixels correspond to values of ±1, respectively). In the upper left corner of D we’ve inserted a large rank-1 bicluster B (shaded in pink). Our algorithm considers all 2 × 2 submatrices (i.e., ‘loops’) within D. Several such loops are highlighted via the blue rectangles (the corners of each rectangle pick out a 2 × 2 submatrix). Generally speaking, loops are equally likely to be rank-1 or rank-2. Some loops, such as the loop shown in red, are entirely contained within B. These loops are more likely to be rank-1 than rank-2. In Panel-B we show some examples of rank-2 and rank-1 loops. Given a loop with row-indices j, j′ and column-indices k, k′, the rank of the loop is determined by the sign of the product Dj,k Dj,k′ Dj′,k Dj′,k′: the loop is rank-1 when this product equals +1 and rank-2 when it equals −1. Our algorithm accumulates a ‘loop-score’ for each row j and each column k. In its simplest form, the loop-score for a particular row j is given by [ZROW]j = Σj′≠j Σk<k′ Dj,k Dj,k′ Dj′,k Dj′,k′. Analogously, the loop-score for a column k is given by [ZCOL]k = Σk′≠k Σj<j′ Dj,k Dj,k′ Dj′,k Dj′,k′. In Panel-C we show the distribution of loop-scores we might expect from the rows or columns within D. The blue-curve corresponds to the distribution of scores expected from the rows/cols of D that are not in B, whereas the red-curve corresponds to the distribution of scores expected from the rows/cols of B. In Panel-D we show the distribution of loop-scores we might expect by pooling all rows or columns of D. The rows or columns that correspond to the lowest scores are not likely to be part of B.
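A loop-score of this kind can be computed without enumerating loops one at a time: for a ±1 matrix, summing the loop-product over all pairs of columns reduces to squaring entries of D Dᵀ. The sketch below is our own minimal illustration (the planted-bicluster sizes and the function `loop_scores` are ours, not the paper's implementation); it verifies that rows and columns of a planted rank-1 bicluster accumulate higher scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random +/-1 "case-matrix" D with a planted rank-1 bicluster in its corner.
M, N, m, n = 60, 80, 20, 25
D = rng.choice([-1, 1], size=(M, N))
u = rng.choice([-1, 1], size=m)
v = rng.choice([-1, 1], size=n)
D[:m, :n] = np.outer(u, v)              # every loop inside this block is rank-1

def loop_scores(D):
    """Loop-scores over ordered index-pairs: rank-1 loops add +1, rank-2 loops add -1."""
    M, N = D.shape
    A = D @ D.T                         # A[j, j'] = sum_k D[j, k] * D[j', k]
    B = D.T @ D
    # For fixed (j, j'), summing the loop-product over ordered pairs k != k'
    # equals A[j, j']**2 - N, since the k == k' terms each contribute +1.
    Zrow = (A**2 - N).sum(axis=1) - (A.diagonal()**2 - N)   # drop the j' == j term
    Zcol = (B**2 - M).sum(axis=1) - (B.diagonal()**2 - M)
    return Zrow, Zcol

Zrow, Zcol = loop_scores(D)
print(Zrow[:m].mean() > Zrow[m:].mean())   # True: planted rows score higher
```

This ordered-pair variant counts each loop twice, which rescales the scores but does not change their ranking.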


Fig 3.

Performance of loop-scores vs spectral-biclustering applied to the planted-bicluster problem.

For each instantiation of the planted-bicluster problem we choose an M, m, ε and l; we use these parameters to generate a random M × M matrix D and embedded m × m rank-l submatrix B with spectral noise ε. For each instantiation, our algorithm produces a list of row- and column-indices of D in the order in which they are eliminated; those rows and columns retained the longest are expected to be members of B. To assess the success of our algorithm we calculate the auc AR (i.e., area under the receiver operating characteristic curve) associated with the row-indices of B with respect to the output list from our algorithm. The value AR is equal to the probability that, given a randomly chosen row from B as well as a randomly chosen row from outside of B, our algorithm eliminates the latter before the former (i.e., the latter is lower on our list than the former). We calculate the auc AC for the columns similarly. Finally, we use A = (AR + AC)/2 as a metric of success; values of A near 1 mean that the rows and columns of B were filtered to the top by our algorithm, whereas values of A near 0.5 mean that our algorithm failed to detect B. In the top of Panel-A we show the trial-averaged auc A for our loop-counting method as a function of the noise ε and logM (m). Results for l = 1 are shown on the left; l = 2 is shown on the right. Each subplot takes the form of a heatmap, with each pixel showing the value of A for a given value of ε and logM (m) (averaged over at least 128 trials). The different subplots correspond to different values for M. Note that our loop-counting algorithm is generally successful when the noise ε is sufficiently small and the bicluster size m is sufficiently large. In the bottom of Panel-A we show the analogous auc A for a simple implementation of the spectral method (see S2 Text). In Panel-B we show the difference in trial-averaged A between these two methods (see colorbar for scale). Note that when l ≥ 2 or the noise is small, our loop-score generally has a higher rate of success than the spectral method. On the other hand, there do exist parameter-regimes with l = 1 for which the spectral method has a higher rate of success. In each panel the thin grey line shows the detection-boundary for our loop-counting method.
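The auc AR described above is a Mann-Whitney-style rank statistic on the elimination order. A minimal sketch (the function name and toy data are ours):

```python
import numpy as np

def elimination_auc(order, members):
    """AUC that bicluster members are eliminated later than non-members.

    order   : row indices in the order the algorithm eliminated them
    members : set of indices belonging to the planted bicluster B
    """
    pos = {idx: t for t, idx in enumerate(order)}    # elimination time of each index
    inside = np.array([pos[i] for i in members])
    outside = np.array([pos[i] for i in order if i not in members])
    # Count pairs where the non-member leaves before the member (Mann-Whitney U).
    wins = (inside[:, None] > outside[None, :]).sum()
    return wins / (inside.size * outside.size)

# Toy example: 6 rows; bicluster rows {4, 5} are retained longest.
AR = elimination_auc([0, 2, 1, 3, 4, 5], {4, 5})
print(AR)   # 1.0: every non-member was eliminated before every member
```

The column auc AC is computed the same way on the column elimination order, and A = (AR + AC)/2.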


Fig 4.

Illustration of the GSE48091 gene-expression data-set used in Example-A (see main text).

Each row corresponds to a patient, and each column to a ‘gene’ (i.e., gene-expression measurement): the color of each pixel codes for the intensity of a particular measurement for a particular patient (see colorbar at the bottom). MD = 340 of these patients are cases, the other MX = 166 are controls; we group the former into the case-matrix ‘D’, and the latter into the control-matrix ‘X’.


Fig 5.

Illustration of bicluster found within gene-expression data-set.

Both panels illustrate the same submatrix (i.e., bicluster) drawn from the full case-matrix shown at the top of Fig 4. This bicluster was found using our control-corrected biclustering algorithm (described in S1 Text). In Panel-A we represent this bicluster using the row- and column-ordering given by the output of our algorithm. This ordering has certain advantages (see S2 Text), but does not make the co-expression pattern particularly clear to the eye. Thus, to show this co-expression more clearly, we present the bicluster again in Panel-B, except this time with the rows and columns rearranged so that the coefficients of the first principal-component-vector change monotonically. As can be seen, there is a striking pattern of correlation across the 793 genes for the 45 cases shown.
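A Panel-B-style ordering can be produced by sorting rows and columns so that the coefficients of the leading singular vectors vary monotonically. A sketch on synthetic data (our own illustration, not the paper's code; the noise level and `pca_reorder` name are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the bicluster: rank-1 signal plus noise, order scrambled.
a = rng.normal(size=45)
b = rng.normal(size=793)
B = np.outer(a, b) + 0.3 * rng.normal(size=(45, 793))

def pca_reorder(B):
    """Sort rows/columns so the leading singular-vector coefficients vary monotonically."""
    U, s, Vt = np.linalg.svd(B, full_matrices=False)
    return B[np.argsort(U[:, 0])][:, np.argsort(Vt[0])]

B_sorted = pca_reorder(B)   # correlation pattern is now visible as a smooth gradient
```

After this permutation a rank-1 correlation structure appears as a smooth light-to-dark gradient across the heat-map.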


Fig 6.

Contrasting a bicluster with controls.

This figure shows the bicluster of Fig 5B on top, and the rest of the controls on the bottom. The control-patients have been rearranged in order of their correlation with the co-expression pattern of the bicluster. Even though a few of the controls (i.e., ∼3/166) exhibit a co-expression pattern comparable to that expressed by the bicluster, the vast majority do not.
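One way to produce such a control-ordering is to take the bicluster's leading right singular vector as its co-expression pattern and sort each control-patient by its correlation with that pattern. A sketch on synthetic stand-in data (all sizes and names below are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-ins: a rank-1 co-expressed bicluster (cases) and unstructured controls.
bicluster = np.outer(rng.normal(size=45), rng.normal(size=793))
controls = rng.normal(size=(166, 793))

# The bicluster's co-expression pattern: its leading right singular vector.
pattern = np.linalg.svd(bicluster, full_matrices=False)[2][0]

# Correlate every control-patient with the pattern, then sort (most similar first).
corr = np.array([np.corrcoef(row, pattern)[0, 1] for row in controls])
controls_sorted = controls[np.argsort(corr)[::-1]]
```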


Fig 7.

Illustration of bicluster found within genome-wide-association-study dataset.

In this figure we illustrate the genome-wide association-study (i.e., GWAS) data-set discussed in Example-B (see main text). This data-set involves 16577 patients, each genotyped across 276768 genetic base-pair-locations (i.e., alleles). Many of these patients have a particular psychological disorder, while the remainder do not. We use this phenotype to separate the patients into MD = 9752 cases and MX = 6825 controls. The size of this GWAS data-set is indicated in the background of this picture, and dwarfs the size of the gene-expression data-set used in Example-A (inset for comparison). At the top of the foreground we illustrate an m = 115 by n = 706 submatrix found within the case-patients. This submatrix is a low-rank bicluster, and the alleles are strongly correlated across these particular case-patients. The order of the patients and alleles within this submatrix has been chosen to emphasize this correlation. For comparison, we pull out a few other randomly-chosen case-patients and control-patients, and present their associated submatrices (defined using the same 706 alleles) further down.


Fig 8.

Continuous covariate-distribution for the bicluster shown in Example-B.

As mentioned in the introduction, our algorithm proceeds iteratively, removing rows and columns from the case-matrix until there are none left. One of our goals is to ensure that, during this process, our algorithm focuses on biclusters which involve case-patients that are relatively well balanced in covariate-space. On the left we show a scatterplot illustrating the 2-dimensional distribution of covariate-components across the remaining m = 115 case-patients within the bicluster shown in Example-B (i.e., Fig 7). The horizontal and vertical lines in each subplot indicate the medians of the components of the covariate-distribution. On the right we show the same data again, except in contour form (note colorbar). The continuous-covariates remain relatively well-distributed even though relatively few case-patients are left (compare with Fig 9).


Fig 9.

Continuous covariate-distribution from Example-B as the loop-counting algorithm proceeds.

On top we show several scatterplots, sampling from different iterations as our algorithm proceeds. Each scatterplot illustrates the 2-dimensional distribution of covariate-components across the remaining case-patients at that point in the iteration. The horizontal and vertical lines in each subplot indicate the medians of the components of the covariate-distribution. Below we show the same data again, except in contour form (note colorbar). Note that the covariates remain relatively well-distributed as the algorithm proceeds.


Fig 10.

Row-traces for the bicluster shown in Example-A.

This bicluster was found by running our algorithm on the data shown in Fig 4. Because we corrected for controls, we compare our original-data to the distribution we obtain under the null-hypothesis H0 (see Methods). On the left we show the row-trace as a function of iteration for the original-data (red) as well as each of the 256 random shuffles (blue). On the right we replot this same trace data, showing the 5th, 50th and 95th percentile (across iterations) of the H0 distribution. Because we are not correcting for any covariates, the column-traces are identical to the row-traces.


Fig 11.

A scatterplot of the data shown in Fig 10.

Each row-trace shown on the left in Fig 10 is plotted as a single point in 2-dimensional space; the horizontal-axis corresponds to the maximum row-trace and the vertical-axis corresponds to the average row-trace (taken across the iterations). The original-data is indicated with a ‘⊗’, and each of the random shuffles with a colored ‘•’. The p-value for any point in this plane is equal to the fraction of label-shuffled-traces that have either an x-position larger than xw or a y-position larger than yw, where xw and yw are the x- and y-percentiles associated with the most extreme coordinate of that point (details given in S2 Text). Each random shuffle is colored by its p-value determined by the label-shuffled-distribution. By comparing the original-trace with the shuffled-distribution we can read off a p-value for the original-data of ≲ 0.008.
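A simplified version of this empirical p-value can be sketched as follows. Here we score each trace by its raw max- and mean-coordinates rather than by the percentile-matched rule described above, and all data and names are synthetic stand-ins of our own:

```python
import numpy as np

rng = np.random.default_rng(3)

n_iter, n_shuffle = 200, 256
shuffled = rng.normal(size=(n_shuffle, n_iter))   # stand-in label-shuffled traces (H0)
original = rng.normal(size=n_iter) + 0.5          # stand-in original trace, shifted up

def trace_pvalue(original, shuffled):
    """Fraction of H0 traces that beat the original in the max- or mean-coordinate."""
    x0, y0 = original.max(), original.mean()
    xs, ys = shuffled.max(axis=1), shuffled.mean(axis=1)
    return float(np.mean((xs >= x0) | (ys >= y0)))

p = trace_pvalue(original, shuffled)
print(p)
```

With 256 shuffles, the smallest resolvable p-value is 1/256 ≈ 0.004, which is why the caption reports a bound (≲ 0.008) rather than an exact figure.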


Fig 12.

Illustration of the loops within a 3-dimensional array.

We sketch the structure of a 3-dimensional data-array D, with J rows, K columns and P ‘layers’. Each entry Dj,k,l will lie in the cube shown. The loops within D can be divided into three categories: (a) iso-layer loops that stretch across 2 rows and 2 columns, (b) iso-column loops that stretch across 2 rows and 2 layers, and (c) iso-row loops that stretch across 2 columns and 2 layers. The row-score [ZROW]j aggregates all the iso-column and iso-layer loops associated with row-j. The column-score [ZCOL]k aggregates all the iso-row and iso-layer loops associated with column-k. The layer-score [ZLYR]l aggregates all the iso-row and iso-column loops associated with layer-l.
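The bookkeeping for the row-score can be written out by brute force. The sketch below is a deliberately unoptimized illustration on a ±1 array (array sizes are ours, and any normalization used in the paper is omitted):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
J, K, P = 6, 7, 5
D = rng.choice([-1, 1], size=(J, K, P))

def row_scores_3d(D):
    """[Z_ROW]_j: sum of loop-products over every iso-layer and iso-column loop at row j."""
    J, K, P = D.shape
    Z = np.zeros(J)
    for j in range(J):
        for jp in range(J):
            if jp == j:
                continue
            # (a) iso-layer loops: fix a layer p, vary a pair of columns k < k'
            for p in range(P):
                for k, kp in combinations(range(K), 2):
                    Z[j] += D[j, k, p] * D[j, kp, p] * D[jp, k, p] * D[jp, kp, p]
            # (b) iso-column loops: fix a column k, vary a pair of layers p < p'
            for k in range(K):
                for p, pp in combinations(range(P), 2):
                    Z[j] += D[j, k, p] * D[j, k, pp] * D[jp, k, p] * D[jp, k, pp]
    return Z

Z = row_scores_3d(D)
```

The column- and layer-scores follow the same pattern, each aggregating the two loop-categories that pass through the corresponding slice.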
