Combinatorial Pattern Discovery Approach for the Folding Trajectory Analysis of a β-Hairpin

The study of protein folding mechanisms continues to be one of the most challenging problems in computational biology. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates, such as the fraction of native contacts, the radius of gyration, RMSD from the native structure, and so on. In this paper, we present a combinatorial pattern discovery approach toward understanding the global state changes during the folding process. This is a first step toward an unsupervised (and perhaps eventually automated) approach toward identification of global states. The approach is based on computing biclusters (or patterned clusters)—each cluster is a combination of various reaction coordinates, and its signature pattern facilitates the computation of the Z-score for the cluster. For this discovery process, we present an algorithm of time complexity c∈RO((N + nm) log n), where N is the size of the output patterns and (n × m) is the size of the input with n time frames and m reaction coordinates. To date, this is the best time complexity for this problem. We next apply this to a β-hairpin folding trajectory and demonstrate that this approach extracts crucial information about protein folding intermediate states and mechanism. We make three observations about the approach: (1) The method recovers states previously obtained by visually analyzing free energy surfaces. (2) It also succeeds in extracting meaningful patterns and structures that had been overlooked in previous works, which provides a better understanding of the folding mechanism of the β-hairpin. These new patterns also interconnect various states in existing free energy surfaces versus different reaction coordinates. (3) The approach does not require calculating the free energy values, yet it offers an analysis comparable to, and sometimes better than, the methods that use free energy landscapes, thus validating the choice of reaction coordinates. (An abstract version of this work was presented at the 2005 Asia Pacific Bioinformatics Conference [1].)


Introduction
Understanding protein folding is one of the most challenging problems in molecular biology [2][3][4][5][6][7]. The interest is not only in obtaining the final fold (generally referred to as structure prediction) [8][9][10] but also in understanding the folding mechanism and folding kinetics involved in the actual folding process. Many native proteins fold into unique globular structures on a very short time scale. The so-called fast folders can fold into the functional structure from random coil in microseconds to milliseconds. Recent advances in experimental techniques that probe proteins at different stages during the folding process have shed light on the nature of the folding kinetics and thermodynamics [11][12][13][14][15][16][17]. However, due to experimental limitations, detailed protein folding pathways remain unknown. Computer simulations performed at various levels of complexity, ranging from simple lattice models to all-atom models with explicit solvent, can be used to supplement experiment and fill in some of the gaps in our knowledge about folding mechanisms.
Large-scale simulations about protein folding with realistic all-atom models still remain a great challenge [3][4][5]7]. Enormous effort is needed for this grand problem; one example is the recent IBM Blue Gene project, which is aimed at building a supercomputer with hundreds-of-teraflop to petaflop computing power to tackle the protein folding problem. Meanwhile, effective analyses of the trajectory data from the protein folding simulations, either by molecular dynamics or Monte Carlo, remains yet another challenge due to the large number of degrees of freedom and the huge amount of trajectory data. [18,19] Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus the so-called reaction coordinates [3,20,21]. We and others have used various reaction coordinates [3,20,21], such as the fraction of native contacts, the radius of gyration of the entire protein, the root mean square deviation (RMSD) from the native structure, the number of b-strand hydrogen bonds, the number of a-helix turns, the hydrophobic core radius of gyration, and the principal components (PC) from principal component analysis [20,22]. Searching for better reaction coordinates is still of great interest in protein folding mechanism studies. These analyses have provided important information for a better understanding of protein folding. However, it often requires a priori knowledge about the system under study, and the free energy contour maps usually result in too much information reduction due to their limit in dimensionality, which is often as low as two or three. Thus, better or complementary analysis tools are in great demand.
It is also known that the folding process of many proteins takes the amino acid coil through different intermediate states before stabilizing on the final folded state. Therefore, a first step toward understanding the folding process is to identify these states. In this paper, we propose the use of a combinatorial pattern discovery technique to analyze protein folding trajectory data from simulation experiments. A novel aspect of the current algorithm is that it incorporates arbitrary and possibly different distribution functions of the data in each dimension and guarantees complete and accurate solution to the clustering problem. The procedure involves computations of clusters of the data: each cluster has a signature pattern describing all the elements of the cluster. The simplicity of the pattern leads to easy interpretation of and thus better understanding of the underlying processes and facilitates the computation of a Z-score for the cluster. By appropriate redundancy checks, the number of clusters is made manageably small. The results of this method are threefold. Firstly, the method is validated by comparing its results with previously published results with a free energy landscape analysis. Secondly, the method succeeds in extracting meaningful new patterns and structures that had been overlooked before. These new structures provide a better understanding of the folding mechanism of a b-hairpin, which is used as a case study in this paper. These new patterns also interconnect various states in existing free energy contour maps versus different reaction coordinates. This success encourages us to postulate that the automatic discovery will lead to much greater understanding of the folding process. Thirdly, the method validates the choice of reaction coordinates since the pattern discovery analysis based on these reaction coordinates compares well with the previous free energy based approaches.

Description of Models
Well-known simulation methods exist to carry out the folding of a protein. However, it is often not sufficient to obtain a succinct understanding of the folding process. The task here is to understand the folding mechanism by recognizing structural patterns or intermediate states that the folding process goes through. For example, the folding of a small protein, a b-hairpin, could be understood at a global level in terms of the states shown in Figure 1. Although we would aim to understand the folding of every protein in this simplistic form, the current state of the art is far from this goal.
At each step of the simulation process, a configuration of the solvated protein can be computed. However, the simulation may be carried for nanoseconds to microseconds in units of femtoseconds (10 À15 ), so the number of such intermediate configurations could easily be millions in number. Hence, the task is to identify and capture representative intermediate configurations. Since working in the structure space of the protein is extremely complex, researchers often identify a few key characteristic features of the protein, or often so-called reaction coordinates, and study the trends and variations in these reaction coordinates [21,23].
In this paper, we utilize a four-step process toward understanding the folding of a protein ( Figure 2). The first step involves the in silico simulation that gives rise to a large collection of data points, each point being an array of the characteristic features of the folding protein at that time point. For example, the radius of gyration or the number of hydrogen bonds could be such features. In the Results/ Discussion section, we study the b-hairpin folding as a show case and describe seven such characteristic features that we have used previously in the study of this particular protein.
In the second step, we study these data points to extract the characteristic set of features that we call pattern clusters. Again, in the case of the b-hairpin, the data points are sevendimensional, corresponding to the characteristic features of the protein at each time interval (see Table 1 for a small portion of the data as an example). In the third step, these patterns are filtered to retain the most significant ones. It is very difficult to model the significant patterns in this domain, so we have combined the second and third steps and use appropriate parameters to filter out possibly insignificant patterns: we use cluster size (in terms of rows) and the Z-scores.

Synopsis
The study of protein folding mechanisms continues to be one of the most challenging problems in computational biology. Currently, the protein folding mechanism is often characterized by calculating the free energy landscape versus various reaction coordinates, such as the fraction of native contacts, the radius of gyration, RMSD from the native structure, and so on. In this paper, the authors present a combinatorial pattern discovery approach toward understanding the global state changes during the folding process. This is a first step toward an unsupervised (and perhaps eventually automated) approach toward identification of global states. The authors apply this approach to a b-hairpin folding trajectory and demonstrate that this approach extracts crucial information about protein folding intermediate states and mechanism.
The fourth step is to analyze the patterns. This involves extracting the structure of the configuration using the time coordinates and studying the correlation of the different structures. For instance, one could observe that the hydrophobic core is formed before the b-strand hydrogen bonds, or vice versa; and one can interconnect various free energy states in different free energy surfaces by monitoring the high-dimensional (multi-column) patterns. These findings can provide a better understanding of the protein folding mechanism. Further, the time correlation between various patterns or states could be studied. For example, it is extremely useful to know which pattern or state precedes the other and by how much time.
Here, we describe in detail the second and third steps in our approach, as shown in Figure 2. We model the extraction problem as a combinatorial detection problem for at least three specific reasons: (1) The data are obtained from a replica exchange molecular dynamics (REMD) method [24] (more details below). This method is essentially a Monte Carlo method; thus, the time series is not strictly real time due to the random Monte Carlo exchange process. Also, our interest is in finding pattern clusters that are not necessarily correlated in time. (2) This emphasizes that any probabilistic (or non-deterministic) component can be isolated from the algorithm and the problem. Any high-frequency noise can be largely resolved through an introduction of a d function (see below). (3) The signature pattern of the cluster helps interpret the clusters quite easily. Also, in comparison to the straightforward grouping or clustering algorithms in previous publications [21,25], this provides a complete and efficient (in linear time) method to find the signature patterns. It must be pointed out that this is the critical reason why we chose to use this method, since this enables us to have a tighter control on an acceptable cluster that is also meaningful in terms of the folding process.
A small but important protein system has been selected as an example to demonstrate our approach to understanding the folding process. This small protein is a 16-residue b-hairpin (GEWTYDDATKTFTVTE) from the C-terminus of protein G (residues 41-56 of PDB file 2gb1.pdb). Its folding mechanism and folding free energy states have been studied extensively in previous works [21,23]. The current study will use our new approach to analyzing the existing trajectories from the previous REMD simulations in explicit solvent [21,24]. The REMD method couples molecular dynamics trajectories with a temperature-exchange Monte Carlo process for efficient sampling of the conformational space. In this method, replicas are run in parallel at a sequence of temperatures ranging from the desired temperature to a high temperature at which the replica can easily surmount the energy barriers. From time to time, the configurations of neighboring replicas are exchanged and this exchange is accepted by a Metropolis acceptance criterion that guarantees the detailed balance. Because the high-temperature replica can traverse highenergy barriers, this provides a mechanism for the lowtemperature replicas to overcome the quasi-ergodicity they would otherwise encounter in a single-temperature replica.
This b-hairpin has received much attention recently from both experimental and theoretical fronts [11,13,14,18,20,[26][27][28][29][30]. The b-sheets and a-helices are the key secondary structures in proteins. It is believed that understanding the folding of these elements will be a foundation for investigating larger and more complex structures. The study of isolated b-sheets has for a long time been limited by the lack of an amenable experimental system. The breakthrough experiments by Serrano [11] and Eaton [13] groups have recently Table 1. A Small Portion of the Raw Data from the REMD Sampling of the b-Hairpin Folding in Explicit Water For simplicity, we refer to J1, J2,. . .,J7 rather than the detailed reaction coordinate names in the following tables and text.
Step 2 extracts the recurring patterns, reducing the size of the data to be studied down to thousands.
Step 3 further reduces down this to a representative set of a handful states, which are studied in detail in Step 4. The structures are extracted, and a possible state diagram summarizing the path of the folding protein is elucidated. DOI: 10.1371/journal.pcbi.0010008.g002 established this b-hairpin as the system of choice to study bsheets in isolation. These pioneering experiments inspired a number of theoretical works on this system with various models [18,20,21,26,27,31,32]. However, there are still a number of important aspects that remain controversial, such as the relative importance and time-sequential order between the b-strand hydrogen bonds formation and the hydrophobic core formation, and the existence of a-helical intermediates during the folding.

Simulation Parameters
In this study, an all-atom model-The Optimized Potential for Liquid Simulations-All-Atom force field [33] with an explicit solvent model, Simple Point Charge model [34]-is used for the description of the protein solvated in water. A total of 64 replicas of the solvated system consisting of 4,342 atoms is simulated with temperatures spanning from 270 K to 695 K. For each replica, a 3-nanosecond molecular dynamic simulation is run with replica exchanges attempted every 400 femtoseconds. The reader is directed to [21,23] for details of this simulation. For each conformation, seven different reaction coordinates are used (Table 1). There are a total of about 20,000 conformations saved for each replica. Table 1 lists a small portion of the data for the replica at 310 K, which is the biological temperature.
These simulations have revealed a hydrophobic-coredriven folding mechanism from free energy contour map analysis [21]. Since this is a well-studied system and a large amount of data is available, comparisons with other analysis tools, such as the free energy contour map analysis, might be easier and more straightforward. Various reaction coordinates obtained from previous runs serve as the starting point.

Discovery Parameters
Although we developed the framework for a very general d function, for simplicity, in this section we treat d(x) to be a constant function. Thus, d(x) ¼ c for some constant c 2 R for each x. The d functions for each column of Table 1 is given as follows: Further, the quorum k is defined to be 2,000. Table 2 lists some representative patterns of size two with these parameters. The time sequences are not shown due to the space constraints. These simple patterns can be directly compared with the previous free energy states in the three-dimensional free energy contour maps. These are three-dimensional plots of free energy versus a pair of reaction coordinates or data columns of Table 1.
One might often want to study detailed patterns or structures in some predefined subregions such as the structures in the unfolded ensemble. More evidence has shown that the protein structures in unfolded states are not fully extended but often have well-defined structures instead [35]. This can also avoid the problem that important patterns in these less populated areas are being overlooked due to a smaller population than the predefined quorum k. Thus, some less populated free energy states in free energy landscapes can be recovered by reducing the quorum. Hence, another set of parameters have been used, and here we confine our search to data points with N b HB ¼ 0:0 and R core g . 5.0 Å (see Table 1 for definitions of these reaction coordinates) with k ¼ 100. Yet another set of parameters have included N b HB ¼ 0:0 and R core g . 9.0 Å with k ¼ 50. A subset of the results is shown later. Thus, this approach might be useful for hierarchical pattern searches that gradually zoom into the predefined subsets of data.

Analysis of Results
To obtain a representative structure(s) from a set of configurations c i , the set is partitioned into a minimum number of groups G j such that for each G j there exists a representative c j i 2 G j , and for each c k 2 G j the structure corresponding to c k is at most 1 Å RMSD from c j i . Thus, each G j will be represented by a structure corresponding to c j i [21,26]. Recovering known free energy states. Obviously, the first question of importance is: Can we recover the previously found free energy states in the new approach? The ''time sequence'' of each pattern is then used to extract the corresponding conformations of the protein. Figure 3A shows a representative or most populated structure for the first pattern ( J 1 ðN b HB Þ¼4.886 6 0.2, J 2 ðR core g Þ¼5.448 6 0.6 ) in Table  2. This structure mimics the representative structure from the folded state (F state) in the free energy contour map versus N b HB and R core g very well. Thus this pattern resembles the F state of the free energy contour map. Similarly, the second pattern of Table 2 ( J 1 ðN b HB Þ ¼ 2.875 6 0.2, J 2 ðR core g Þ ¼ 5.448 6 0.6) resembles the partially folded state, P state, in the same free energy landscape. The structures for the two patterns are shown in Figure 3. Thus, our approach recovers the most populated states in the free energy landscape analysis.
The third and fourth patterns in Table 2 also resemble the F state and P state, respectively, in the same free energy contour map versus N b HB and R core g . Numerous other patterns have shown similar results, i.e., recovering various previously found free energy states in the free energy contour maps versus different reaction coordinates. It should be noted, though, that many patterns might be redundant, either because the d() function values given for reaction coordinates are too wide, or because some of the reaction coordinates are highly correlated. For example, the fifth pattern of Table 2 is R core g ¼ 4.979 6 0.6, R g ¼ 8.144 6 0.35. Clearly, these two reaction coordinates are highly correlated, since R core g measures the radius of gyration of four key residues out of the total 16 that are measured by R g . However, for many other cases, it may not be so obvious.
Interconnecting various free energy landscapes. More complicated patterns with many reaction coordinates are also found in the current approach, which had been previously undetected. In the traditional free energy landscape analysis, typically one or two reaction coordinates are used at each time, since a two-or three-dimensional free  (1) energy contour map is usually plotted. It is extremely difficult to visualize high-dimensional free energy landscapes in order to identify the free energy basins or barriers. Table 3 lists some of these complicated patterns with up to six reaction coordinates. Of course, as pointed out earlier, some reaction coordinates might be correlated, so the data in each reaction coordinate may not be totally independent. Nevertheless, it still reveals some interesting new findings. First of all, these patterns can interconnect various free energy states in different free energy landscapes. This might not be so obvious in free energy surfaces themselves. For example, the sixth pattern in Table 3 Figure 4A), and the other versus q and R g ( Figure 4B). The states corresponding to the free energy well (of value ' À8 KT) near PC-1 ¼ À5.9, PC-2 ¼ À33.6 in Figure 4A and q ¼ 0.82, R g ¼ 8.1 in Figure 4B are the same free energy state since they consist of the same clusters in the same pattern. In this particular case, obviously they all represent the folded state (F state).
Understanding folding mechanism better. More importantly, the new approach reveals important structures overlooked previously, which might help understand the folding mechanism better. Eaton and coworkers [13,14] proposed a ''hydrogen bond zipping'' mechanism for this b-hairpin, in which folding initiates at the turn and propagates toward the tails by making b-strand hydrogen bonds one by one, so that the hydrophobic core, from which most of the stabilization derives, forms relatively late during the folding. In our previous study, we proposed a different folding mechanism, in which this b-hairpin undergoes a hydrophobic core collapse first, then makes native b-strand hydrogen bonds to make over the free energy loss due to the loss of H-bonds between the backbone atoms and water. Figure 5A shows a representative structure for the eighth pattern in Table 3 . The structure shows that all five native b-strand H-bonds have been formed, but the hydrophobic core is not completely aligned yet. The loop region also bends toward the hydrophobic core to somewhat offset the non-perfect hydrophobic core. These structures with Hbonds that are formed but with their hydrophobic core not perfectly aligned (RMSDs up to 4 Å ) imply that the hairpin can also have a path to form b-strand hydrogen bonds before the core is finalized. The current findings indicate that the final hydrophobic core and b-strand hydrogen bonds might be formed almost simultaneously. This can also be seen from the low free energy barrier in free energy landscapes as discussed before [21]. Interestingly, Thirumalai et al. also found that the lag time between collapse and hydrogen bond formation is very short and the two processes occur nearly simultaneously [32]. It should be pointed out that the turn (loop) formation is critical in this b-hairpin folding mechanism, since the hydrophobic core and b-strand hydrogen bonds need to be packed or formed at right positions. Interestingly, this is also reported by other groups [15][16][17]. For example, Gai and coworkers studied a related b-hairpin, Trp-zipper hairpin, and found that the rate-limiting event corresponds to the turn formation [15,16]. Moreover, the authors pointed out that a stronger turn-promoting sequence increases the stability of the hairpin primarily by increasing its folding rate, whereas a stronger hydrophobic cluster increases the stability primarily by decreasing its unfolding rate [15,16]. Finally, the patterns of subsets of data in less populated states, such as the unfolded state, are studied in detail by zooming into these regions with a smaller quorum k and a different set of d(). As mentioned earlier, more evidence has shown that the protein structures in unfolded states are not fully extended, but often have well-defined structures instead [35]. The first pattern in Table 4

) resembles the previous H-state in free energy contour map versus N b
HB and R core g , where the hydrophobic core is largely formed but no native b-strand H-bonds have been made yet. Figure 5B shows a representative structure of this pattern, which mimics the structures from previous H-state very well. Figure 5C shows a representative structure for the sixth pattern in Table 4 . This is the most populated structure of this b-hairpin in unfolded state. Even though not many structural features are found in this structure, it is certainly not fully extended either. Since this is a very small protein with only one secondary structure in the native state, not much has been identified in the unfolded state; for larger and more complicated protein systems, such as lysozymes, more structural features might be expected in the unfolded state as found by recent experiments [35].   Table 3, which represents a new class of structures previously overlooked in free energy landscape analysis. (B) Pattern 1 of Table 4, which captures the H state (hydrophobic core formed but no b-strand H-bonds) in free energy contour map analysis [21]. (C) Pattern 2 in Table 2 captures the unfolded state (U state) in the same free energy contour map. The hydrophobic residues TRP43, TYR45, PHE52, and VAL54 are represented by spacefill, and the rest are represented by ribbons. DOI: 10.1371/journal.pcbi.0010008.g005

Conclusion
In this paper, we have presented a method to enhance our understanding of protein folding mechanisms. At the heart of this method is a combinatorial pattern-discovery algorithm that analyzes multi-dimensional data from the simulation of the protein folding trajectory. The approach is based on pattern computation, each pattern being defined by a cluster of the reaction coordinates. A small but important protein system, a b-hairpin from the C-terminus of protein G, is then used to demonstrate this approach. It is shown that the method not only reproduces the previously found free energy states in free energy contour maps, but also reveals new information overlooked previously in free energy landscape analysis about the intermediate structures and folding mechanism. It is also shown to be useful in making interconnections between various three-dimensional free energy surfaces versus different reaction coordinates and also explains the mechanism behind the folding process. The method also validates the choice of reaction coordinates as the analysis without using free energy values compares well with the ones that use them. The success with b-hairpin is very encouraging, and we are currently exploring the application of this method to other larger protein molecules.
As stated in the Introduction section, it is important to study the time correlation between various patterns or states. For example, it is extremely useful to know which pattern or state precedes the other and by how much time. Of course, this requires real-time trajectory data. The current study uses the previous trajectories of REMD, which is a Monte Carlo method; thus, the time sequence in the data points is not real time. After this method's success with the current data, we believe that we will be able to garner time correlation of the patterns, and we are currently investigating this.

Materials and Methods
We first define the problem at hand and then give a linear time algorithm to solve the problem. The number of clusters can be easily controlled by the use of an appropriate d() function (see below).
Combinatorial problem description. In this section, we describe the combinatorial problem. Here, we also make some simple observations that have quite useful and practical implications (such as linear number of d-clusters and so on). They also indicate to the extent different functions (such as the form of d()) can be relaxed without sacrificing the general framework presented in this section. A reader may skip the statements and the proofs of these observations without any loss of continuity. Definitions 1 and 2 identify the pattern discovery or the clustering problem used in this paper, and the Results/Discussion section describes an output-sensitive algorithm to discover them.
Although using a general d() function opens the possibility of various pre-processing of the data, it is important to identify a reasonable d() function. We impose the following condition on d(), calling it the constrained d function. Given any three data elements This is a reasonable condition on an acceptable d() function, as can be seen from the consequence of the imposed constraint in Lemma 1. A multitude of continuous functions satisfy this condition, and in the rest of the paper we will assume that d() function we use also satisfies this condition.
Thus, the containment of the intervals is as shown; hence, for each m i , m min , m i , m max , v i 2 d-cluster.
Lemma 2. The number of maximal d-clusters is no more than n where d() is constrained.
By Lemma 1, any d-cluster is an interval (contiguous elements on the sorted list) on the sorted list of data elements. We will show that any two intervals that correspond to two maximal d-clusters cannot be such that one is contained in the other. Assume the contrary that one is contained in the other. Clearly, by the definition of maximality, the smaller interval is not maximal, leading to a contradiction. As no interval is contained in the other, it is possible to assign a unique element on the sorted data elements to each interval. Thus, the number of intervals cannot exceed the number of data elements, hence the result. Corollary 1. If d(x) ¼ c for some c 2 R, then the number of d-clusters is no more than n.
The bicluster takes into account the different columns or features in the data: the natural definition of such a cluster is given below.
Definition 2. (bicluster, maximal bicluster) Given d j () : R ! R þ , quorum k and v ij 2 R, 1 j m, 1 i n. A bicluster is collection i and j with v ij 2 V c such that for each j, fv ij 2 V c j1 i ng is a d j -cluster. Further, V c is maximal if there exists no additional i9 or j9 with the corresponding V c with V c & V c V such that V c is a bicluster.
For ease of reference, the bicluster will be also called a pattern cluster since a cluster can be represented by the signature pattern (J 1 ¼ c 1 , J 2 ¼ c 2 ,..., J L ¼ c L ), where v iJk 2 V c , 1 k L. These J 1 , J 2 ,..., J L represent various reaction coordinates from the protein folding trajectory (shown in Table 1). This representation is more suitable for interpreting the results, as seen in other sections of this paper. The size of the bicluster is L, and k is the number occurrences or quorum of the cluster. Table 4. Clusters with (1) J 1 ¼ 0.0, J 2 ! 5.0, k ¼ 50 and (2) J 1 ¼ 0.0, J 2 ! 10.0, k ¼ 100 Lemma 3. The following are a consequence of the maximality constraint: (1) If a collection of i is such that v ij 2 V c where V c is a maximal d-cluster for some j, then there exist no other maximal dcluster V c neV c such that v ij 2 V c . (2) If a collection of j is such that v ij 2 V c where V c is a maximal d-cluster for some j, then there exist no other maximal d-cluster V c neV c such that v ij 2 V c .
Lemma 4. Given v ij 2 R, 1 j m, 1 i n. the number of maximal biclusters is no more than n 2 m.
In a maximal bicluster V c for some j, fv ij 2 V c g is not necessarily maximal. The number of such clusters by Lemma 2 can be no more than n 2 . By Lemma 3, this can belong to only one maximal bicluster. Thus, there can be no more than n 2 m maximal biclusters, since there are m columns.
The linear time algorithm. Similar descriptions of bicluster detection appear in [36], in which the authors present only an empirical time bound (linear with output size). G. Alexe and P.L. Hammer also present an incremental polynomial time algorithm with a total running time of O(Nnm 2 ) (personal communication). N is the number of patterns in the output, and (n 3 m) is the size of the input. In this section, we present an output-sensitive algorithm that computes all the maximal biclusters. The algorithm has two main steps. In the first step, the maximal d-clusters are computed, and in the second step, the maximal biclusters are computed using the clusters of the first step.
Step 1: Maximal d j -cluster computation. For each j, 1 j m, compute the maximal d j -cluster, V j l . For simplicity, let the number of these be L j and the clusters be V j l , 1 l L j and they are computed as described below. We present a simple algorithm that does a linear scan of the sorted entries v ij for each fixed j using two pointers i and l: i tracks the start of the cluster, and l tracks the end of the cluster. The end pointer is incremented until it is no longer a cluster satisfying the d() function, and only then the start pointer is incremented. The pseudocode, Compute-Cluster(), describes the maximal d-cluster computations, for each j. To avoid clutter, the end-of-input check is not included in the code.
Compute-Cluster() (1) Sort the m i 's to obtain m 1 , m 2 , . . ., m n Step (3) Next, for each m ij , 1 i n, 1 j m, a set of d-clusters v9 ij is computed as follows: v9 ij ¼ fV j ' jvij 2 V j ' ; 1 ' L j g: Step 2: Maximal bicluster computation. The algorithm in this step is based on the set intersection problem described previously [37] in the context of computing redundant motifs from irredundant ones. The algorithm works on v9 ij ,1 i n,1 j m, of the last step.
We describe a simple recursive algorithm to solve this problem. This algorithm implicitly constructs a tree in a depth-first manner where (1) each level corresponds to a distinct j, hence the height of the tree is m, and (2) each non-leaf node at level l corresponds to j ¼ (m À l) (the root at level 0 corresponds to (j ¼ m), and has at most (L j þ 1) children, the 'th child, 1 ' L j , corresponds to the d-cluster V j c' and the very last child ([L j þ 1]th child) ignores the V j ' d-clusters. The algorithm is efficient due to the two following factors: (1) use of a data structure (D in the pseudocode below) to store the maximal biclusters, so that searching for an arbitrary one can be done quickly, and (2) terminating the tree traversal appropriately. The data structure suggested for use is a tree so that each query takes log n time. The terminating condition (line [2.4] of the pseudocode) is such that each leaf node corresponds to either the maximal bicluster defined by the d-clusters (feature values)fV The pseudocode of the recursive routine Generate-Set() shown below, describes the algorithm. Assume a function Add-set (R,C), which inserts R, a subset of integers between one and n, in a tree data structure D, along with the accompanying set C: then a query of the form if a set R exists in D takes O(log n) time. The initial call is Generate-Set (f1,2,. . .,ng,/,m)).
Analysis of the algorithm. We first show that the algorithm is correct in computing all the maximal biclusters and next show that the algorithm runs in time linear with the size of the output.
Correctness of the Algorithm. We first show that each computed (R,C) is a bicluster. By the construction, for each j, fv ij ji 2 Rg is a dcluster. Thus (R,C) is a bicluster. Next, we have to show that it is maximal. Assume it is not. Then there exists v ij such that V k ¼ R [ fv i9j jv ij 2 R for some jg is a bicluster. Hence for each j, fv ij jv ij 2 R9g is a d-cluster. Then in the subroutine call Generate-Set (R,C,i9) of the pseudocode , this set must have been created, leading to a contradiction. Hence, the assumption is wrong.
Next, assume there exists v ij 9 such that R9 ¼ R [ fv ij 9 jv ij 2 R for some ig is a bicluster. Hence for each j, fv ij jv ij 2 R9g is a dcluster. Then in Step 3 of the subroutine call Generate-Set(R,C,i), V d corresponding to j9 must have been included, leading to a contradiction. Hence, the assumption is wrong. Thus, all the computed sets are maximal biclusters. By similar arguments, it is easy to see that if there is any maximal bicluster defined on the data set, it must one of the computed R's.
Complexity of the Algorithm. Assume the input elements are m ij , 1 i n, 1 j m. Consider the first step of computing the d-clusters for each j. The sorting of the elements m i , 1 I n takes O(n log n) time. The algorithm works by scanning the input from left to right, say i to I þ s, where the set fm i , m iþ1 ,. . ., m iþs g is a maximal d-cluster. Then the input is scanned from i þ 1, i þ s þ 1, i þ s þ 2,. . . onwards and so on. Thus, each data element is visited no more than twice. Assuming the comparison can be made in constant time, this step of the algorithm takes O(n log n þ n) ¼ O(n log n) time for each j.
Next, consider the second step of computing the maximal biclusters. Notice that the search in Step (2.4) of the subroutine Generate-Set can be done in log n time. In the recursive-call tree structure (of the subroutine Generate-Set), each leaf node corresponds to a maximal bicluster. In a tree, the number of internal nodes is bounded by the number of leaf nodes and each leaf node is hit only as many times as the number of features in each pattern, thus assuming the output size is N (the total number of features in all the maximal biclusters) and the second step of the algorithm takes O(N log n) time. Thus, the time taken by the complete algorithm is O((nm þ N) log n), where N is the size of the output and nm is the size of the input.