G = MAT: Linking Transcription Factor Expression and DNA Binding Data

Transcription factors are proteins that bind to motifs on the DNA and thus affect gene expression regulation. The qualitative description of the corresponding processes is therefore important for a better understanding of essential biological mechanisms. However, wet lab experiments targeted at the discovery of the regulatory interplay between transcription factors and binding sites are expensive. We propose a new, purely computational method for finding putative associations between transcription factors and motifs. This method is based on a linear model that combines sequence information with expression data. We present various methods for model parameter estimation and show, via experiments on simulated data, that these methods are reliable. Finally, we examine the performance of this model on biological data and conclude that it can indeed be used to discover meaningful associations. The developed software is available as a web tool and Scilab source code at http://biit.cs.ut.ee/gmat/.

where (·) + denotes the Moore-Penrose pseudoinverse of a matrix, I denotes a properly-sized identity matrix and K and L are any two n M × n T matrices. The minimum norm solution to the problem (5) can be computed aŝ Theorem 2 The problem (5) has a unique solution if and only if the columns of M and the rows of T are linearly independent, that is, rank(M) = n M and rank(T) = n T . The corresponding solution can be computed aŝ where (·) T denotes matrix transposition. † University of Tartu, Tartu, Estonia. ‡ Quretec Ltd, Tartu, Estonia.
Proof of Theorem 2. First, note that the objective function in (5) is a quadratic convex function simply because the problem is equivalent to the least squares estimation of a "traditional" linear model. Therefore, there must exist a minimum, not necessarily unique, which is achieved precisely in the point(s) where the gradient becomes 0. Hence, to find the minimum, we solve as follows: We take the derivative of the sum of squares: We divide the last equation by −2 and rewrite it conveniently in matrix form: Proof of Theorem 1, Equation (21). As we know from the proof of Theorem 2, the solutions to the problem (5) are precisely the solutions of equation (24). It remains to show that these are precisely all matrices of the form (21).
Sufficiency: LetÂ LS be given by equation (21). Then it is a solution of (24). To show it, we first note that, due to the properties of Moore-Penrose pseudoinverses, Then we can establish thatÂ LS = A . To see that, simply substitute the expressions for K and L . We do it in several steps. First, note that for any X: Finally, substitute (29) and (30) into (28): As we know, A satisfies (24). Therefore, it follows that Due to the properties of Moore-Penrose pseudoinverses, this further simplifies to Substituting (32) into (31) produces the desired result and thus completes the proof.

Ridge Regression
The following lemma provides some rationale and intuition for the G=MAT ridge regression estimate.
Lemma 1 (2-step linear regression) LetÂ LS be the minimum norm least squares fit (22) for a given dataset (G, M, T). It is possible to computeÂ LS in two steps: 1. First computeĈ LS as a minimum norm least squares fit for the model G = MC.
2. Then, computeÂ LS as a minimum norm least squares fit for the model Proof. The minimum norm least squares fitĈ LS for the model G = MC can be computed as:Ĉ LS = M + G . The minimum norm least squares fitÃ LS for the modelĈ LS = AT can be computed as:Ã LS =Ĉ LS T + . By combining the two equations above, we see that finalÃ LS is indeed equal tô A LS from equation (21): The observation above shows, that the matrix A in G=MAT can actually be computed step by step: • First express the gene expression values G as linear combinations of motifs: G = MC. This results in a n M × n A matrix C, that, in a sense, shows the "activity" of each motif in each experiment.
• Next, express this "motif activity" in each experiment as a linear combination of transcription factor activities: C = AT. Now the resulting A can be interpreted as showing the "transcription factor contribution to motif contribution to gene activity", which is exactly how we interpret A in G=MAT. Of course, the order of the steps can be reversed. Now we can easily interpret the ridge-regression estimate as a two-step linear regression with 2 -regularization at each step.

Sparse Regression
As it is mentioned in the main text, the optimization problem (12) has to be solved using iterative methods. In our experiments we considered two approaches: Iterative Thresholding [DDDM04] and Least Angle Regression (LARS) [EHJT04].
Both algorithms can be adapted slightly to the G=MAT model by noting that the expression M T GT T can be used to compute the inner products of all features with the output simultaneously (this is the analogue of the expression X T y in the classical linear model literature). The computational complexity of such an expression is O((n T + n M )n G n A ). This or a similar computation is typically the dominant part of every iteration in most standard sparse regression algorithms.
The LARS algorithm, in addition, requires the computation of the equiangular vector. On the k-th iteration the latter operation typically requires the inversion of a k × k matrix, and can therefore dominate the complexity as the number of iterations approaches the total number of features n M × n T in the model.
We refer the reader to the original papers [DDDM04,EHJT04] and to the G=MAT sample implementation on the website [Tre] for more details.

Correlation-based Estimate
Theorem 3 Assume that random variables satisfy the condition (16). If the variables M and T k are not constant and are pairwise independent from other random variables M 1 , . . . , M nM , T 1 , . . . , T nT , then To prove this theorem we shall use the following result.
Lemma 2 Let M , T k be pairwise independent random variables. Then Proof. Using the equation cov(X, Y) = E(XY) − E(X) E(Y) and remembering that M and M λ are independent of T k and T κ , we compute: For = λ, this equation evaluates to: because E(∆M ) = 0. Similar result holds for k = κ. Finally, for = λ, k = κ: which completes the proof.
We can now prove the main theorem.
Proof of Theorem 3. By definition of random variables G, M and T k : because of the linearity of the covariance operation. The last term in the sum (44) is zero: cov(ε, ∆M · ∆T k ) = 0 , because ε is independent of M and T k . All the remaining terms can be evaluated using Lemma 2. It follows that the only nonzero term in the sum (44) is the one where λ = and κ = k. The whole equation now becomes cov(G, ∆M · ∆T k ) = α k D(M ) D(T k ) , from which the result follows.

Centered Least Squares
It turns out, that it is possible to compute a good approximation to the correlationbased estimate efficiently using the familiar least-squares technique. Let (G, M , T k ) be random variables that satisfy the conditions of Theorem 3. Let (G, M, T) be a G=MAT dataset obtained as a sample of these variables. In the following we prove that if we apply the least squares method to the centered versions of matrices G, M and T, we shall obtain consistent estimates for α k .
In the following, let M ( * ) denote the average of the values in the -th column of M, let T (k * ) denote the average of the k-th row of T and let G denote the average of all elements of G. Let ∆M be the column-wise centered version of the matrix M, ∆T be the row-wise centered version of the matrix T, and ∆G be the centered version of the matrix G. That is, Finally, letÂ CLS be the least-squares estimate for the model: Then the following result holds.
Theorem 4 (Centered least squares) Let (G, M , T k ) be random variables, satisfying the conditions of Theorem 3. Let (G, M, T) be a sample, obtained from these variables and letÂ CLS be obtained as described above. Then as the sample size increases, i.e., n G , n A → ∞, the elements of the matrixÂ CLS converge almost surely to the true model parameters (α k ).
Proof. First, note that in the process n G → ∞ the matrix 1 nG ∆M T ∆M converges to the matrix of motif covariances. Indeed, which, by strong law of large numbers, converges almost surely to cov(M , M λ ), and therefore Similarly, in the process n A → ∞: Thus, the matrices ∆M T ∆M and ∆T∆T T converge with probability 1 to diagonal matrices with nonzero elements on the diagonal. The latter matrices are invertible and therefore, for sufficiently large n G and n A , the matrices ∆M T ∆M and ∆T∆T T are also invertible, because the determinants of those matrices must converge to nonzero values. Hence, for sufficiently large n G and n A the least squares estimateÂ CLS can be computed using equation (23): Finally, the matrix must also obey the law of large numbers: The equations (45), (46) and (48) demonstrate that all terms on the right side of the equation (47) converge. Therefore,Â CLS must converge too: by Theorem 3.

Randomization-based Attribute Selection
In the main text, we proposed a z-score-based method for identifying the most "interesting" coefficients. Analogously to estimating z-scores for the coefficients, we might measure their p-values: A low p-value indicates that obtaining value larger thanα k is rather unprobable and thus the corresponding coefficient is significant. Another useful nonparametric test is obtained by estimating the sampling distribution of the coefficients using a resampling technique like bootstrap. In particular, we propose to measure the positivity weight of the attribute α k as the probability that its estimated value will be greater than zero: In our experiments we use the ridge-regression estimation method for the function A(·), because it is the most efficient of the available strategies.

Model Performance Analysis
In order to assess the model's performance we applied it to several artificial and real-life datasets. As we are mainly concerned here with the question of whether the model is, in general, useful for dealing with biological data rather than the specific results, we limit our presentation with a single detailed case-study here. The results of model evaluation on the other datasets (yeast stress response, human lung fibroblast transcription profiling) are similar in spirit. These can be accessed from the supplementary website. Data preparation Some preprocessing was required to obtain the matrices G and T.

The Spellman Dataset
• The dataset contained missing values. We imputed them using the KNNimpute algorithm [TCS + 01] implemented in Matlab.
• Of all the genes present in the dataset, 318 had the transcription regulator activity GO annotation (according to the data of the SGD project [SGD]). We selected these 318 genes as transcription factors. The expression values of these TFs were collected in the 318 × 77 matrix T.
• From other 5860 genes we selected 5766 for which the genomic sequence was available via the SGD website. The expression values of these genes were collected in the 5766 × 77 matrix G.
Next, we prepared the motif matrix M as follows.
• For each target gene we downloaded its promoter region: 800-nucleotidelong genomic sequence upstream of the gene's translation start site. We obtained the data from the SGD website.
• We used the TRANSFAC database of regulatory motifs and selected the 36 known yeast motifs there.
• We used the storm [SSZ07] tool to match TRANSFAC motifs on the promoter. As a result we obtained the 5766 × 36 binary matrix M.
G=MAT analysis. Once the data was in the form of the G, M and T matrices, we were free to apply any of the G=MAT parameter estimation methods. We chose to apply the least-squares estimate (22), because this method is the most straightforward and requires setting no parameters. Recall that each coefficientα k of the resulting parameter matrixÂ LS * measures a putative association between motif m and TF t k . Table 1 presents the pairs of motifs and TFs that correspond to the 5 coefficients of the matrixÂ LS * with the largest values.
Motif TF Score Binding site for GAL4.
Galactokinase, phosphorylates alpha-D-galactose to alpha-D-galactose-1-phosphate in the first step of galactose catabolism.
Binding site for GAL4.
Transcriptional regulator involved in activation of the GAL genes in response to galactose.
Binding site for GAL4.
Transcriptional regulator involved in the repression of GAL genes in the absence of galactose. Transcription factor that activates expression of early G1specific genes, localizes to daughter cell nuclei after cytokinesis and delays G1 progression in daughters, localization is regulated by phosphorylation.

Evaluation of Results
As we have chosen a true biological dataset for the experiment, we can not know the corresponding "true" parameter matrix A. As a result, we lack any gold standard for objectively assessing the relevance of the obtained results and have to limit ourselves with the following options: • evaluating the biological meaningfulness of the results manually, • evaluating the generalization ability using a split-set experiment, • examining the predictive performance of the model.
The results evaluations seem to differ radically.
Biological relevance of the results. The top-scoring motif-TF pairs, presented in Table 1, make complete sense. In particular, the top three entries associate the known binding site of the Gal4p transcription factor with the Gal1p, Gal3p and Gal80p proteins. This is in perfect accordance with current biological knowledge [LVZ95].
• Overexpression of galactokinase-coding gene GAL1 results in activation of Gal4p protein [BOH90,BH92]. This process is probably the reason for strong positive association between GAL1 expression and presence of the Gal4p binding site: high expression of GAL1 induces expression of Gal4p, which binds to that motif.
• GAL3 is a gene highly similar to GAL1 [PRHR00]. The corresponding protein Gal3p forms a complex with Gal80p and thus relieves inhibition of Gal4p [LVZ95]. This explains the positive association of GAL3 with the Gal4p binding motif.
• Gal80p is a transcriptional inhibitor of GAL genes (in particular, GAL4 ). Inhibition is relieved by Gal1p or Gal3p binding [TRR02]. Therefore, we might expect Gal80p to be strongly negatively associated with the Gal4p binding site: the higher the expression of Gal80p, the more repressed is Gal4p, the less effect it has on the genes with the corresponding motif in the promoter. Our analysis, on the contrary, indicates a positive association. This might be due to the complex regulatory feedback loops involved in the regulation of the whole family of GAL genes. Such nonlinear relations cannot be accounted for by our simple linear model. Nonetheless, we consider the discovered relation a success rather than a failure of the analysis.
It is very unlikely that this result (three obviously successful pairs among the top five) could be obtained by chance. To verify that, we considered all motif-TF pairs, where the TF belonged to the same family of proteins as the binding factor for the motif. For example, in the above case, GAL1, GAL3 and GAL80 all belong to the same family as GAL4 which is the binding factor for motif F$GAL4 01. There were 67 such matching pairs among the 36 × 318 coefficients in the parameter matrix. If we were to pick 5 pairs at random, we would expect less than 0.03 hits on average. Getting 3 hits instead exceeds the expectations more than 100-fold.
Generalization ability. In order to test the generalization ability of the model we have performed a split-set experiment on the Spellman data. Namely, the whole dataset of 77 experiments was divided into two nonintersecting parts consisting of the first 40 and the last 37 experiments respectively. Applying the G=MAT inference methods on these two datasets resulted in top-ten lists of pairs that typically contained between 3 and 4 common TF-motif pairs, depending on the chosen parameters. This can be regarded as an indication that the model does generalize well on biological data. Indeed, the p-value of this result, i.e. the probability of obtaining two top-ten lists with 4 common elements from a "random" algorithm, is less than 10 −10 .
Predictive performance. We could expect that if our model parameters have biological relevance and are reproducible in split-set experiments, the model should predict well. Unfortunately, it turns out that the predictive performance of the constructed model is very poor. The coefficient of determination of the model is barely above 0.05, see below.
Mean squared error: 0.1494 Variance of G: 0.1576 Coefficient of determination, R 2 : 0.0520 It might seem surprising, that a model which produces biologically meaningful results predicts so badly. However, this is mostly a flaw of intuitionthe model is in fact highly significant. Indeed, note that in terms of motifs we have a linear model that is capable of explaining 0.0082 units of variance out of 0.1576 using just 38 parameters out of a maximum 5766. This corresponds to an F-value of F = 0.0082/37 0.1576/5728 ≈ 8.06 which is highly significant (p-value 10 −5 according to the F (37, 5728) distribution). To further alleviate any doubts, in the following we study this issue on an artificially generated dataset.

The Artificial Dataset
To study the properties of G=MAT estimates, we need a dataset that is similar to a real one yet for which we know the true value of A. To construct such a dataset, we first studied the statistical properties of the Spellman dataset considered in the previous sections. We then randomly generated matrices M, A and T to obtain a dataset with similar properties.
We generated a dataset with 100 transcription factors, 100 motifs and 1000 genes.
The motif matrix M. The matrix M is a binary matrix of motif occurrences. As a natural simplification, we regard each column of this matrix as a set of i.i.d. realizations of Bernoulli-distributed random variables. That is: where p is the probability of motif m presence in a randomly chosen gene. By analyzing the Spellman dataset we see that the average absolute correlation of the columns of the matrix M is less than 0.02. We can therefore safely assume the columns to be independent. Finally, we observe that the distribution of probabilities p can be reasonably well described by a Beta(1, 4) distribution, see Figure 1. Therefore, we generate the artificial matrix M as follows: p ← Beta(1, 4) independently for each column , The TF expression matrix T. The matrix T of the Spellman dataset is more difficult to model than the motif matrix. Visual inspection shows no clear column or row-based structure. The mean absolute correlation of matrix rows is 0.16, and of the columns: 0.14. We chose to model T as a set of independent columns. Even though it does not correspond exactly to what we observe on the Spellman dataset, the assumption of independent microarray experiments is quite plausible.
The distribution of values in each column of T can be very well described by normal distribution (see Figure 2). Examination of the means and standard deviations of different T columns shows that these can also be reasonably well described by normal distributions (Figure 3). Therefore, we generate the artificial matrix T as follows: m j ← N(0, 0.08 2 ) independently for each column j, σ j ← N(0.35, 0.09 2 ) independently for each column j, T kj ← N(m j , σ 2 j ) independently for each row k.
The parameter matrix A. We generate matrix A as follows: first set all values to zero, then choose 3% of the coefficients randomly and set their values to 0.05. The reason for the choice of such a strategy is the following. Firstly, we believe the true A to contain a very small number of nonzero elements. Secondly, we wish the distribution of values in T to be reasonably similar to the distribution of values in G in terms of mean and variance. Experiments showed that this can be achieved by setting the nonzero values of A to 0.05.
The gene expression matrix G. The matrix G is generated according to the model G = MAT. In some experiments we also add noise to it, but we discuss this further in the text.

Prediction versus Attribute Selection
As we saw in Section 2.2, a G=MAT model can discover relevant parameters without achieving a satisfactory predictive performance. In this section, we explain this phenomenon by demonstrating both experimentally and theoretically how and why this can happen. We start with an artificial dataset introduced in the previous section and consider two factors that influence model prediction error: noise and incomplete data. We show that although these factors can significantly decrease the model's predictive ability, they do not prevent the model from discovering relevant parameters. In the following we refer to this capability as attribute selection.

Experimental Setup
Let G a , M a , A a and T a denote the matrices of the artificial dataset, generated as described Section 2.3. In the experiments that follow, we introduce certain modifications to these matrices (such as add noise to G a or drop rows from T a ), estimate the parameter matrixÂ using different methods and examine the performance of the estimated models in prediction and attribute selection. We measure this performance as follows.
Predictive performance. To measure predictive performance we simply consider the mean squared error of the model: Attribute selection performance. To measure how well the estimated matrix can be used for attribute selection we use the ROC area under curve (ROC AUC) statistic. We consider those model parameters for which the true value α a k was nonzero (in other words, parameters with values 0.05), as positives, and all the rest as negatives. We then sort the parameters according to their estimated valuesα k and evaluate the ROC AUC of this sorted list. If the positives all turn out to be on top of the list (i.e., their values estimated as the largest), the value of ROC AUC will be 1. This means the model is perfect for attribute selection. The ROC AUC score of 0.5 indicates a model that is not better than  random in guessing the relevant parameters. Finally, the ROC AUC score is 0 if all the positives end up on the bottom of the list. This corresponds to a model that is good for attribute selection, but somewhy orders attributes in reverse.

Influence of Noise
Consider the following modification to the matrix G a of the artificial dataset: where ε is Gaussian noise with mean 0 and variance σ 2 ε . This corresponds to incorporating into the data a multitude of independent factors that cannot be accounted for by the model parameters: external influences, nonlinear expression regulation rules, measurement errors, etc. Let us examine how the prediction and attribute selection performance depends on σ 2 ε . For that we created 11 artificial datasets, differing only in the variance of the noise σ 2 ε , and applied the least squares method and the ridge regression method with parameters λ M = λ T = 100 to these datasets.
Effect on prediction error. When the data is noise-free, i.e., σ 2 ε = 0, the least squares method will restore A a precisely and the prediction error will be zero. When σ 2 ε > 0, the least squares estimate is not able to account for most of the noise and the predictive error is proportional to σ 2 ε . Other methods, such as ridge regression, behave similarly. Their prediction error, by definition, is always greater or equal than that of the least squares estimate. Experiment results are presented in Figure 4 (left).
Effect on attribute selection. Figure 4 (right) shows how ROC AUC score depends on the noise. As we see, the ROC AUC of the least squares estimate deteriorates quite quickly with the introduction of noise. This is a natural result of overfitting that takes place here due to the improperly large number of parameters of the model. If the number of TFs n T was less than the number of arrays n A , the ROC AUC of the least squares estimate would deteriorate much slower. We illustrate this by considering an artificial dataset with only 50 transcription factors instead of 100, the results are presented in Figure 5. Note that in both cases the ridge regression estimate avoids overfitting and has a high ROC AUC score despite the noise.

Influence of Incomplete Data
Another important factor that can influence the predictive performance of the model in practice is the incompleteness of data. Let us now generate the noisefree version of the artificial dataset (G a , M a , A a , T a ) and then drop the first n rows of T a and the first n columns of M a . In some sense, this is similar to adding noise to G a , but conceptually this is somewhat different. This time we say that our model is true, we just do not have enough data.
Effect on prediction error. When n = 0 the prediction error is 0 and when n = 100 the prediction error equals G a 2 , increasing rather uniformly in between, see Figure 6 (left).
Effect on attribute selection. Figure 6 shows that when we remove some TFs and motifs from the dataset, thus fitting a model with a smaller number of parameters than needed to explain the data completely, the ROC AUC score still stays satisfactorily high.

Theoretical Justification
In addition to the obvious experimental evidence, we provide an elegant theoretical justification, demonstrating how the model can estimate relevant parameters from incomplete data without being able to predict well. Let us return once more to the example with the Spellman dataset from Section 2.2.
Motifs are the bottleneck. The major bottleneck for model performance on the Spellman dataset was actually the small number of motifs. Indeed, if the number of motifs n M were greater or equal to the number of genes n G , model error would necessarily be 0. In our case, however, the number of motifs n M = 36 is significantly smaller than the number of genes n G = 5766. It follows from basic linear algebra, that an arbitrary vector of dimension 5766 can not in general be represented as a linear combination of just 36 base vectors. Therefore, also the G=MAT model error must be high.
To illustrate this more formally, we consider the following model: where C is an n M × n A matrix of parameters. First, note that this model is better than the G=MAT model in terms of prediction.
Theorem 5 Let G, M, T be G=MAT matrices. LetĈ LS be the least squares fit for the model (52) and let  Figure 4, there is no overfitting here and the least squares estimate can tolerate the noise well. Nevertheless, the ridge regression estimate is still more stable. Proof. LetĈ * =Â LS T. The error of the G=MC model for such value of C is then equal to By definition,Ĉ LS is the matrix of model parameters for which the model error is minimal and therefore Next, note that in the G=MC model (52) each column of G is represented as a linear combination of the columns of M, with coefficients given in the corresponding column of C. As we have noted above, this is rather improbable that there exists an exact representation of the 5766-dimensional column of G in terms of just 36 columns of M and therefore it is not surprising that with high probability there does not exist a C for which the error of the G=MC model would be low on the Spellman dataset. As it follows from the above theorem, the error of the G=MAT model must be at least as high as that of the G=MC model. Evaluation on data shows that, in fact, the error of the G=MC model on the Spellman dataset is exactly equal to the error of the G=MAT model. This explains why the bottleneck lies precisely in the low number of motifs: it is just not possible to predict better using a linear model with so few motifs.
Introducing latent motifs. We could improve the predictive performance of the model if we added additional motifs m nM+1 , m nM+2 , m nM+3 , . . . , m nG to the data. Suppose that these motifs do exist and let the unknown motif occurrence matrix for these new motifs be M new . The full motif matrix M full is then the concatenation of the matrices M and M new : We also incorporate new rows into the parameter matrix to account for the new motifs. The full parameter matrix A full is therefore: The G=MAT model in the presence of the new motifs now takes the following form:Ĝ where B is the additional matrix of parameters that needs to be estimated. It is natural to estimate the parameters (A, B) of the model (53) using the least squares method with a penalty on B.
Definition 1 Define the least squares fit for the parameter matrices (Â LS ,B LS ) of the model (53) as follows: where λ > 0 is the penalty coefficient.
Quite surprisingly, the solutionÂ LS to (54) is exactly equal to the least squares solution of G=MAT (21).
Theorem 6 The solution to the problem (54) can be computed as follows: where K and L are any n M × n T matrices.
Proof. The proof is similar to the proof of Theorem 2. The objective function in problem (54) is a convex quadratic function and to find its minimum we search for the points where the gradient is zero: We start with equation (57) and solve as in Theorem 2: Next, we proceed with equation (58) similarly: Consider the derivative of −(BT) ij with respect to B ικ : Substituting it into (60) gives j −2(G ιj − (MAT) ιj − (BT) ιj )T κj + 2λB ικ = 0, for any ι, κ , which can be rearranged to the matrix form: Now substitute (62) into (59): and thus, because λ = 0: As a result, we can transform equation (59) to the familiar form: Finally, substituting (63) into (59) we get: The matrix (TT T +λI) is always invertible for λ > 0, and therefore, the solution B LS is:B The result of Theorem 6 means that by introducing a number of unknown (latent) motifs into G=MAT, we still keep the value and interpretation of the parameter matrixÂ LS at the same time elegantly getting rid of the "motif bottleneck". The predictive error of the new model (53) is significantly lower. For example, on the Spellman dataset the G=MAT+BT model has a mean squared error of 0.
It should be noted, that there exist approaches, typically known as Network Component Analysis (NCA) methods, which attempt to produce meaningful fits to latent features [KYB + 04, ND06]. However, such fits are only possible when certain additional information is available, such as ChIP-chip binding data. In addition, there is no obvious way to link latent features to actual motifs, i.e. to split the estimate of B to a product M new A new .
To summarize, we have presented both experimental and theoretical evidence for the possibility of our model to perform well for attribute selection despite the very high mean squared error.

Comparison of Methods
Finally, we use the artificial dataset to compare the attribute selection performance of the different G=MAT estimation methods.

On the choice of the λ Parameter
A number of methods require setting the regularization parameter λ. To perform a fair comparison, we evaluate the performance of these methods for several values of λ on a single dataset.
Test dataset. We constructed the test dataset as follows. First, we generated a noise-free dataset (G a , M a , A a , T a ) as described in Section 2.3. Let σ 2 = D(G a ) be the variance of G a . We added zero-mean Gaussian noise to G a with variance 4σ 2 . Finally, we dropped 80 rows of T a and 80 columns of M a thus leaving only 20 TFs and 20 motifs in the dataset. We believe that such a noisy setup with a significant lack of information might correspond closely to the real biological situation.
Methods. In the experiment we compared the following G=MAT methods: regularized least squares, ridge regression with λ M = λ T = λ, sparse regression (implemented using iterative thresholding), p-value-based attribute selector for ridge regression, z-score-based attribute selector for ridge regression and positivity-based attribute selector for ridge regression. For each method and for each value of the regularization parameter λ we computed the ROC AUC score as described above.  Results. The results are depicted in Figure 7. We can clearly see that there was no significant difference in performance among all methods on this dataset -for all methods λ could be rather freely chosen within the range 0 . . . 10 with little effect on performance: the relatively small number of model parameters prevents overfitting and hence there is no real need for regularization.
Choice of λ on real data. It must be noted that in the experiments on real data (the Spellman dataset) the choice of regularization parameters within a wide range does not seem to have a significant effect on the ordering of the parameters either. For example, the set of top ten parameters computed using ridge regression with λ M = λ T = 0.01 is exactly the same as the set of top ten parameters computed using λ M = λ T = 1, and has 5 common elements with the set of top ten parameters computed using λ M = λ T = 100.

Comparison of Performance
In the previous section, we compared the parameterized algorithms on a single dataset and discovered that most of them perform well when λ = 10. Here, we fix this value of λ for all parameterized methods. This way we get rid of the parameterization issue, and can perform a fair comparison among all methods.
Test setup. We test each method 100 times by generating a new random dataset on every iteration. Each dataset is generated as described in the previous section. The ROC AUC scores of all runs are averaged to produce the final score.
Methods. We add the following parameter-free methods into the comparison: least squares regression, correlation-based estimate, p-value, z-score and positivity based attribute selectors for least squares, and the LARS algorithm. When assessing the output of LARS, instead of regarding actual estimated parameter values, we score parameters by the order in which they are introduced into the model during LARS iterations. This parameter ordering is then used in the ROC computation.
Results. The results are presented in Figure 8. In general, all methods perform quite well, most of them achieving ROC AUC score higher than 0.9. The best performance (0.983) is achieved by the correlation-based estimate. Although it differs only slightly from the second best result (LARS, 0.965), the difference is statistically significant. In 76 iterations out of 100 the correlationbased estimate was better than the one based on LARS. The results also clearly demonstrate how the p-value and z-score-based randomization techniques improve the performance of the base methods.
It it notable, that the sparse regression estimate based on iterative thresholding is outperformed by the more conventional linear regression techniques. This might be in part due to the abundant gaussian noise in the test dataset, which misleads the 1 penalty stronger than the more accomodating 2 -regularization. Note that the high performance of the LARS method can be explained by the fact that in order to apply LARS the data must be centered. As we will see in the following, centering significantly improves the performance of all G=MAT estimation methods. In the following we avoid the use of sparse estimates because they require extra computational power but do not offer significant benefits.
The Effect of Centering. The correlation-based estimate, despite its good performance on our test dataset, can be computationally expensive. In Theorem 4 we have demonstrated that an approximation can be obtained by applying the least squares method to the properly centered data. Here, we demonstrate that this is indeed the case. Figure 9 introduces the centered least squares and centered ridge regression into comparison as well as their p-value and zscore randomizations. We see that the centered versions of the least squares and ridge regression perform even better than the correlation-based approach, Mean ROC AUC score thus beating all other methods. It is worth noting that the p-value and z-score randomizations could not further boost their performance.