A MINE Alternative to D-Optimal Designs for the Linear Model

Doing large-scale genomics experiments can be expensive, and so experimenters want to get the most information out of each experiment. To this end the Maximally Informative Next Experiment (MINE) criterion for experimental design was developed. Here we explore this idea in a simplified context, the linear model. Four variations of the MINE method for the linear model were created: MINE-like, MINE, MINE with random orthonormal basis, and MINE with random rotation. Each method varies in how it maximizes the MINE criterion. Theorem 1 establishes sufficient conditions for the maximization of the MINE criterion under the linear model. Theorem 2 establishes when the MINE criterion is equivalent to the classic design criterion of D-optimality. By simulation under the linear model, we establish that the MINE with random orthonormal basis and MINE with random rotation are faster to discover the true linear relation with regression coefficients and observations when . We also establish in simulations with , , and 1000 replicates that these two variations of MINE also display a lower false positive rate than the MINE-like method and additionally, for a majority of the experiments, for the MINE method.


Introduction
The Problem: The researcher wishes to carry out model-guided discovery about a system from a sequence of n experiments. The challenge is that each of the n experiments performed is very expensive, and so at each stage (nz1) it is desirable to design the next experiment to be maximally informative. The approach in which n experiments are to be done sequentially in such a way as to capture the most information at each stage n about the underlying model is called utilizing the Maximally Informative Next Experiment (MINE) [1]. The method has been shown to be consistent for one version of MINE [2]. To understand MINE we will consider the linear model, Y~X bz", where Y is a n|1 vector of dependent measurements, X is a n|p matrix of p independent variables, each with n measurements, b is a p|1 parameter vector, and " is a n|1 vector of independently and identically distributed normal N(0,s 2 ) errors.
The problem has four features. First, there are many parameters and limited data (nvvp) so there will be many more unknown parameters than data. In this setting a large sample of variables (p) is to be observed as it is not known in advance which ones are relevant. In fact, typically n*100 while p*10 3 {10 6 . Second, the X matrix is partitioned into two parts, X~(X 0 ,X 00 ), where X 0 is an n|p 0 matrix of independent variables that the experimentalist can control and X 00 is an n|p 00 matrix of independent variables that cannot be controlled under the conditions p~p 0 zp 00 [3]. The X 0 matrix will be referred to as the design matrix. For simplicity, we will assume the entire X matrix is made up of X 0 . Third, the next experiment encompasses stages nz1, . . . ,nzd, where d is the dimension of the experiment. Experiments constitute batches of d observations. The fourth and final feature of the experiment is that each experiment of d observations is very costly, be it time, materials and/or subjects, or financially such as 250,000 per experiment [4]. So at each stage n in the overall study, there is a high premium on choosing the best next experiment. The problem is to discover with reasonably high probability the model b in as few steps (n) as possible. We call this a problem in model discovery because what we wish to know is what linear relation can be discovered from the many variables (p) measured over the time course of the study. Again the number of variables measured is large because it is not known in advance which ones are relevant. The process of discovering the model is cyclical as shown in Figure 1.
This problem makes points of contact with several distinguished problems in statistics and engineering. There are problems in experimental design [5,6,7,8,9] leading to model refinement with n.p, particularly for sequential designs [10]. There are problems in control, as addressed by, for example, with response surfaces [11]. The problem of model guided discovery, we will show, is distinct from all of these because it is focused on discovery (in this case the unknown linear relation).
As an example problem for use in model-guided discovery, suppose the researcher wishes to understand human longevity [12]. The researcher may examine the characteristics of US centenarians. Several thousand variables are measured including genetics, diet, and lifestyle on each centenarian because it is unknown which variable or variables have an effect on longevity. Some of these variables can be controlled, such as diet and lifestyle. Others, like the genes carried by the centenarian, cannot be controlled. In model refinement, the goal is to select a design, X 0 (diet and lifestyle), to reduce the error in the parameters b by consideration of, for example, X 0 T X [7] and its determinant (i.e. D-optimality). In process control with the aid of response surfaces, the goal might be to select a design X 0 to maximize the expected longevity E(y i ) (where E() denotes expectation and y i denotes the longevity of the i th individual in the study) by manipulating diet and lifestyle. Such an engineering approach to extending lifespan has been implemented in the nematode [13]. In model-guided discovery the goal is simply to choose a design X 0 at each stage to discover the factors that determine longevity with as few centenarians (n) as possible and using limited data, to discover potentially many important variables. Another example of this framework in systems biology can be seen in the description of genetic networks at a steady state or system in equilibrium [14]. In this setup a genetic network is approximated to first order by the following linear system: Here the column vector x describes the concentration of mRNAs of genes in a network, and the y vector describes external perturbations. The A matrix captures the network relationships among the genes, and dx dt is the derivative with respect to time.
A steady state is assumed so that the dynamical system reduces to: y~Ax ð2Þ The problem is to infer the network A. An experiment entails measuring all mRNAs under a particular perturbation y, so several perturbations are tested. This setup reduces to a linear regression problem. Such design problems have been considered for nonlinear genetic network models as well [4,8,15,16], but we will not focus on these here.

Model Estimation by the Ensemble Method2
A standard approach to estimating the regression coefficients b is the least squares method. This approach reduces to solving the normal equations below for the least squares estimates of the parametersb b: The challenge in our problem is that X T X will not often be of full rank because of collinearity in the independent variables and because there are so few data points relative to the number of parameters (nvvp). While the normal equations in (3) could be solved by use of a generalized inverse, there are likely to be many solutions that are equally consistent with the data and not one best least squares estimateb b of the parameters in the model. The key is to not find one estimate, but rather an ensemble of estimates consistent with the data Y .
To address this problem the likelihood is consulted at each stage: Since there are so many independent variables with so little data, this surface resembles a golf course with its varied terrain of many hills and sandpits than a mountain or mountain range. However, we can reconstruct the entire likelihood function by Markov Chain Monte Carlo methods (MCMC). By integrating over a standardized L(b) with respect to b and using a particular prior distribution, we can make predictions about the behavior of the system even in the presence of such limited data [17]. So instead of finding one best parameter b to represent the system, instead we construct L(bjY ) or alternatively, the entire posterior distribution with a different prior distribution. These are special cases of the ensemble method, in which some figure of merit is used to select a distribution of models fitting experimental data [17], and as a special case the reconstruction of the standardized L(bjY ) is referred to as the ensemble. In this paper the standardized likelihood can be calculated exactly when the variance (s 2 ) is known, the case to be used here, and with a Gaussian conjugate prior on b from Eqn (4) (see ref [18]): The posterior mean m n is described as: The least squares estimateb b enters into the calculation of the posterior mean m n . In the program description section below, we replace X T X b in (6) with X T Y from (3).
Here the precision matrix which determines the prior distribution in (5) is L 0 and will be taken as b Ã I where b is a positive constant and I is the p|p identity matrix. We also refer to this (5) as an example of the ensemble p(bjY ). In the past we have used a uniform prior over a finite interval [17], but the Gaussian prior distribution insures that the integration can be done along all components of the parameter vector (b) over the whole parameter space and can approach that of a noninformative prior distribution by letting diagonal elements of this matrix become small.
Normally the moments of the ensemble would be calculated by MCMC methods [17], but from (5) we can obtain the moments of b directly and for example the linear model with known error variance.
These moments of the ensemble can be updated as each observation is added.

Maximally Informative Next Experiment
At each stage n, we choose the next experiment X Ã by reference to the ensemble p(bjY ) in (5) to infer the unknown but true regression parameters b 0 . The new design matrix consists of d rows, and after completion of the next experiment is used to augment X n to X nzd orX X . The design of the new experiment is captured in X Ã and the augmented/updated design for all experiments inX X . For each member of the ensemble (b) we make a prediction vector about the d outcomesŶ Y Ã of the next experiment, namelyŶ Y Ã~X Ã b, where b is drawn from the ensemble of models in (5),Ŷ Y Ã is the vector of d predictions for the next experiment, and X Ã is the design of the next experiment. We choose the new experiment X Ã such that we have maximum discrimination between the alternatives in the ensemble. If two random models of the ensemble should have correlated predicted responsesŶ Y Ã for experiment X Ã , the choice of design would be poor as this would not reveal as much information as when two random members should have uncorrelated responses.
One MINE criterion for choice of X Ã was developed by use of a microscope analogy [4]. The object in the microscope is b 0 . The image under the microscopeŶ Y Ã is mapped onto the object b 0 in the field of the microscope, but the mapping is fuzzy and imperfect with their being uncertainty inŶ Y Ã . Let v 0 be a volume in the object space (i.e., the parameter space R p ) under the light microscope where p is the number of parameters, and let v D be the ''image difference volume'' swept out that is viewed. For a microscope, the connection between the two volumes is the model of physical optics. In our context here, the connection is the model The image difference volume is swept out by varying b and b 0 in (8).
The image difference volume depends explicitly on v 0 and how we ''twiddle the dials'' on the microscope though X Ã . The maximally informative next experiment (MINE) criterion is based on this idea that the more volume in v D (v 0 ,X Ã ), the more detail discerned in v 0 . This is achieved by adjusting the data captured in X Ã .
In order to take advantage of this MINE criterion, we must make a selection of the object volume v 0 . The choice is improvised but driven by computational practicality [4]; other choices are possible [19]. We elect to define a ''representative volume'' v D swept out by DŶ Y (b,b 0 ,X Ã ) when b and b 0 are drawn randomly as ''typical'' or average values from the ensemble pair distribution p(b,b 0 jY )~p(bjY )|p(b 0 jY ), the components given by (5). The volume is constructed from the variance -covariance ellipsoid of the image difference volume DŶ Y (b,b 0 ,X Ã ) and is dependent on the choice of experiment X Ã . We define the ensemble distribution of DŶ Y (b,b 0 ,X Ã ) as: The quantity W is any point in the DŶ Y (b,b 0 ,X Ã ) volume and d(:::) is the Dirac Delta Function. The ensemble distribution in (9) specifies an effective difference volume v D (v 0 ,X Ã ) in the image difference space DŶ Y (b,b 0 ,X Ã ) by way of the characteristic ellipsoid of this space specified by: The variance -covariance ellipsoid is centered at the origin, The variance -covariance ellipsoid has the D-matrix eigenvalues and directions of the halfaxes given by the D-matrix eigenvectors. From (10) we can write the covariance ellipsoid in terms of the moments of the ensemble: Normally these moments could be computed by MCMC methods [17], but because of the explicit form in (5) we can evaluate (11) directly from (5) as: Table 1. Parameters for simulations.
The matrix C is defined to be B {1 . A Hilbert Space (HS, i.e., a complete inner product space) formalism is introduced to give a compact form to the MINE criterion. The HS of functions consist of functions defined on the model parameter space, fb : b"R p g, for which the covariance is the HS inner product. This inner product is formally defined as: The components of the observation vector,Ŷ Y Ã i , are represented by: We can write the covariances in terms of the inner product: The ensemble standard deviation of the predictionŶ Y Ã i is equivalent to the HS vector norm or length denoted by jjŶ Y Ã i jj. The norm is defined by jjgjj :~(gjg) 1 2 If the predictionsŶ Y Ã 1 , . . . ,Ŷ Y Ã d are linearly dependent, then the HS prism is defined by the predictions collapses to a lower dimensional one, and the determinant det (D) vanishes. If the predictions are not linearly dependent, then predictions determine an HS prism whose volume is simply given by the product of their vector lengths, namely det (D)~(jjŶ Y Ã 1 jj:::jjŶ Y Ã n jj) 2 . In general the predictions are correlated, and we have the Hadamard Inequality: The ratio det (D) (jjŶ Y Ã 1 jj:::jjŶ Y Ã n jj) 2 can be thought of as a composite measure of the dependence of the predictions and is a function only of the HS angles between predictions.
We are now in a position to introduce a MINE criterion first by introducing the normalized predictions.
The normalized covariance matrix or correlation matrix denoted by R is defined by: This is the correlation matrix among the predictions. We propose the following MINE design criterion V (X Ã ): This criterion is the squared volume of a prism spanned by the normalized predictionsẐ Z Ã 1 , . . . ,Ẑ Z Ã n . Such a criterion is advantageous when the predictions are almost but not actually/completely linearly dependent. This is a situation that has been encountered in practice [4]. This MINE criterion from (12) only depends on the ensemble through its variance-covariance matrix and not its mean in (6). The MINE criterion is also scale-free [4,20]. It clearly differs from the usual model refinement criterion based on X ÃT X Ã or X T X .
In practice the MINE criterion will behave better than X ÃT X Ã for n*100 and p*1000 because its calculation through inverting B is stabilized by L 0 in (12), as in Ridge Regression [21] and will potentially incorporate the data from prior experiments in (12) through the B matrix. Its form also lends itself to optimization for large problems as will be shown under Theorem 1 below and under simulation results later (nvvp).

Maximizing the MINE Criterion
Ideally we would have a necessary and sufficient condition for maximizing the MINE criterion. Here in Theorem 1 we only present a sufficient condition for maximizing the MINE because the necessary condition has not been found. (19), where d ik is the Kroneker delta. From (13) the inner product can be used to represent the covariances as . We now need to introduce two more equivalencies: The fact that any positive semi-definite symmetric matrix, such as C, has a square root gives us the liberty to create such a w i . Since X Ã i CX ÃT k~d ik jjŶ Y Ã i jj 2 , this leads to w i w T kd ik jjŶ Y Ã i jj 2 , which implies any orthonormal basis can be used for w 1 , . . . ,w d .
In particular, the vectors X Ã 1 , . . . ,X Ã n can be selected as the eigenvectors of C once standardized by C {1=2 . One efficient route for maximizing the MINE criterion is then simply to compute the eigenvectors and eigenvalues of C or equivalently, to maximize the parallelpiped whose volume is det (R) [4] and then to normalize them by C in w i~X The choice of normalization still needs to be examined as a model-guided discovery tool. See simulation results for examination of three choices of different orthonormal bases, a (1) normalized eigenvector basis; (2) random basis; (3) normalized eigenvector basis with random rotation.

Model Refinement
A traditional approach to choosing the design X Ã (in contrast to MINE) is to choose X Ã to maximize some simple function of the variance-covariance matrix of the parameter estimates (b) such as the determinant, to create a D-optimal design [6,8]. Consider then the augmented design matrixX X which is not only a function of the current design X but includes the possible design of the new experiment X Ã . This means: Under ordinary model refinement, we wish to minimize some simple function of the variance-covariance matrix of b, such as: Table 2. The effects of varying the noise level s on the power to detect 7 out 10 of the truly nonzero regression coefficients and the false positive rate for the MINE with random rotation.
This can be written explicitly in terms of the new experiment with the identity:X The model refinement criterion is then to maximize det (A) where: The derivative of det (A) with respect to each component of X Ã can be computed from: Here the + is the gradient with respect to X Ã , tr() refers to the trace and Adj() denotes the Adjoint. A necessary condition for maximizing the det (A) is for + det (A)~0. This max determinant (max det) problem is closely related to solutions to an affine formulation of this max det problem, and the problem is most closely related to the analytic centering problem [22]. These authors cast the search for D-optimality in design as a convex optimization problem with the max det problem linear in X Ã and with linear inequality constraints [22]. The linearity in X Ã is achieved by constructing X Ã from a set of rows (or designs) that are known in advance. The optimization problem is then reduced to determining how often each row (design) is used. Here we do not know the rows in advance.

MINE can produce a D-optimal Design
Kiefer and Wolfowitz [23] established that D-optimal designs are equivalent to mini-max designs, which minimize the maximum of the expected loss associated with each possible design. It is natural to ask whether or not there is any such relation between Doptimal designs and MINE. While the model refinement procedure appears to start from an entirely different criterion than MINE, it is possible to establish a relation between these different kinds of optimal designs by imposing the same constraints on the respective optimization problems. When we do this, we can establish: Theorem 2 (Equivalence Theorem of D-optimality and MINE): The MINE procedure in Theorem 1 is D-optimal in the sense that X Ã maximizes det (A) subject to the constraint X Ã j CX ÃT i~1 where A~X T X zL 0 zX ÃT X Ã . Proof: In order for MINE and a D-optimal solution to be directly comparable they need to be maximized subject to the same constraints on X Ã . So we maximize det (A) subject to the following constraint from (12): The constraint insures the columns of X Ã are an orthonormal basis.
We can introduce a related criterionG G(X Ã ) as: Note that det (A)~G G(X Ã )= det (B). So maximizing det (A) is the same as maximizingG G(X Ã ). The maximization problem in (25) is equivalent to: The constant d is the number of observations in the new experiment. We can think of the original optimization problem as equivalent to determining the best set of normalized vectors w j .
From (26), the constraints imply that tr(W T W )~d where d is the dimension of the next experiment (i.e., the number of observations in the next experiment). We also have the trace being the sum of the eigenvalues of W T W .
Constraint (27) implies a constraint on the eigenvalues in (27), but not the converse. To finish the proof we will first maximize G(X Ã ) subject only to (27) reminiscent of [22].
We will then show the solution of this max det problem can also be chosen to satisfy all of the constraints in (26).
Since dvp, we can choose at least p{d orthonormal vectors u v such that: We choose these u v vectors also to be orthogonal to w 1 , . . . ,w d . We will call this subspace of the parameter space as the unexplored subspace. Note that while rank(W T W )ƒd, but dimension (W T W )~p (that is, pwd here). This implies that p{d eigenvalues are zero. (This implies that for p{d eigenvalues, say for v~p{dz1, . . . ,p, are zero. As an example, if p~1000 and d~10, then the last 990 of the eigenvalues are zero. We have that: These degenerate eigenvalues in (30) are associated with the unexplored subspace. This fact along with the determinant being the product of the eigenvalues implies from (26): Now maximize G(X Ã ) with respect to l 1 , . . . ,l d only subject to constraint [27] using the method of Lagrange multipliers (with multiplier W). We find that: These two imply that l v~1 for v~1, . . . ,d on the explored subspace of the parameter subspace.
The result of maximizing with respect to the eigenvalues is that W T W is diagonalized with only the first d diagonal elements being 1. The maximum value of G(X Ã ) is 2 d from (31).
From here, if we choose w 1 , . . . ,w d to be an orthonormal set such that w j w T k~d jk , then we have W T Ww j~wj for all j~1, . . . ,d. Thus w j is an eigenvector of W T W with eigenvalue l i~1 for j~1, . . . ,d. All constraints in (26) are satisfied for the solution to the max det problem with (27).

Choice of prior distribution
The choice for the prior mean vector is reasonably taken as zero since most of the independent variables are not expected to have an effect on the dependent variable y. The only question is the choice of b specifying the precision matrix in B~X T X zL 0 ð Þ where L 0~b I (and specifies the prior). Dumouchel and Jones [24] provide one argument to select b with an idea to making the design robust to violations of linear model assumptions. We will suggest another approach.
Let X T X have eigenvalues l i with corresponding orthonormal eigenvectors u i . Then we can write X T X and B as: The eigenvalues of B are l i zb and have the same eigenvectors as X T X . We can now introduce a new variable: We can loosely think of the uncertainty or standard deviation of the b-vectors (across the ensemble) in the u i direction as: In the absence of any experimental data (l i~0 ), the uncertainty in the u i direction should reduce to: We would expect the uncertainty without experimental constraints (of data) to exceed the uncertainties with data or that: The maximum eigenvalue of X T X provides an upper bound on b. This one would be satisfied, for example, if b were chosen to be equivalent to the weight of one observation in X 0 X . The matrix X 0 X would quickly dominate. The next constraint is more stringent, and so it is not necessary to check that (40) is satisfied.
Another constraint on b arises from the requirement that the true regression coefficients not be too far out in the tails of the prior distribution; otherwise the data through X T X will never find the true regression coefficients. We can think of the prior distribution as equivalent to a fishing-net. We want this net to be well cast to catch the fish.
Introduce b max~m ax (jtrueb k j)~max (jE(b k )j) where the expectation is taken over the ensemble. Then the b-value should be chosen so that This is equivalent to requiring: So with the prior data, X T X , and some idea of the magnitude of the regression coefficients, there are constraints on the prior distribution as specified by the precision matrix and hence b. These constraints are satisfied in the simulations to follow. As an example, if the largest magnitude of a regression coefficient were 50, then bvv 1 50 2 or bvv0:0004. Since we set all variables and know the largest regression coefficient's magnitude is 50, we set b~0:0001. This is a tighter constraint than the first. We also do not want b to be too small to allow C in (12) to still be inverted as in ridge regression.

Methods for simulating four versions of MINE under the linear model
The program MINE to simulate the above procedures is written in Java under version 1.6 and utilizes the version 5 of the Jama library [25]. Details of the input and output of this software are already reported [26]. The program is available in sourceforge.net under the name linearminesimulations. There are four variants on the MINE method described below in this section and summarized in Figure 2.
Following the initialization of many variables (including but not limited to the matrices and vectors to store the X matrix, Y matrix and e, the number of variables (p) and the significance matrix) and the set up of data from the input file, the main part of the simulation program begins. Since the first experiment has no data on which to base a design, this pilot experiment is randomly generated. The simulation program is designed to handle a variety of data. If the number of variables is over 50, then the first experiment is selected to have 10 observations; otherwise, the first experiment has between five and nine observations. The number of pilot observations varies between 0 and 10 as determined by the number of variables. Each value in the pilot is created by generating a random number between 0 and 10 and then dividing it by the p value. The set is not then normalized nor orthogonal. After the pilot experiment is generated, the random number generator is changed depending on which method is used in order to allow for the same pilots but the rest of the numbers generated being different. From here the first set of the dependent Y vector is calculated along with the associated error vector ("). With the initial data generated the real calculations can begin.
Each loop (in which a single observation is added) first consists of calculating the posterior mean (6) and then calculating the significance of the regression parameters in b with the cycle exiting here if the number of loops reaches the target number of experiments. The mean or m n is calculated by (6). Although in this simulation we have the true values b 0 , we use X T Y in (6) rather than using the real value or the previous mean. The significance subroutine takes the most recently calculated mean m n and the whole X matrix. It solves for the posterior variance-covariance matrix of b with (7) with the whole X matrix and multiplies this by s 2 . The z-value of each b is calculated with each individual mean value divided by the square root of the diagonal of the above matrix. The p-value is then calculated using this z-value. These pvalues are then sorted from largest to smallest. The resulting values are checked using a Benjamini-Hochberg method [27] to decide which components of b are significant. The calculation for p-value is done using the algorithm from Press et al. [28] (on p. 221). From here, depending on the specific method in Figure 2, the new observations are calculated, and finally the new Y vector is calculated by Y~X Ã bz" and the cycle repeats in Figure 1. The error vector (") or error values are created by generating a standard normal random variable value and multiplying it by s (which was set by the input file).
MINE-like method. The simple naïve MINE-like method takes the ten eigenvectors of the C matrix in (7) associated with the ten largest C-eigenvalues using a common subroutine to generate the C matrix and simply uses these eigenvectors to generate the next X Ã . MINE Method. The MINE method simply uses the eigenvectors associated with the C matrix as above with the ten largest eigenvalues and multiplies the corresponding eigenvectors by the square root of the B matrix to obtain the next X Ã . MINE with Random Orthonormal Basis. In MINE with a random orthonormal basis a set of ten random orthonormal vectors is generated and then standardized by the square root of the B matrix (B 1=2 ). First, the ten vectors are created from using the random orthonormal set subroutine. Then each individual vector is multiplied by the square root of the B matrix to obtain the next X Ã . MINE with Random Rotation. The MINE with a random rotation method first finds all the eigenvectors in the C matrix as above but selects the set of all degenerate vectors instead of simply the ten largest. Then the method creates a random orthonormal array of Q|Q where Q is the number of degenerate eigenvectors to multiply the degenerate matrix with. This is used to rotate the degenerate eigenvector set. The first ten are then multiplied by the square root of B matrix and used to obtain the next X Ã .
The randomly generated orthonormal set used in both the MINE with a random orthonormal basis and MINE with random rotation is done by first generating a single vector of random Gaussian values. The vector is then normalized to a unit vector. This vector is used as the basis for generating more vectors generated in the same method and is made orthogonal using a modified Gram-Schmidt (MGS) algorithm [29].
To obtain the square root of the B matrix, first C in (7) is calculated. Then the square root of C is solved by using the Singular Value Decomposition (V T DV ) where the V matrix here is the eigenvectors in column form and the D matrix is a diagonal matrix with the square root of the eigenvalues on the diagonals. This is then inverted by the method in the Jama package [25] to get the square root of B matrix.
With each of the four methods the same set of 1000 components of the true b 0 were used. This b 0 only had ten components that were truly nonzero. The order of the nonzero components was not changed in the list of 1000 components. These were the first ten values of b 0 and were as follows: 11, 236, 226, 9, 33, 250, 245, 15, 3, and 17. The program was run with 1000 replicates for each of the four methods. Each replicate had a unique pilot experiment (consisting of ten observations), but these pilot experiments were the same for each method, allowing for a stronger comparison of the four methods. Each individual run had a unique random seed so that the rest of the replicate run would be unique. The error s in the linear model used was 0.01, and all of the prior mean values (m 0 ) were initialized to zero. A summary of the parameters in the simulations is given in Table 1.

Results of Simulation
There are four variations on the MINE procedure examined here and defined in the previous section. The similarities and differences in the pathways of the four methods are summarized in Figure 2. All four methods employ the same subroutines for the majority of their implementation but differ in the way each particular method chooses the next experiment or set of observations to use, as described in the program description above.
The first method is called the naïve MINE or MINE-like method because this version does not incorporate the B matrix used in Theorem 1. This simpler method only calculates the eigenvectors of C and uses the eigenvectors with the ten largest eigenvalues (according to the algorithm) to define the next experiment X Ã . The second method, MINE, takes the MINElike method and simply multiplies the chosen eigenvectors by B 1=2 . The comparison of the MINE and MINE-like method allows us to assay the importance of the standardization in the MINE procedure.
The third method, called MINE with random orthonormal basis, does not use calculated eigenvectors as suggested by Theorem 1 from, for example, matrix C. Instead, the third method creates a set of random orthonormal vectors and standardizes this basis by B 1=2 . This method allows us to examine the effect of choice of the orthonomal basis in Theorem 1 on the performance of MINE. The final method tested is called MINE with random rotation. This method combines the previous methods by taking the chosen set derived in the MINE method but rotates the set by a random orthonormal basis before standardizing using a modified Gram-Schmidt Algorithm [29]. In contrast to the MINE method selecting an orthonormal basis, which is a function of the machine precision in the calculation of the eigenvectors and eigenvalues of C, the MINE with random rotation removes this dependence on the machine precision as the ''randomizer'' and replaces the resulting choice of eigenvectors (axes) with a random spin of the axes defined by the eigenvectors of the C matrix.
There were a number of criteria considered for comparing the MINE methods. These criteria include (1) identifying the nonzero values of b 0 by using a Benjamini-Hochberg multiple test correction [27] at a 1% significance level, (2) identifying the correct sign and value for the nonzero components of b 0 , and (3) determining the number of false positives.
The first criterion discerns if the methods correctly identify the nonzero values of b 0 as being significant or successful discovery as given by the test described above. A method is considered better or more successful the fewer experiments are needed to discover the truly nonzero regression coefficients. In Figure 3 there are four graphs provided (A-D) to display this criterion, and all iterations performed are shown.
The MINE-like method seems to perform the poorest in this criterion ( Figure 3A) in that successful discovery was very late. However, being satisfied with a lower percentage of correctly included independent variables (say 7 out of 10) allows for more replicates meeting this criterion. The method began mostly (over 50% of the replicates) identifying 7 of 10 at the 87 th experiment and only at the 90 th experiment did over 90% of the replicates identify 7 of 10 of the true components of b 0 . For over 90% of the replicates to identify all nonzero values of b 0 at least 97 experiments were required.
The other three methods performed significantly better than the MINE-like method. The MINE method performed almost twice as fast in this criterion as the MINE-like method ( Figure 3B). For example, over 50% identified 7 of 10 at the 45 th experiment and over 90% at experiment 50. Only 63 experiments were required for over 90% of the replicates to identify 9 of 10 experiments. However, to get all nonzero values of b 0 required much more time. Sixty-six experiments were needed to get over 50% and 83 experiments before over 90% of the replicates considered significant.
Both the MINE with random orthonormal basis and the MINE with rotation performed almost identically (Figures 3C and 3D). Both identify 90% of the nonzero values of b 0 in over 900 replicates at the 47 th experiment. However, attempting to identify all ten nonzero components of b 0 in all samples requires much more data, similar to the MINE method. It takes more than 80 experiments for both of these methods to identify all nonzero values of b 0 in over 800 of the replicates.
The second criterion involved identifying the correct sign and value for the nonzero beta values. This is evaluated by methods described earlier. Figures 4A-4D depict an average value for the first 20 values where the first ten are the nonzero coefficients and the second ten are zero and are shown as a comparison. As previously discussed, early detection is important.
As in the previous criteria the MINE-like method performed poorly ( Figure 4A). After the pilot experiment the nonzero b 0 values have the correct sign identified and never change sign though the full run of experiments; however, the experimental b 0 values do not reflect the components of the true vector b 0 until the final experiment. Also interestingly the values increase slowly until about experiment 85 when the absolute value of each b component drastically increases towards the true value.
MINE performs similarly in pattern to the MINE-like method. MINE does just as well at sign detection, with no real variable ever offering the incorrect sign ( Figure 4B). The pattern of the MINE is less gradual than the MINE-like but features a slow growth then a sudden spike and approaches the asymptote of the true value. The MINE reaches the slope change between experiments 50-55 and so it takes the remaining 45-50 experiments to arrive at the plateau of the real value.
The other two methods also outperform the MINE-like method. Again, in this criterion the MINE with random orthonormal basis and the MINE with random rotation perform almost identically ( Figures 4C and 4D). Sign identification seems to be the easiest criterion as these two also perform flawlessly here. Unlike the previous two, these methods seem to have a very linear pattern in the values of the b 0 nonzero components.
The next criterion involves comparing the false positives of each method ( Figure 5). A false positive happens if any of the b 0 that are actually zero are considered significant. For comparison, the average number of incorrectly identified values, averaged over all simulations for each method, is shown.
Again, we see similar patterns where the MINE-like method performs poorest. Initially, it looks like it is performing adequately, since up until experiment 60 there are zero false positives. Since we have to wait until experiment 85 for any reasonable amount of success, we find that the false positives are beyond 400 and at some of the highest peaks compared to the other three methods. The only way this method could possibly be considered better is that it drops off faster at the final observed experiment but since this point is a worst case, this point should not be reached.
The MINE method has a similar pattern to the MINE-like, again performing in a similar scaled manor. At the point where information could be accepted for the nonzero values, around experiment 55, the false positive rate is around 300 which would allow for about 70% of the variables to be eliminated. If more experiments are performed, the number identified peaks just under 460 during experiments 73-84 afterwards it gradually begins dropping.
The MINE with random orthonormal basis and MINE with random rotation again display almost identical results. However, in contrast to the MINE-like and MINE methods, these two have a more linear growth of false positives, especially after the first 15 and before the last 15 experiments. Due to the data being displayed on a single graph, the similarities are more observable with the two lines eclipsing each other. During the most optimal selection periods between experiments 20-45, the false positives do not go over 225. This allows for a greater than 75% reduction in variables. These two do however peak ever so slightly higher with the average reaching just under 465 but a much later experiment and are only lower during a 20-30 experiment window. All simulation results in Figures 3-5 are summarized in an excel file generated with the program MINE under the keyword linearminesimulations at sourceforge.net.
Having determined that the MINE with random rotation and MINE with random orthonormal basis perform similarly and outperform the other two procedures, we conclude by examining the properties of the MINE with random rotation, namely its power and false positive rate, in a situation with increased noise in Table 2. Not unexpectedly with more noise it takes a larger number of experiments before the power to detect 7 out of 10 true regression coefficients are significant is large as the noise (measured by s) is increased. The false positive rate is controlled when one stays below the number of experiments necessary to obtain 7 out of 10 true regression coefficients most of the time. As an example, for a s of 0.05 and a power of 98.3% in Table 2, the false positive rate is only 0.94%.

Discussion
In 2008 a key problem in systems biology was solved as identified by Kitano [30] with a new methodology called MINE [4]. The MINE methodology is used to integrate several cycles of modeling and experiments to yield discoveries about the underlying process being studied. The result of the application of the MINE methodology was new insights into the relation of the clock to ribosome biogenesis [1,4]. This new approach to model-guided discovery has sparked a flurry of developments in MINE methodology [19,31,32,33]. It is natural to ask how this new experimental design methodology of MINE is related to classical experimental design criteria and whether or not we can validate MINE mathematically as a discovery tool when there are many parameters and sparse, noisy data (pwwn). A natural place to validate this new MINE tool is in the framework of the oldest and mostly widely used statistical model, the linear model.
One of the consequences of the work here is to establish another view of one MINE procedure. When the same constraints are imposed on MINE and the D-optimality criterion, then the MINE procedure discussed here is D-optimal under the linear model. The effect of minimizing the determinant of the correlation matrix of the predictions is equivalent to minimizing the determinant of the variance-covariance matrix of the parameter estimates as described in detail in the Equivalence Theorem 2 when the same constraints are imposed on both problems. We suspected this would be the case from the application of the MINE procedure in systems biology, where the application of the MINE procedure appeared to decrease the estimated error variance s 2 over time [4]. In the language of the microscopy analogy, maximizing the volume observed under the microscope by choice of experiment is equivalent to reducing the ellipsoid of variation in the optical field of the parameter space. It is this key relation that Marvel and Williams exploit to address Kitano's problem [19].
Having shown the MINE procedure in practice is useful for discovery [4], it is natural to ask how MINE performs in a simpler setting of the linear model. We explored its performance under four variations. In this simpler setting, where we can actually calculate the ensemble directly without resorting to using Markov Chain Monte Carlo as used in nonlinear systems [34], we can solve the associated optimization problem of MINE in Theorem 1 in a way that may suggest new approaches to MINE in nonlinear models. The result of Theorem 1 was the realization that the maximization of the MINE criterion here is defined up to an orthonormal basis of the data space. There are a variety of different bases that could be selected. Theorem 1 also calls for a standardization of the basis. This standardization does prove important as we see upwards of a 50% improvement in some criteria between the MINE-like and the MINE methods.
First, it was important to see the two more similar methods (MINE with random orthonormal basis and MINE with random rotation) performed similarly. Second, these two proved better in all of the criteria in almost all experiments. These allowed for the earliest detection, during which they provided the closest to true values on all variables, and provided the fewest false positives for a larger sample reduction. The only area at which these two methods were out performed was in the number of experiments needed for most of the simulations to identify the real values given that initial detection had begun. The MINE method once 10% of the simulations began detecting these values was able to reach 90% more quickly. Though this region was smaller for the MINE method, the other two were not only able to reach or arrive at the 10% quicker but generally complete (get over 90%) quicker.
A third consequence of this work is to open up a new convex programming problem that is closely tied to the max det problem so thoroughly analyzed by Boyd and co-workers [22]. The argument here in the max det problem is quadratic in the design parameters with linear inequality constraints potentially as opposed to an affine argument. An open question is whether or not this new problem is a convex programming problem. If so, then much of the machinery developed by Boyd and coworkers could be developed for the problem here. We have illustrated the use of the convex programming procedure in our discussions in this work.
In conclusion, we feel that the MINE discovery tool has opened up many exciting design problems that will transform the way scientists now integrate theory and experiment in a number of areas beyond systems biology [3,35,36].