Analysis and Computational Dissection of Molecular Signature Multiplicity

Molecular signatures are computational or mathematical models created to diagnose disease and other phenotypes and to predict clinical outcomes and response to treatment. It is widely recognized that molecular signatures constitute one of the most important translational and basic-science developments enabled by recent high-throughput molecular assays. A perplexing phenomenon that characterizes high-throughput data analysis is the ubiquitous multiplicity of molecular signatures. Multiplicity is a special form of data-analysis instability in which different analysis methods applied to the same data, or different samples from the same population, lead to different but apparently maximally predictive signatures. This phenomenon has far-reaching implications for biological discovery and for the development of next-generation patient diagnostics and personalized treatments. Currently the causes and interpretation of signature multiplicity are unknown, and several, often contradictory, conjectures have been made to explain it. We present a formal characterization of signature multiplicity and a new efficient algorithm that offers distribution-free theoretical guarantees for extracting the set of maximally predictive and non-redundant signatures. The new algorithm identifies exactly the set of optimal signatures in controlled experiments and yields signatures with significantly better predictivity and reproducibility than previous algorithms in human microarray gene expression datasets. Our results shed light on the causes of signature multiplicity, provide computational tools for studying it empirically, and introduce a framework for in silico bioequivalence of this important new class of diagnostic and personalized-medicine modalities.


Proof of correctness
Theorem: TIE* outputs all and only the Markov boundaries of the response variable T if the following five conditions hold: (1) The base algorithm correctly identifies a Markov boundary of T both in the original data and in every modified version of the data that results from removing a subset of variables in step 3 of the algorithm (the so-called "embedded" distribution). (2) The learning algorithm used to fit a predictive model of the response variable T from a set of predictor variables can accurately model the conditional probability distribution of T given those predictors. (3) The performance metric used to assess predictivity of the signatures is maximized only when the conditional probability of T given all other variables is accurately estimated. (4) The procedure used to estimate the predictive error of the signatures is unbiased. (5) The procedure used to compare estimates of the predictive error of the signatures has negligible error.
Proof: First we prove that condition (ii) in step 3 of TIE* (Figure 2 in the main manuscript) does not affect the output of the algorithm. This result is important because it simplifies the remainder of the proof and also justifies the computational savings that TIE* obtains by using this condition. We prove the result by contradiction. Suppose the algorithm has previously discovered a Markov boundary M_i and a subset G' ⊆ M_i is generated in step 3 such that the resulting candidate Markov boundary M_new in step 4 is not a Markov boundary in the original distribution. Since removal of G' leads to a non-Markov boundary in the original distribution, the algorithm does not generate supersets of G' in step 3, according to condition (ii).
Assume that there is a Markov boundary M_j that is not output by the TIE* algorithm because the subset of variables G'' = M_i \ M_j, a superset of G', was not generated in step 3 due to condition (ii). Since G' ⊆ G'', the Markov boundary M_j is disjoint from G' and thus present in the embedded distribution obtained after removing G'; its predictivity of the phenotype is therefore larger than that of M_new, which is not a Markov boundary in the original distribution. Consequently, M_new is not a Markov boundary in the embedded distribution that results after removing the subset of variables G'. This contradicts assumption (1) of the theorem. Therefore, condition (ii) in step 3 of TIE* does not affect the output of the algorithm, and we can proceed with the theoretical analysis as if this condition did not exist.

Next assume that TIE* returns a set X that is not a Markov boundary of T. By condition (1), X is a Markov boundary of T in some embedded distribution, so the non-redundancy property of X holds; hence, if X were a Markov blanket of T in the original distribution, it would also be a Markov boundary there. Therefore X is not a Markov blanket of T, its predictivity is smaller than that of the variable set M identified in step 1, and by conditions (2)-(5) the predictivity comparison in step 4 would reject X, which is a contradiction. Conversely, assume that there is a Markov boundary X of T that is not output by TIE*. X cannot coincide with the variable set M identified in step 1, because M is output and is a Markov boundary of T in the original distribution according to condition (1). However, in some iteration of the TIE* algorithm the set G = M \ X (and similarly any other set whose removal makes X a Markov boundary in the embedded distribution) will be generated in step 3, and M_new = X will be identified in step 4 after removing G from the data. Therefore, we have reached a contradiction and conclude that TIE* never misses a Markov boundary. (Q.E.D.)

The following lemma provides an instantiation of the TIE* algorithm that does not require learning a predictive model of the phenotypic response variable T.

Lemma: Let M_new be a Markov boundary of T in the embedded distribution obtained after removing a subset of variables G from the data. If T ⊥ G | M_new, then M_new is a Markov boundary of T in the original distribution.

Proof: Since M_new is a Markov boundary of T in the embedded distribution, T ⊥ V \ (M_new ∪ G ∪ {T}) | M_new, where V denotes the set of all variables. By the self-conditioning property, it follows from the assumption T ⊥ G | M_new that T ⊥ G ∪ M_new | M_new, and combining the two independence statements yields T ⊥ V \ (M_new ∪ {T}) | M_new. By the decomposition property this implies that M_new is a Markov blanket of T in the original distribution. Since M_new is a Markov boundary of T in the embedded distribution and a Markov blanket of T in the original distribution, it is also a Markov boundary of T in the original distribution. (Q.E.D.)

The above lemma suggests a new step 5 in the TIE* algorithm (see Figure 2 in the main manuscript): if T ⊥ G | M_new, then M_new is indeed a Markov boundary, and it is output.
A more detailed theoretical analysis of the generative TIE* algorithm and the set of concrete sound instantiations is provided in [2].
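To make the generative scheme concrete, the TIE* loop can be sketched on a toy distribution in which all Markov boundaries are known by construction. Everything below is illustrative: the base algorithm is an exhaustive search for a smallest maximally informative subset (a stand-in for HITON-PC), predictivity is measured by exact conditional entropy rather than a fitted classifier, and the variable names A, B, C are hypothetical.

```python
from itertools import combinations, product
from math import log2

# Toy joint distribution over binary variables: B duplicates A, C is noise,
# and T = A, so {A} and {B} are the two Markov boundaries of T by construction.
VARS = ("A", "B", "C")

def joint():
    for a, c in product((0, 1), repeat=2):
        yield {"A": a, "B": a, "C": c, "T": a}, 0.25

def cond_entropy_T(given):
    """Exact H(T | given) under the toy distribution (lower = more predictive)."""
    pxy, px = {}, {}
    for assign, p in joint():
        key = tuple(assign[v] for v in given)
        px[key] = px.get(key, 0.0) + p
        pxy[(key, assign["T"])] = pxy.get((key, assign["T"]), 0.0) + p
    return -sum(p * log2(p / px[key]) for (key, _), p in pxy.items() if p > 0)

def find_mb(candidates):
    """Stand-in base algorithm: a smallest subset of candidates minimizing H(T | subset)."""
    best = min(cond_entropy_T(s) for r in range(len(candidates) + 1)
               for s in combinations(candidates, r))
    for r in range(len(candidates) + 1):
        for s in combinations(candidates, r):
            if abs(cond_entropy_T(s) - best) < 1e-12:
                return frozenset(s)

def tie_star():
    """Generative TIE* loop: repeatedly remove subsets G of discovered Markov
    boundaries and re-run the base algorithm on the embedded data."""
    h_full = cond_entropy_T(VARS)  # maximal predictivity, using all variables
    outputs = {find_mb(VARS)}
    queue = [g for mb in outputs for r in range(1, len(mb) + 1)
             for g in combinations(sorted(mb), r)]
    while queue:
        g = queue.pop()
        m_new = find_mb(tuple(v for v in VARS if v not in g))
        # step-4 check: keep M_new only if maximally predictive in the original distribution
        if abs(cond_entropy_T(tuple(m_new)) - h_full) < 1e-12 and m_new not in outputs:
            outputs.add(m_new)
            queue += [h for r in range(1, len(m_new) + 1)
                      for h in combinations(sorted(m_new), r)]
    return outputs

print(sorted(sorted(mb) for mb in tie_star()))  # → [['A'], ['B']]
```

On this toy distribution the loop recovers both Markov boundaries: removing G = {A} forces the base algorithm onto the equally predictive signature {B}.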

Algorithm implementation details
In experiments with real and/or resimulated gene expression data, we used the following implementation of TIE*: to identify Markov boundaries, we used the base algorithm HITON-PC (Figure S2) with Fisher's Z-test [3,4]. The parameters α and max-k of the algorithm were optimized by holdout validation to achieve maximum predictivity of the phenotypic response variable.
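For concreteness, the flavor of Fisher's Z-test used for conditional independence testing can be sketched as follows. This is a minimal pure-Python illustration (conditioning on at most one variable via the partial-correlation recursion; the general case inverts the correlation matrix instead), not the implementation used in the experiments, and the chain X → Z → Y data below are hypothetical.

```python
import random
from math import log, sqrt, erf

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def fisher_z_pvalue(x, y, z=None):
    """Two-sided p-value of Fisher's Z-test of (conditional) independence of x and y.
    Conditioning on a single variable z uses the partial-correlation recursion."""
    if z is None:
        r, k = pearson(x, y), 0
    else:
        rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
        r = (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
        k = 1
    r = max(min(r, 0.9999999), -0.9999999)      # guard the log below
    stat = sqrt(len(x) - k - 3) * abs(0.5 * log((1 + r) / (1 - r)))
    return 2 * (1 - 0.5 * (1 + erf(stat / sqrt(2))))

# Hypothetical data from a chain X -> Z -> Y: X and Y are dependent
# marginally but independent given Z.
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(2000)]
zs = [a + random.gauss(0, 0.5) for a in xs]
ys = [c + random.gauss(0, 0.5) for c in zs]
```

Here `fisher_z_pvalue(xs, ys)` is essentially zero (marginal dependence), while `fisher_z_pvalue(xs, ys, zs)` is large, reflecting the conditional independence that lets HITON-PC discard X once Z is in the candidate set.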
To fit classification models of the phenotypic response variable using identified gene sets, we used SVM classifiers [5].
To estimate predictivity of signatures, we used the holdout validation method, whereby 2/3 of the data was used to identify the genes in the signatures and fit the classifier, and 1/3 of the data was used to estimate classification performance with the AUC metric [6,7].
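The AUC metric used on the holdout set can be computed directly from the Mann-Whitney rank identity; a minimal sketch (the labels and scores below are made up for illustration):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank identity: the
    fraction of (positive, negative) pairs ranked correctly, ties counted as 1/2."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from controls.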
To compare predictivity of signatures, we used the nonparametric method of Delong et al. [8].
Finally, to run TIE* efficiently we constrained the cardinality of the subset G in step 3 of the algorithm, thus trading off completeness for execution speed.
In experiments with simulated data, where both the generative model and all the true Markov boundaries are known, we used a similar implementation of TIE* with the following two differences. First, instead of Fisher's Z-test we used the G² test, which is suitable for the discrete distribution at hand. Second, we did not estimate predictivity to verify that a new candidate Markov boundary M_new is a Markov boundary in the original distribution; instead, we directly applied the lemma described above. This avoided potential errors in predictivity estimation and increased the effective sample size (since we did not have to split the data into training and testing sets), thus improving the overall performance of the algorithm. However, we did not adopt this strategy in the other experiments, where the sample size was typically much smaller and the high-dimensional conditional independence tests required by the lemma were unreliable.
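The G² test on discrete data amounts to a likelihood-ratio statistic over (optionally stratified) contingency tables. The sketch below is illustrative rather than the experiments' implementation, and it returns the statistic and degrees of freedom; the p-value would then come from the chi-square distribution.

```python
from math import log
from collections import Counter

def g2_test(x, y, z=None):
    """G² (likelihood-ratio) statistic and degrees of freedom for independence
    of the discrete sequences x and y, optionally within each stratum of z."""
    z = z if z is not None else [0] * len(x)
    stat, df = 0.0, 0
    for stratum in set(z):
        xs = [a for a, s in zip(x, z) if s == stratum]
        ys = [b for b, s in zip(y, z) if s == stratum]
        n = len(xs)
        nx, ny, nxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
        # observed count O vs. expected count E = nx*ny/n: add 2 * O * ln(O/E)
        stat += 2 * sum(o * log(o * n / (nx[a] * ny[b]))
                        for (a, b), o in nxy.items() if o > 0)
        df += (len(nx) - 1) * (len(ny) - 1)
    return stat, df

# A 2x2 table [[10, 20], [20, 10]] encoded as paired sequences (toy data):
x = [0] * 30 + [1] * 30
y = [0] * 10 + [1] * 20 + [0] * 20 + [1] * 10
```

For this toy table the statistic is about 6.80 on 1 degree of freedom, i.e., dependence at conventional significance levels.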

Classification methods
To build classification models of the phenotype from identified gene sets, we used the support vector machine (SVM) algorithm [5], which is known to be the "best of class" method for classification of gene expression microarray data [9][10][11]. The underlying idea of SVM classifiers is to compute a maximal-margin hyperplane separating two categories of subjects, e.g., cases and controls; subjects are classified according to the side of the hyperplane on which they fall. We used the SVM implementation in the libSVM software library [12]. For experiments with artificial data where the response variable is multicategorical, we used one-versus-rest SVM classifiers [13].
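The one-versus-rest scheme itself is independent of the base learner: one binary scorer is fit per class, and a sample is assigned to the class whose scorer is most confident. The sketch below substitutes a trivial class-centroid linear discriminant for the SVM purely to stay self-contained; the data and all function names are hypothetical.

```python
def centroid_scorer(X, y01):
    """Stand-in binary scorer (class-centroid linear discriminant); the paper
    uses SVMs here, but any margin-style scorer plugs in the same way."""
    pos = [x for x, t in zip(X, y01) if t == 1]
    neg = [x for x, t in zip(X, y01) if t == 0]
    mp = [sum(c) / len(pos) for c in zip(*pos)]
    mn = [sum(c) / len(neg) for c in zip(*neg)]
    w = [a - b for a, b in zip(mp, mn)]          # direction from neg to pos centroid
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, mp, mn))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def one_vs_rest_fit(X, y, fit_binary):
    """One-versus-rest: fit one 'this class vs. everything else' scorer per class."""
    return {c: fit_binary(X, [1 if t == c else 0 for t in y]) for c in set(y)}

def one_vs_rest_predict(models, x):
    # assign the class whose one-vs-rest scorer is most confident
    return max(models, key=lambda c: models[c](x))

# Hypothetical three-class data in two dimensions:
X = [(0, 0), (1, 0), (0, 1), (5, 0), (6, 0), (5, 1), (0, 5), (0, 6), (1, 5)]
y = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = one_vs_rest_fit(X, y, centroid_scorer)
```

With libSVM, `fit_binary` would instead train an SVM on the relabeled data and return its decision-value function.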

Metrics for assessing predictivity
For experiments with real and resimulated gene expression data, where the phenotypic response variable had two categories, we used the area under the ROC curve (AUC) metric [6]. For experiments with artificial simulated data, where the response variable was multicategorical, we used weighted accuracy [14].
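A minimal sketch of weighted accuracy, assuming the common definition as the average of per-class accuracies (reference [14] gives the precise variant used); the toy labels below are illustrative:

```python
def weighted_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so each phenotype category contributes
    equally to the metric regardless of its prevalence."""
    per_class = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        per_class.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(per_class) / len(per_class)
```

Unlike plain accuracy, this metric is not dominated by the majority class: for `y_true = [0, 0, 0, 1]` and `y_pred = [0, 0, 1, 1]`, the per-class accuracies are 2/3 and 1, giving 5/6 rather than the raw 3/4.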