Conceived and designed the experiments: IPA. Performed the experiments: EY. Analyzed the data: EY RRA DCD WJ. Contributed reagents/materials/analysis tools: RRA DCD. Wrote the paper: EY.
The authors have declared that no competing interests exist.
One of the challenges in exploiting high throughput measurement techniques such as microarrays is the conversion of the vast amounts of data obtained into relevant knowledge. Of particular importance is the identification of the intrinsic response of a transcriptional experiment and the characterization of the underlying dynamics.
The proposed algorithm seeks to provide the researcher with a summary of the dynamic progression of a biological system as a whole, rather than of individual genes. The approach is based on the identification of a smaller number of expression motifs that define the transcriptional state of the system, which quantifies the deviation of the cellular response from a control state in the presence of an external perturbation. The approach is demonstrated with a number of data sets, including a synthetic base case and four animal studies. The synthetic dataset is used to establish the response of the algorithm on a “null” dataset, whereas the four experimental datasets represent a spectrum of possible time course experiments in terms of the degree of perturbation and cover a wide range of temporal sampling strategies. This range of experimental datasets allows us to explore the performance of the proposed algorithm and determine its ability to identify relevant information.
In this work, we present a computational approach which operates on high throughput temporal gene expression data to assess the information content of the experiment, identify dynamic markers of important processes associated with the experimental perturbation, and summarize in a concise manner the evolution of the system over time with respect to the experimental perturbation.
With the advent of microarray technologies for measuring genome-scale transcriptional responses, there has been a renewed interest in using computational methodologies to study systemic responses
For deciphering the dynamics of biological responses, temporal gene expression experiments record transcriptional changes over time with the goal of establishing a broader range of co-expression characteristics
In this paper we hypothesize that an emergent relation between genes may be an important feature denoting biological relevance of a gene by being part of a coherent response. This hypothesis arises from a basic concept of systems biology in which the response of an organism to an external stimulus is made up of the synchronized response of a group of genes
In this paper we extend the analysis earlier presented in
We provide first a short overview of the approach so that the reader can follow the discussion without delving into the algorithmic details, which are discussed extensively in later parts of the manuscript. The algorithm is an integrative clustering and selection algorithm. Rather than selecting genes based upon differential expression, the algorithm selects patterns (motifs) within the data based upon the over-representation of that specific pattern and its contribution to the overall response of the system. The proposed algorithm consists of two primary steps: (i) a fine-grained clustering algorithm to identify an extensive list of putative clusters, and (ii) a selection operation to determine which of the clusters are representative of the underlying response. The fine-grained clustering, based on a symbolic transformation, allows for the identification of a large number of possible expression motifs. The selection process that follows identifies the subset of the most critical and characteristic responses. The combinatorial selection of the informative subset of expression motifs is performed using a greedy and/or a global method. We have identified two metrics for quantifying deviations from homeostasis: a global metric, denoted by Δ, and a time-dependent metric, termed the transcriptional state and denoted by Δ(t). Our underlying hypothesis is that only informative motifs should contribute to deviations in the metrics from homeostasis.
In the case of the circadian dataset, the application of the greedy selection
a) The transcriptional state as a function of the number of clusters selected for the circadian dataset. Unlike the null synthetic dataset, there is a maximum at an intermediate number of clusters, thus signifying the incorporation of important information. b) The transcriptional state over time associated with the greedy selection. This response suggests a periodic circadian characteristic which is in agreement with the underlying data. c) The 24 clusters that were selected as informative. The selection of clusters which do not exhibit 24 hour periodicity may be due to the suboptimal greedy selection.
In the case of the acute administration of corticosteroids the maximum deviation of the transcriptional state occurs at some intermediate level
a) The transcriptional state as a function of the number of clusters selected for the acute corticosteroid dataset. Unlike the null synthetic dataset, there is a maximum at an intermediate number of clusters, thus signifying the incorporation of important information. b) The dynamics of the transcriptional state over time for the two selection methods. In this graph, it is clear that the overall characteristics of the dynamics do not change. However, the magnitude of the dynamics associated with the optimal selection is much greater than that of the greedy selection. c) The three clusters associated with the greedy selection. All of these clusters appear to have a similar deviation away from baseline and a return back to baseline. d) The optimal selection yields qualitatively similar profiles despite the fact that there is no overlap between the two sets. In this case, as in the greedy selection, there is a deviation away from baseline and a return back to baseline.
Under a chronic administration of corticosteroids, we identify a similar level of over-representation in the population dynamics as in the acute administration of corticosteroids. However, while the level of correlation associated with this dataset is not as low as that of the acute corticosteroid dataset, it is evident that there exists a subset of motifs that do show a significantly non-exponential characteristic. During the greedy selection process, we see a response which is qualitatively similar to that of the acute corticosteroid case as well as the circadian dataset in which a maximum is reached at an intermediate number of clusters (4), after which there is a decline,
a) The transcriptional state as a function of the number of clusters selected for the chronic corticosteroid dataset. Unlike the null synthetic dataset, there is a maximum at an intermediate number of clusters, thus signifying the incorporation of important information. b) The dynamics of the transcriptional state over time for the two selection methods. What is evident is that not only does the transcriptional state show a larger deviation, but the two wave effect is also more pronounced when utilizing the optimal selection. c) The four clusters associated with the greedy selection. There seem to be two distinct profiles associated with these clusters, consisting of a transient regulation and a sustained response that is active after an initial delay. d) The optimal selection yields 3 similar profiles.
The transcriptional state for this drug administration shows a similar dynamic, in which there is an initial deviation, and a slight return to baseline, before a second sustained response takes over
The burn injury dataset (GDS599) yielded 4 clusters with 281 probes under the greedy selection
a) The transcriptional state as a function of the number of clusters selected for the burn dataset. Unlike the null synthetic dataset, there is a maximum at an intermediate number of clusters, thus signifying the incorporation of important information. b) The dynamics of the transcriptional state over time for the two selection methods. What is evident is that not only does the transcriptional state show a larger deviation, but the two wave effect is also more pronounced when utilizing the optimal selection. c) The three clusters associated with the greedy selection. In the greedy selection there appears to be a two wave effect. d) Unlike in the previous datasets, there is a significant difference between the response from the optimal selection and that of the greedy selection. Under the optimal selection, there appears to be a two wave response, as well as four distinct activation events at four different time points, which may represent the activity of a cascade of signaling events in response to a significant thermal injury.
The profiles associated with the four clusters can be described as an early up-regulation event which returns to a different state, two bi-phasic responses containing genes which spike at two different time points, and finally a late up-regulatory event. In contrast to the result of the greedy selection, the optimal selection shows a clear progression in the activation of different genes. In the optimal selection, the first cluster shows a bi-phasic response similar to that selected under the greedy selection, whereas the other clusters essentially show spikes at different time points. These spikes suggest a cascade of events occurring in sequence, with each spike marking a short period during which a specific stage of the cascade is active. Unlike in the corticosteroid datasets, the optimal and greedy selections yielded some clusters which were qualitatively different, specifically the gene expression profiles which spiked at different time points.
However, through the examination of the transcriptional state, we are able to draw a link between the two results. The burn injury appears to produce an initial deviation as the organism responds to the original stimulus. After the cessation of this initial response, there is a slight return to the baseline at hour four. However, in both cases, a massive secondary event then drives the system either to a new steady state, as suggested by the optimal selection, or uncontrollably away from baseline, as in the greedy selection. Therefore, while the transcriptional state of the system from hours 0-8 appears to be consistent, the final response at 24 hours differs. These inconsistencies raise two questions: whether the differences between the two selection methods represent an artifact of the algorithm itself, or whether there is some relationship between the two results which, if resolved, may be more indicative of the underlying biological response and may aid in understanding the differences between the selection techniques.
The micro-clustering performed has allowed us to identify a large family of clusters. Depending on the nature of the data sets, we expect
In this plot, a) The Random dataset exhibits a high correlation with the theoretical exponential distribution (Graphs are Log-Normal) b) The circadian dataset exhibits a similar response suggesting that there is minimal perturbation. The last three datasets c) acute d) chronic e) burn all exhibit a significant deviation, especially in the tail region suggesting that the over-represented motifs occur at a rate greater than would be suggested by chance.
From this response, it is evident that the incorporation of additional clusters only adds noise into the system. Thus, no real information is present within the system.
One of the difficulties associated with the optimal selection of motifs lies in the combinatorial nature of the problem. Thus, even after eliminating the large number of motifs via their population, the combinatorial problem is not eliminated, only mitigated. Adding to this issue is the fact that the problem must be solved parametrically. Currently, we perform an exhaustive search on all possible combinations of m optimal motifs from a base population of M. For the burn dataset, there were only 10 motifs which were over-represented. Thus, parametrically exploring all possible cluster sizes was possible. However, for the two corticosteroid datasets this was not the case. In both of the corticosteroid datasets, we solved the problem parametrically for m<7. By plotting the progression of the cumulative transcriptional state for the burn dataset
After a certain point, the introduction of additional clusters appears to decrease the transcriptional state.
This suggests that one may be able to reduce the set to be considered for the optimal solution by looking at these peaks only.
One issue of concern is the reliance upon over-represented motifs when conducting the optimal selection. Because we are using an exhaustive enumeration of motifs, it is critical to identify a subset of possibly meaningful motifs. However, in the case of the circadian dataset, we are unable to identify over-represented motifs, and would thus have to run the selection upon all of the motifs in the dataset. This combinatorial problem has not been addressed in the current algorithm, but could be addressed by more complex heuristics implemented in the future.
The proposed algorithm represents a different method for processing high throughput temporal gene expression data. Rather than assessing the importance of a single gene on a case by case basis, we instead propose examining the importance of a specific pattern. Furthermore, the importance of this pattern is evaluated within the context of its contribution to an inherent underlying dynamic which is not known
We have structured a compendium of transcriptional responses in order to elucidate the insights of the overall approach. A synthetic dataset is created where values are drawn from an N(0,1) distribution in order to illustrate basic properties of the calculations. In addition, four experimental datasets are evaluated, and the raw data can be found in the Gene Expression Omnibus database
The first experimental dataset, accession number GDS1629
The second dataset, accession number GDS253
In this experiment, a significant, yet reversible, perturbation has occurred to the system such that there should be a clear deviation away from the baseline case followed by a return to the control state. This case was selected to validate that the presence of a significant perturbation is visible along with the non-randomness of the dataset. This dataset has the added advantage of having a well-characterized mechanism, which allows us to assess whether the temporal variations in the transcriptional state have meaning with respect to the underlying biological phenomenon. Given the number of time points associated with this dataset, it is the only dataset which was run with piecewise averaging. It was run with a piecewise averaging of 2, such that adjacent points are averaged to obtain a single data point. Because 17 time points cannot be divided evenly into pairs, this dataset was extrapolated to 18 time points with the final time point occurring at 80 hours.
The next dataset, accession number GDS972
The final dataset which is evaluated is listed under accession number GDS599. This dataset represents a serious cutaneous burn administered to a rat over 30 percent of the skin. After the administration of the burn, the livers were harvested at 5 individual time points [0, 1, 4, 8, 24 hrs], and the gene expression data was obtained using the RG-U34A microarray. Unlike the corticosteroid datasets in which there is a single reversible perturbation to the system, this final dataset represents the induction of a complex series of events in response to the severe injury. Thus, this dataset will be used to investigate the ability of the algorithm to identify significant and salient changes within the system in response to a complex phenomenon.
The preliminary step, i.e., the fine grained clustering operation, divides the temporal expression data into a large number of clusters in which the similarity between the different expression profiles in a cluster is expected to be very high. As such, any clustering algorithm could in principle accomplish this first task. However, we have elected to explore the basic principles of the HOT SAX representation
In order to emphasize the role of the shape of the responses the data is first z-score normalized such that all of the expression profiles are of the same magnitude:
In this figure, a randomly generated signal is discretized with a piecewise averaging parameter of 2 with 3 equiprobable partitions.
The set AB defines the so-called “alphabet” which is a set of symbols with cardinality equal to the number of equiprobable partitions of the Gaussian curve.
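The symbolic transformation described above (z-score normalization, piecewise averaging, and assignment to equiprobable partitions of the Gaussian curve) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, symbols are represented as the integers 0 to AB-1, and the population standard deviation is assumed for the z-score.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def zscore(profile):
    """Rescale one expression profile to zero mean and unit variance."""
    mu = mean(profile)
    sigma = pstdev(profile)  # assumption: population standard deviation
    if sigma == 0:
        # A flat profile has no shape; map it to all zeros.
        return [0.0 for _ in profile]
    return [(x - mu) / sigma for x in profile]


def piecewise_average(profile, window):
    """Average consecutive, non-overlapping windows of `window` points."""
    if len(profile) % window != 0:
        raise ValueError("profile length must be a multiple of the window")
    return [mean(profile[i:i + window])
            for i in range(0, len(profile), window)]


def breakpoints(alphabet_size):
    """Cut points that split N(0, 1) into equiprobable partitions."""
    normal = NormalDist()
    return [normal.inv_cdf(k / alphabet_size)
            for k in range(1, alphabet_size)]


def symbolize(profile, alphabet_size=3, window=2):
    """Return the symbol sequence (integers 0..alphabet_size-1)."""
    cuts = breakpoints(alphabet_size)
    averaged = piecewise_average(zscore(profile), window)
    # bisect_right gives the index of the partition containing each value.
    return [bisect_right(cuts, value) for value in averaged]
```

For example, a steadily rising six-point profile such as `[1, 2, 3, 4, 5, 6]` with a piecewise averaging parameter of 2 and 3 partitions maps to the sequence low, middle, high.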
After a gene expression profile has been converted into a sequence of symbols, the sequence is converted into an integer through the use of an appropriate hashing function. Following the formalism of
Because of the underlying equiprobable distribution associated with HOT SAX, randomly generated expression profiles will be assigned different hash values with equal probability. Because of the equi-probable assignment of hash values with respect to randomly generated data, the population of a given cluster can be modeled via a Poisson process. However, in the case where there exists approximately the same number of possible hash values as genes to be hashed, this Poisson distribution can be modeled as an exponential distribution
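The hashing step and the resulting cluster populations can be illustrated with a short sketch. The hash used here, reading the symbol sequence as a base-AB integer, is an assumption chosen because it assigns one integer per motif; the source's own hashing formula is not reproduced in this text. The null model draws every symbol uniformly, which takes the equiprobability statement above at face value. All function names are hypothetical.

```python
import random
from collections import Counter


def hash_symbols(symbols, alphabet_size):
    """Read a symbol sequence as a base-`alphabet_size` integer."""
    value = 0
    for s in symbols:
        if not 0 <= s < alphabet_size:
            raise ValueError("symbol outside the alphabet")
        value = value * alphabet_size + s
    return value


def cluster_populations(symbol_sequences, alphabet_size):
    """Map each hash value (motif) to the number of profiles it holds."""
    return Counter(hash_symbols(seq, alphabet_size)
                   for seq in symbol_sequences)


def population_histogram(populations):
    """Count how many motifs hold exactly k profiles, for each observed k."""
    return Counter(populations.values())


def random_symbol_sequences(n_profiles, length, alphabet_size, seed=0):
    """Null model: every symbol drawn uniformly (an idealization)."""
    rng = random.Random(seed)
    return [[rng.randrange(alphabet_size) for _ in range(length)]
            for _ in range(n_profiles)]
```

Calling `population_histogram(cluster_populations(random_symbol_sequences(5000, 9, 3), 3))` gives the number of motifs at each population size for a null dataset, which is the quantity compared against the experimental distribution.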
To evaluate whether a significant perturbation exists within the data, the hash-based clustering is run on the experimental data and a distribution of cluster membership is obtained. This is compared to a synthetic null dataset generated from random data with the same number of genes and time points as the experimental dataset. A standard permutation analysis is performed for estimating the statistical significance of a result
The behavior of HOT SAX on randomly generated data thereby allows us to select the parameter AB in a systematic manner. In a real dataset, it is hypothesized that significant coordination will occur, and therefore the performance of the hashing operation should show deviations from the theoretical exponential distribution. Thus, the HOT SAX algorithm should be run on a given dataset where AB is varied parametrically, and the AB which corresponds to the lowest correlation to the null response will be chosen as the optimal AB.
The majority of approaches for analyzing time course gene expression data are based on the fundamental premise that over-populated motifs are indicative of significant events and thus search for them as the main priority
In order to address these two issues we introduce a term which we denote as the “transcriptional state”. The transcriptional state is a metric, which quantifies the deviation from a control. The control state can be arbitrarily defined, since we are interested in deviations and not absolute values. We assume that the “control” state corresponds to time t = 0, i.e, right before the systemic perturbation, if any. The baseline state is defined as the distribution of expression values of a set of genes at the control state. To quantify the deviation from this baseline state, the difference in the distribution at any future time t and the control state is evaluated. To do so, we make use of the Kolmogorov-Smirnov (KS) Statistic
While it has been shown that if one takes a large enough set of genes the distribution of values is expected to follow the log-normal distribution
The task is therefore to identify a subset of genes such that the shift in an organism's state as it responds to the external stimulus is maximized. This plot depicts the distribution of expression values for the burn injury data. The plot is a modification of the one presented in
The sequence Δ(t) is defined as the transcriptional state of the system, as it quantifies the level of transcriptional deviation from a control state. In order to evaluate a time-independent metric of the difference between two distributions, any norm can be used on Δ(t). We opt for the simplest evaluation using the L1 norm, which quantifies the deviation over all time points during the experimental protocol. Therefore, the scalar quantifying the difference between the two distributions is defined as the sum over the sampled time points: Δ = Σ_t |Δ(t)|.
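A minimal sketch of the two metrics follows, assuming that Δ(t) is the two-sample Kolmogorov-Smirnov statistic between the expression values of the selected genes at time t and at the control time point, and that Δ is the L1 norm of that sequence. The function names are hypothetical, and whether the statistic is computed on raw or normalized values is not settled by the text here.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two ECDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap


def transcriptional_state(profiles, control_index=0):
    """Delta(t): KS distance between the values at each time point and
    the values at the control time point, over the given profiles."""
    n_times = len(profiles[0])
    control = [p[control_index] for p in profiles]
    return [ks_statistic([p[t] for p in profiles], control)
            for t in range(n_times)]


def global_deviation(delta_t):
    """Delta: L1 norm of the transcriptional state sequence."""
    return sum(abs(d) for d in delta_t)
```

By construction Δ(t) is zero at the control time point, so only departures at later time points contribute to Δ.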
Having defined a metric quantifying the deviation of the current state from a control, the selection step of the algorithm identifies a subset of motifs composed of genes whose transcriptional state is responsible for the maximum deviation from the control state. Two interesting properties of the transcriptional state are worth exploring further. The first relates to the changing character of the transcriptional dynamics, i.e., the deviation from the control, as more and more clusters are added. Based on the hypothesis that the totality of the transcriptome hides the informative components of the response, one should expect (see
However, the identification of an informative subset of motifs represents a difficult combinatorial problem. Given that the number of possible motifs is AB^T, where T is the number of piecewise averaged time points, the number of combinations that need to be evaluated is computationally intractable. To compensate for the combinatorial nature of the problem, we propose two different methods for carrying out the selection of informative motifs. The first method which we propose is the use of a greedy algorithm
The advantage of utilizing a greedy selection lies in the fact that the combinatorial nature of the problem is ignored at the cost of finding a sub-optimal though possibly “good enough” solution.
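One plausible reading of the greedy selection is sketched below. The details of the procedure are not fully recoverable from this text, so the following is an assumption: at each step the cluster whose addition gives the highest score is added to the pool, and the prefix of the selection order at which the score peaks is returned. The names `greedy_selection` and `score` are hypothetical; `score` stands for any function that maps a list of profiles to a number, such as the global metric Δ.

```python
def greedy_selection(clusters, score):
    """Greedily order clusters and keep the best-scoring prefix.

    clusters: dict mapping a motif id to its list of expression profiles.
    score:    callable taking a list of profiles and returning a number.
    Returns (selected motif ids, score after each greedy addition).
    """
    remaining = dict(clusters)
    order, pooled, trajectory = [], [], []
    while remaining:
        # Pick the cluster that maximizes the score of the enlarged pool.
        best_id = max(remaining,
                      key=lambda cid: score(pooled + remaining[cid]))
        pooled = pooled + remaining.pop(best_id)
        order.append(best_id)
        trajectory.append(score(pooled))
    if not trajectory:
        return [], []
    # The informative subset is the prefix at which the score peaks.
    best_size = trajectory.index(max(trajectory)) + 1
    return order[:best_size], trajectory
```

The returned trajectory is the score as a function of the number of clusters selected, which is the curve that peaks at an intermediate number of clusters for the perturbed datasets. Each step evaluates every remaining cluster once, so the cost grows quadratically with the number of clusters rather than combinatorially.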
An alternative method for selecting an optimal subset of motifs is to limit the set of motifs that will be considered. Rather than considering a superset of AB^T different motifs, we limit our evaluation to over-populated motifs, which decreases the number of combinations that must be evaluated to a more tractable number. To perform this reduction, we define an over-populated motif as a motif whose population is greater than would be expected if the HOT SAX hashing algorithm were performed upon a randomly generated dataset comprising the same number of probe sets and time points as the dataset being evaluated. After the initial set of motifs has been filtered, we generate all possible combinations of motifs and evaluate the transcriptional state of each; as in the greedy selection, the set of motifs which yields the maximum transcriptional state is identified as the informative subset. The advantage of this method is that, unlike the greedy algorithm, we can be sure that the selected set of motifs is indeed optimal over the filtered set. However, while this filtering step eliminates a large number of combinations, it still requires the evaluation of a large number of possible combinations and thus is computationally expensive.
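The filtering and exhaustive enumeration can be sketched as follows. This is an illustration under stated assumptions: the over-population threshold is taken to be the largest cluster population observed in a null run, which is only one way to read "greater than would be expected", and the names `overpopulated`, `exhaustive_selection`, and `score` are hypothetical.

```python
from itertools import combinations


def overpopulated(populations, null_max_population):
    """Motifs whose population exceeds the largest seen in a null run."""
    return [motif for motif, n in populations.items()
            if n > null_max_population]


def exhaustive_selection(clusters, candidates, score, max_size):
    """Best-scoring subset of `candidates` with at most `max_size` motifs.

    clusters:   dict mapping a motif id to its list of profiles.
    candidates: motif ids that survived the over-population filter.
    score:      callable taking a list of profiles, returning a number.
    """
    best_subset, best_score = (), float("-inf")
    for m in range(1, min(max_size, len(candidates)) + 1):
        for subset in combinations(candidates, m):
            pooled = [p for motif in subset for p in clusters[motif]]
            value = score(pooled)
            if value > best_score:
                best_subset, best_score = subset, value
    return best_subset, best_score
```

The `max_size` argument corresponds to solving the problem parametrically in the subset size m, and it bounds the number of subsets that are enumerated.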
Because the result of the HOT SAX algorithm itself depends upon the selection of the alphabet, AB, we further investigated how well the datasets correspond to the underlying exponential distribution as the value of AB is altered parametrically. Therefore, the previous evaluation as to whether a dataset consists of a significant perturbation was re-run by varying the AB parameter from 3 to 5 which are commonly used alphabet sizes. In
The distributions of hash values for the circadian and randomly generated datasets appear to be drawn from an exponential distribution, whereas those for the acute, chronic, and burn datasets do not. We select as optimal the alphabet size that shows the lowest similarity with the exponential distribution. The burn dataset is evaluated only for alphabet sizes 4-5, because with an alphabet size of 3 the number of clusters (243) is too small for the exponential approximation to be used. For the chronic dataset, an alphabet size of 5 was not considered because 49 million clusters would need to be evaluated, which is several orders of magnitude greater than the 15 thousand genes being evaluated.
From this behavior, we hypothesized that the selection of the AB parameter should aim to maximize the observed deviation. Thus to conduct the selection of informative motifs, we have elected to utilize the results from the parametric evaluation and select an AB of 3 for the corticosteroid datasets, and an AB of 4 for the circadian and burn datasets. Despite the fact that the circadian dataset does not illustrate any defining perturbation, the selection of AB of 4 allows us to maintain a consistency
For the optimal alphabet size, we evaluate the population distribution of the individual datasets. In
Dataset | Number of Possible Motifs | Number of Nonzero Motifs |
Synthetic Dataset | 19683 | 4918 |
Circadian Dataset | 65536 | 3898 |
Acute Corticosteroid Dataset | 19683 | 13320 |
Chronic Corticosteroid Dataset | 177147 | 7180 |
Burn Dataset | 1024 | 491 |
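The "Number of Possible Motifs" column can be checked against the AB^T count with a short calculation. The alphabet sizes are those stated in the text; the number of symbols per profile for each dataset is inferred here from the table values and is therefore an assumption, as are the names `motif_space` and `occupancy`.

```python
def motif_space(alphabet_size, n_symbols):
    """Number of distinct symbol sequences (possible motifs): AB ** T."""
    return alphabet_size ** n_symbols


# (alphabet size, symbols per profile, nonzero motifs from the table).
# The symbols-per-profile values are inferred from the table, not quoted.
TABLE = {
    "Synthetic": (3, 9, 4918),
    "Circadian": (4, 8, 3898),
    "Acute corticosteroid": (3, 9, 13320),
    "Chronic corticosteroid": (3, 11, 7180),
    "Burn": (4, 5, 491),
}


def occupancy(alphabet_size, n_symbols, nonzero):
    """Fraction of the possible motifs that hold at least one profile."""
    return nonzero / motif_space(alphabet_size, n_symbols)
```

The same function reproduces the two counts quoted for the alphabet-size sweep: `motif_space(3, 5)` is 243 for the burn dataset, and `motif_space(5, 11)` is 48,828,125, roughly 49 million, for the chronic dataset.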