Conceived and designed the experiments: AAS MC. Performed the experiments: AAS. Analyzed the data: AAS AV. Contributed reagents/materials/analysis tools: AV CAB. Wrote the paper: AAS MC.

The authors have declared that no competing interests exist.

We present an approach for answering similarity queries about gene expression time series that is motivated by the task of characterizing the potential toxicity of various chemicals. Our approach involves two key aspects. First, our method employs a novel alignment algorithm based on time warping. Our time warping algorithm has several advantages over previous approaches. It allows the user to impose fairly strong biases on the form that the alignments can take, and it permits a type of local alignment in which the entirety of only one series has to be aligned. Second, our method employs a relaxed spline interpolation to predict expression responses for unmeasured time points, such that the spline does not necessarily exactly fit every observed point. We evaluate our approach using expression time series from the E

We are developing an approach to characterize chemicals and environmental conditions by comparing their effects on gene expression with those of well characterized treatments. We evaluate our approach in the context of the E

Characterizing and comparing temporal gene expression responses is an important computational task for answering a variety of questions in biological studies. We present an approach for answering similarity queries about gene expression time series that is motivated by the task of characterizing the potential toxicity of various chemicals. Our approach is designed to handle the plethora of problems that arise in comparing gene expression time series, including sparsity, high-dimensionality, noise in the measurements, and the local distortions that can occur in similar time series.

The task that we consider is motivated by the need for faster, more cost-efficient protocols for characterizing the potential toxicity of industrial chemicals. More than 80,000 chemicals are used commercially, and approximately 2,000 new ones are added each year. This number makes it impossible to properly assess the toxicity of each compound in a timely manner using conventional methods. However, the effects of toxic chemicals may often be predicted by how they influence global gene expression over time. By using microarrays, it is possible to measure the expression of thousands of genes simultaneously. It is likely that transcriptional profiles will soon become a standard component of toxicology assessment and government regulation of drugs and other chemicals.

One resource for toxicology-related gene expression information is the E

(A) The curves show the actual hidden expression profile for each treatment, even though we must rely on the noisy sampled observations (the dots). (B) We have reconstructed the profiles at unobserved times, and used them to perform a similarity query. The highlighted areas represent possible good matches.

The computational task that we consider is illustrated in

There are several properties of the expression time series at hand that are important considerations for our work.

These properties of the data result in several additional challenges for the task we consider.

The time points present in a given query may not correspond to measured points in some or any of the time series in the database.

Queries may be of variable size. Some queries may consist of only a single observation, whereas others may contain multiple time points. Additionally, queries may vary in their extent: some may span only a few hours whereas others include measurements taken over days.

A given query and its best match in the database may differ in the amplitude, temporal offset, or temporal extent of their responses. For example, the expression profile represented by a query treatment may be similar to a database treatment except that the gene expression responses are attenuated, or occur later, or take place more slowly.

A given query and its best match in the database may differ in that one of them shows more of the temporal evolution of the treatment responses. In other words, the query may be similar to a truncated version of the database series, or vice versa.

To address these challenges, we have developed a generative model that approaches the problem from a probabilistic perspective. In order to temporally align gene-expression time series using our model, we employ a novel method for

Our time warping approach differs in several substantial ways from the standard dynamic programming method. Unlike the standard approach, our method does not force the two series to be globally aligned. Instead, it permits a type of

We also investigate variations on spline interpolation in order to find an approach that results in accurate reconstructions of sparsely sampled time series. We find that we achieve more accurate interpolations when using higher order splines. Further, our experiments indicate that it is helpful to relax the splines' fit to the observed data, rather than potentially overfitting by exactly intercepting each observed data point.

In earlier work, our group

Lamb et al.

Aach and Church

Bar-Joseph et al.

Listgarten et al.

A related approach to aligning time series is proposed by Gaffney and Smyth

Another similar approach is

Our approach is also related to various probabilistic sequence models, such as

In this section we detail our generative model for classifying and aligning time series, and present a dynamic programming algorithm that is able to find optimal alignments under this model. We also present a review of B-spline interpolation and discuss some useful variations of the method. We use spline interpolation to reconstruct unobserved microarray observations.

Our approach to answering similarity queries involves three basic steps: (i) we use interpolation methods as a preprocessing step to reconstruct unobserved expression values from our sparse time series; (ii) we use our alignment method to find the highest scoring alignment of the query series to each treatment series in the database; (iii) we return the treatment from the database that is most similar to the query, and the calculated alignment between the two series.

We have implemented all our algorithms in Java. The source code is available for download at

One challenge that arises when aligning a pair of expression time series is that the series may have been sampled at different time points. Moreover, the sampling may be sparse and occur at irregular intervals. To address these issues, we first use an interpolation method to reconstruct the unobserved parts of the time series before trying to align them. This interpolation step allows us to represent each time series by regularly spaced observations. We refer to the “observations” that come from interpolation, as opposed to measurement, as pseudo-observations.

Although linear interpolation is a natural first approximation, other work has explored the use of B-splines to better reconstruct missing expression data

As shown in _{i}_{,k} is the

The main spline, which fits the observed points, is a weighted sum of the basis splines shown at the bottom of the figure. These are defined by the Cox-de Boor recursion formulas (Equations 2 and 3) in conjunction with pre-defined points of discontinuity (the vertical lines). The weights, called control points, are easily obtained by solving a set of linear equations.
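The recursion can be illustrated with a short sketch. It is written in Python for brevity and is not taken from the Java implementation mentioned above; the function names, the example knot vector, and the choice of a clamped knot vector are assumptions made only for illustration.

```python
def bspline_basis(i, k, t, knots):
    """Value at t of the i-th B-spline basis function of order k
    (polynomial degree k - 1), computed by the Cox-de Boor recursion."""
    if k == 1:
        # Order-1 basis: indicator of the half-open knot interval.
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    left_span = knots[i + k - 1] - knots[i]
    right_span = knots[i + k] - knots[i + 1]
    left = 0.0
    right = 0.0
    # Zero-width spans (repeated knots) contribute nothing by convention.
    if left_span > 0:
        left = (t - knots[i]) / left_span * bspline_basis(i, k - 1, t, knots)
    if right_span > 0:
        right = ((knots[i + k] - t) / right_span
                 * bspline_basis(i + 1, k - 1, t, knots))
    return left + right


def spline_value(t, control_points, k, knots):
    """The spline is the sum of the basis functions weighted by the control points."""
    return sum(c * bspline_basis(i, k, t, knots)
               for i, c in enumerate(control_points))


# Hypothetical example: five control points, order 4 (cubic), times in hours.
example_knots = [0, 0, 0, 0, 48, 96, 96, 96, 96]
example_value = spline_value(30.0, [0.0, 1.8, 0.4, 1.1, 0.9], 4, example_knots)
```

Inside the knot range the basis functions sum to one, so the spline value at any time is a weighted average of nearby control points.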

It follows that the segments of the

The weights _{i}_{i}_{i}

With fewer than

Unfortunately, B-splines have a tendency to overfit curves in data-impoverished conditions. Such reconstructions can show large oscillations in an attempt to exactly intercept every observed data point. This can be especially problematic with microarray data, which are already inherently noisy. The solution we use is to solve for the control points of a low-order spline, and then use those control points for a higher-order one. Such a spline will tend to fall within the convex hull created by the lower-order spline
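The following Python sketch illustrates the idea of reusing low-order control points in a higher-order spline. It relies on two assumptions that the text above does not specify: the low-order spline is taken to be piecewise linear with knots at the observed times (so its control points are simply the observed values), and the knot vector of the higher-order spline is clamped, with interior knots taken from the observed times after dropping some near the ends. All names are hypothetical.

```python
def de_boor(x, knots, ctrl, degree):
    """Evaluate a B-spline at x with de Boor's algorithm."""
    n = len(ctrl)
    k = n - 1  # default to the last span, which also covers x == end
    for idx in range(degree, n):
        if knots[idx] <= x < knots[idx + 1]:
            k = idx
            break
    d = [ctrl[j + k - degree] for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            lo = knots[j + k - degree]
            hi = knots[j + 1 + k - r]
            alpha = (x - lo) / (hi - lo)
            d[j] = (1.0 - alpha) * d[j - 1] + alpha * d[j]
    return d[degree]


def relaxed_spline(times, values, degree):
    """Use the observed values (control points of the interpolating linear
    spline) as control points of a spline of the requested degree."""
    n = len(values)
    degree = min(degree, n - 1)       # fall back to a lower degree if data are scarce
    n_interior = n - degree - 1
    skip = (degree - 1) // 2          # assumption: drop interior times near the start
    interior = list(times[1 + skip:1 + skip + n_interior])
    knots = [times[0]] * (degree + 1) + interior + [times[-1]] * (degree + 1)
    ctrl = list(values)
    return lambda x: de_boor(x, knots, ctrl, degree)


# Hypothetical observations for one gene (hours, expression level).
obs_times = [0, 6, 24, 48, 96]
obs_values = [0.0, 1.8, 0.4, 1.1, 0.9]
linear = relaxed_spline(obs_times, obs_values, 1)   # passes through every point
cubic = relaxed_spline(obs_times, obs_values, 3)    # smoother, need not hit the points
```

With degree 1 the sketch reduces to ordinary linear interpolation. With degree 3 the curve no longer passes through every observation, but it stays between the smallest and largest control point, which limits the oscillations described above.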

Each possible alignment we consider for two given time series (the query and the database series) partitions the series into

The best alignment between the query treatment and the database treatment being considered involves three segments. The first two segments of the database treatment have increased amplitude, the first segment is contracted (or stretched in), and the third segment is stretched out in order to approximate the observed query treatment. Also the alignment shorts before the database treatment has ended, as there is no evidence that the query treatment expression has begun to increase again at the end.

To determine the similarity between a query time series

Given this generative process idea, we calculate the probability of a particular alignment of query _{i}_{i}_{i}_{i}

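Here, P_{m}, P_{s}, P_{a}, and P_{e} denote the probability terms of the model for the number of segments, the stretching values, the amplitude values, and the observed expression values, respectively.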

To represent _{s}

We choose this distributional form because it is a variation of the log normal distribution that is symmetric around one, such that

We use a similar distribution to represent _{a}
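One way to obtain a density that is symmetric around one, in the sense that a factor of x and a factor of 1/x are equally probable, is to make it depend only on the squared logarithm of the factor. The Python sketch below is an assumption about the general shape of such a distribution, not the exact form used in our model; the function name, the base-2 logarithm, and the omission of a normalizing constant are all illustrative choices.

```python
import math


def symmetric_factor_density(x, sigma):
    """Unnormalized density over positive factors x that depends on x only
    through log2(x) ** 2, so that x and 1 / x receive the same value."""
    z = math.log2(x) / sigma
    return math.exp(-0.5 * z * z)


# Doubling and halving are penalized equally; a factor of one is most probable.
doubling = symmetric_factor_density(2.0, 1.0)
halving = symmetric_factor_density(0.5, 1.0)
no_change = symmetric_factor_density(1.0, 1.0)
```

Unlike the ordinary log-normal density, this form omits the 1/x factor, which is what makes the values at x and 1/x equal.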

To calculate _{e}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}_{i}l_{i}r_{i}l_{i}r_{i}_{i}_{i}

Our model for “generating” points in the query series from a point in the database series is a Gaussian centered at the database point. Let _{e}_{e}_{i}

In other words, we center a Gaussian on the expression level at the mapped time coordinate in the database series, and ask how probable the scaled expression value from the query looks at that time coordinate.
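For a single expression value, this calculation can be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes that the query value has already been scaled by the amplitude coefficient and that the database value has already been interpolated at the mapped time.

```python
import math


def log_point_probability(scaled_query_value, db_value, sigma_e):
    """Log density of the scaled query value under a Gaussian that is
    centered on the database value, with standard deviation sigma_e."""
    variance = sigma_e ** 2
    return (-0.5 * math.log(2.0 * math.pi * variance)
            - (scaled_query_value - db_value) ** 2 / (2.0 * variance))


# A query value close to the database value is more probable than a distant one.
close = log_point_probability(1.1, 1.0, 0.5)
far = log_point_probability(2.5, 1.0, 0.5)
```

Working with log densities avoids numerical underflow when many such terms are later combined.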

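To generalize this calculation to multiple observations in the query series, we make the simplifying assumption that the observations are independent, so that the probability of a set of observations is the product of the probabilities of the individual observations.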

Each of our observations represents measurements for hundreds of genes. We therefore generalize the description above by having _{e}

We assume that _{e}

In addition to considering the likelihood of the query series under the assumption that it exhibits a similar response to the given database series, we also consider its likelihood under a null model. The notion of a null model here is one that generates alignments by randomly picking observations from the database to align with the query sequence. The rationale for using such a null model is analogous to the use of a model of

The value of a null model for our application is that it enables alignments of differing lengths, including shorted alignments, to be compared on an equal footing. Under our scoring function which incorporates the null model, segments have a positive score only if the database series in that segment explains the corresponding segment from the query series better than the null model does.
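The role of the null model in the score can be illustrated with a small Python sketch. The way the null probability is computed here, as the average Gaussian density over all database values, is an assumption chosen to mimic "randomly picking observations from the database"; it is not necessarily the estimate used in our method, and all names are hypothetical.

```python
import math


def gaussian_density(x, mean, sigma):
    """Gaussian probability density at x."""
    return (math.exp(-((x - mean) ** 2) / (2.0 * sigma ** 2))
            / (sigma * math.sqrt(2.0 * math.pi)))


def log_likelihood_ratio(query_values, aligned_db_values, all_db_values, sigma):
    """Log of (probability under the aligned database values) divided by
    (probability under a null model that picks database values at random)."""
    score = 0.0
    for q, d in zip(query_values, aligned_db_values):
        aligned = gaussian_density(q, d, sigma)
        null = (sum(gaussian_density(q, other, sigma) for other in all_db_values)
                / len(all_db_values))
        score += math.log(aligned) - math.log(null)
    return score


database = [0.0, 1.0, 2.0, 3.0]
good = log_likelihood_ratio([1.0, 2.0], [1.0, 2.0], database, 0.5)
bad = log_likelihood_ratio([1.0, 2.0], [3.0, 0.0], database, 0.5)
```

In this toy example the well-matched segment scores above zero and the mismatched one scores below zero, which is the property that lets alignments of different lengths be compared.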

Let _{DB}_{e}_{DB}_{e}

We then estimate the probability of the

Since our null model assumes that there is only a single segment with no amplitude change or stretching, we can compute the probability of the entire query series

Putting together the terms above, we can score a given alignment based on the log of the likelihood ratio of the query series under the “database series” model versus the query series under the null model as:

Up to now we have described this process in terms of using a database series to generate the query series. However, we want our alignment method to be symmetric so that it does not matter which series we consider to be the query and which we consider to be from the database. Due to the last two terms, this will not necessarily be the case using the scoring function defined above. Therefore, we modify the scoring function so that it also considers using the query series to generate the database series:

Here _{e}_{i}_{i}_{i}_{i}_{e}_{i}_{i}_{i}_{i}_{i}_{i}

Given a pair of time series, we do not know a priori which alignment (i.e., placement of corresponding segments) is optimal. However, we can find the optimal alignment using dynamic programming. The following algorithm takes as input two time series, termed

In particular, given a segment pair (_{i}_{i}

The arguments to this scoring function define the leftmost and rightmost time coordinates of the segments being aligned from the query series and the database series. These points are selected from the set of regularly spaced observations mentioned above. The stretching parameter, _{i}_{i}_{i}

The core of the dynamic program involves filling in a three-dimensional matrix Γ in which each element

We define

Here, _{m}

Recall that we are interested in possibly shorting the alignment, thus finding a local alignment rather than a global one. Allowed alignments are those that explain the entire extent of at least one of the two given time series. In order to recover the optimal alignment, we use a traceback procedure that involves scanning the elements of Γ that represent alignments that include the entirety of the query series, the entirety of the database series, or both. The procedure returns the alignment corresponding to the highest-scoring entry among these. More formally, we find the score of the best alignment as follows, and start the traceback from the identified element:
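The overall shape of such a dynamic program can be sketched in Python as follows. This is a simplified illustration rather than our implementation: the indexing of the matrix, the callback `segment_score` (assumed to return the log score of aligning one query segment with one database segment, including any stretching and amplitude terms), and the callback `log_p_segments` (assumed to return the log probability of using a given number of segments) are all hypothetical.

```python
def best_alignment(n_query, n_db, max_segments, segment_score, log_p_segments):
    """Return (score, segments) for the best alignment that starts at index 0
    of both series and covers all of at least one of them.  Each segment is a
    tuple (query_start, query_end, db_start, db_end) of grid indices."""
    neg_inf = float("-inf")
    # gamma[m][i][j]: best score with exactly m segments ending at query i, db j.
    gamma = [[[neg_inf] * n_db for _ in range(n_query)]
             for _ in range(max_segments + 1)]
    back = {}
    gamma[0][0][0] = 0.0
    for m in range(1, max_segments + 1):
        for i in range(1, n_query):
            for j in range(1, n_db):
                for a in range(i):
                    for b in range(j):
                        prev = gamma[m - 1][a][b]
                        if prev == neg_inf:
                            continue
                        score = prev + segment_score(a, i, b, j)
                        if score > gamma[m][i][j]:
                            gamma[m][i][j] = score
                            back[(m, i, j)] = (a, b)
    # Scan only cells that use up the whole query or the whole database series.
    best_score, best_cell = neg_inf, None
    for m in range(1, max_segments + 1):
        for i in range(1, n_query):
            for j in range(1, n_db):
                if i != n_query - 1 and j != n_db - 1:
                    continue
                if gamma[m][i][j] == neg_inf:
                    continue
                total = gamma[m][i][j] + log_p_segments(m)
                if total > best_score:
                    best_score, best_cell = total, (m, i, j)
    # Trace back from the best cell to recover the segments.
    segments = []
    if best_cell is not None:
        m, i, j = best_cell
        while m > 0:
            a, b = back[(m, i, j)]
            segments.append((a, i, b, j))
            m, i, j = m - 1, a, b
        segments.reverse()
    return best_score, segments


def toy_score(a, i, b, j):
    """Toy scoring function: rewards segments of equal length in both series."""
    return 1.0 if (i - a) == (j - b) else -float(abs((i - a) - (j - b)))


toy_result = best_alignment(4, 6, 2, toy_score, lambda m: 0.0)
```

In the toy call, the query has four grid points and the database series has six, so the best alignment covers the whole query and stops short of the end of the database series. The five nested loops make the cost grow quickly with series length, which is why coarse, regularly spaced pseudo-observations are convenient in a sketch like this.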

This dynamic program can be thought of as having three key “penalty terms” that determine the relative scores of alignments. These penalty terms correspond to the probability distributions that govern (i) the number of segments, (ii) the stretching values, and (iii) the amplitude values used in an alignment.

Preferences for the number of segments to be used in alignments are expressed by providing a distribution for _{m}_{a}

In this section we present experiments that evaluate the utility of our novel time warping method and spline models for the task of answering similarity queries with expression profiles.

The data we use in our experiments comes from the Edge toxicology database

Each observation is associated with a treatment and a time point. The treatment refers to the chemical to which the animals were exposed and its dosage. The time point indicates the number of hours elapsed since exposure occurred. Times range from 6 hours up to 96 hours. The data used in our computational experiments span 11 different treatments, and for each treatment there are observations taken from at least three different time points.

We can assume that for all treatments there exists an implicit observation at time zero. This is the time at which the treatment was applied, so all expression values are assumed to be at base level. Therefore every query automatically includes at least two observations: the actual query time(s) and the zero point. Thus earlier points in time can be interpolated, even when there seems to be only a single query observation.

Linear interpolation is used between these observations of 2,3,7,8-tetrachlorodibenzo-

Before we evaluate our generative alignment method, we wish to determine which type of spline (including simple linear interpolation) is the best to use in our preprocessing step. We do this by running a leave-one-out experiment in which we classify each observation in our data set in turn, using the remaining observations as the database. However, we exclude from the database any observation with the same treatment (i.e., chemical and dosage) and time as the query observation. We exclude from the queries observations from the last observed time of each treatment because we cannot interpolate pseudo-observations at these times when they are removed from the database series. We reconstruct hourly pseudo-observations for every treatment, using the different methods of interpolation. We search the reconstructed database for the pseudo-observation that is most like the query. We predict the query's treatment and time to be the same as this nearest neighbor. Notice that by excluding replicates of the query from the database, we are forcing our classifier to use interpolation in order to find the correct answer. We wish to know how accurately we are able to (i) identify the treatment from which each point was extracted, and (ii) align each query point to its actual time in the time series for the treatment. We refer to the former as

We note that this task is only a surrogate for the actual task with which we are concerned—classifying uncharacterized chemicals and aligning them with the most similar treatment in the database. It is a useful surrogate, however, because it is a task in which we know the most similar treatment and the correct alignment of the query to this treatment.

The metric we use to measure distance between the query observation and the database pseudo-observation being considered is a scale-independent Euclidean distance. The expression values of each database observation are all multiplied by a scalar, which is chosen via a least-squares method in order to minimize its distance to the query observation.
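This distance and the nearest-neighbor search that uses it can be sketched in Python as follows. The least-squares scalar has the closed form (d · q) / (d · d) for a database vector d and a query vector q. The function names and the record layout (treatment, time, expression vector) are assumptions made for illustration.

```python
import math


def scaled_distance(db_obs, query_obs):
    """Euclidean distance after multiplying db_obs by the scalar that
    minimizes the squared distance to query_obs.  Returns (distance, scalar)."""
    dot_dq = sum(d * q for d, q in zip(db_obs, query_obs))
    dot_dd = sum(d * d for d in db_obs)
    scale = dot_dq / dot_dd if dot_dd > 0 else 0.0
    distance = math.sqrt(sum((scale * d - q) ** 2
                             for d, q in zip(db_obs, query_obs)))
    return distance, scale


def nearest_pseudo_observation(query_obs, database):
    """database: list of (treatment, time_in_hours, expression_vector) records.
    Returns the record whose scaled vector lies closest to the query."""
    return min(database, key=lambda rec: scaled_distance(rec[2], query_obs)[0])


# Hypothetical three-gene example.
toy_database = [("A", 6, [1.0, 2.0, 3.0]), ("B", 12, [3.0, 2.0, 1.0])]
match = nearest_pseudo_observation([2.0, 4.0, 6.0], toy_database)
```

In the example, the query is exactly twice the first database vector, so its scaled distance is zero and treatment "A" at 6 hours is returned as the predicted treatment and time.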

We consider seven different interpolation methods in all. We look at both

There are several advantages to using the observed times as the knots for our interpolating splines. First, it allows easy comparison to the basic linear interpolation control. Second, we assume that the data was taken at those times because interesting behavior was anticipated. Using them as knots allows our splines more flexibility there. Third, it keeps the linear equations from being rank-deficient as explained earlier. With uniformly spaced knots (as used by Bar-Joseph et al.

The results of this experiment are shown in ^{2} test. Highlighted points are those deemed significant, with

All replicates of the observation tested are purged from the database. The top line shows classification accuracy, in which the correct treatment is chosen. The bottom lines show alignment accuracy, where the predicted time is within 24 and 12 hours respectively of the actual time. Highlighted points are significantly different from the linear case (χ² test).

Based on these results, we restrict our attention to smoothing splines in subsequent experiments.

We now turn our attention to evaluating our multisegment time series alignment algorithm. For all of the experiments reported in this section, we set the parameters of this method as follows. We set the probability that the model has one, two, or three segments at _{e}_{s}_{a}^{−1}. Thus the three main components of the model have roughly similar influence.

We assemble queries by randomly subsampling time series in our data set. We assemble ten such queries from each treatment. We build each query by first selecting the number of observations in it, then choosing which time points will be represented, and finally picking an observation for each of these time points. The query sizes are chosen from a uniform distribution that ranges from one up to the number of observed times in the given treatment. The maximum size of a query is eight, although most consist of four or fewer observations. The time points are chosen uniformly as are the observations for each chosen time.
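The query assembly procedure can be sketched in Python as follows. The data layout (a dictionary from time point to a list of replicate observations) and the function name are assumptions, as is the choice to sample time points without replacement.

```python
import random


def assemble_query(observations_by_time, rng):
    """Build one query from a treatment's observations.
    observations_by_time maps a time (hours) to its replicate observations."""
    times = sorted(observations_by_time)
    size = rng.randint(1, len(times))               # uniform over 1..number of times
    chosen_times = sorted(rng.sample(times, size))  # distinct time points
    return [(t, rng.choice(observations_by_time[t])) for t in chosen_times]


# Hypothetical treatment with three observed times and a few replicates each.
treatment = {6: ["rep_a", "rep_b"], 24: ["rep_c"], 48: ["rep_d", "rep_e", "rep_f"]}
query = assemble_query(treatment, random.Random(0))
```

Passing an explicit random number generator makes the subsampling reproducible across runs.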

We then classify and align the query using all the other observations as the database. We preprocess both the query and the eleven database treatments using smoothing splines to reconstruct pseudo-observations at every four hours (starting at time zero, when all expression values are at the basal level). As before, we use the highest interpolation order possible in cases where there are too few observations for the prescribed one. We then align the query against all eleven treatments using our method. We return the database treatment with the highest scoring alignment, as defined by Equation 14. Because the alignment also maps each query time to a database treatment time, we can find the temporal error for any query time point. We thus calculate the average temporal error for the times in the original query in order to assess alignment error.

We consider several other alignment methods as baselines. We term the first baseline

The second control is traditional Euclidean dynamic time warping _{i}_{j}_{i}_{j}

This makes it easy to compare warpings to different treatments, where one or the other dimension has been shorted.

Another control we consider is linear parametric warping. This is similar to the method explored by Bar-Joseph et al.

Finally, we consider

The results of these experiments are shown in ^{2} test. Likewise, the large square indicates a significant difference from the three-segment generative model.

The figure shows results both when there is no temporal distortion (A) and when there is (B). The top lines represent treatment classification accuracy, while the bottom two lines add the criterion that the predicted times are within 24 and 12 hours respectively of the actual time, on average. Small highlights represent cases in which there is a significant difference in accuracy from the corresponding one-segment generative case (χ² test), while the larger highlights show a significant difference from the three-segment model.

The one-segment and three-segment models are significantly different from each other in only a handful of cases. Because we have added no distortion to the queries, the one-segment model should be sufficient to explain them. We might expect to see some degradation when using the three-segment model, as it is allowed much more freedom in where it places its segments. However, this is not the case; the three-segment model results in slightly higher accuracies. One explanation for this result is that the spline preprocessing does not create perfect reconstructions of the missing data, and the more expressive three-segment model is better at compensating for this error. Of the control methods, only COW is competitive with our generative method. There is no significant difference between its accuracy and that of our method. Euclidean dynamic time warping classifies fewer queries correctly than our method, although the queries it does classify correctly tend to be aligned correctly. This is probably because it has a strong bias toward performing little warping.

To better test the utility of the multisegment model, we next consider distorting the query time series temporally. We use three different distortions. The first one doubles all times in the first 48 hours (i.e., it stretches the first part of the series), and then halves all times (plus an offset for the doubling) for the next 24 hours. The second distortion halves all times for the first 36 hours and then doubles them for the next 60 hours. The third one triples all times for the first 60 hours and then divides them by three for another 20 hours. It should be noted that not all the treatment observations extend this long in time. The short ones (e.g., those for which we only have measurements up to 24 or 48 hours) will thus not be distorted as much as the long ones.
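These distortions can be written as piecewise-linear maps of time, as in the Python sketch below. The sketch assumes that the stated durations are measured on the original time axis and that times beyond the last distorted piece advance at their original rate; neither detail is stated above, and the names are hypothetical.

```python
def distort_time(t, pieces):
    """Map an original time t (hours) to a distorted time.
    pieces is a list of (duration_in_hours, factor) applied in order."""
    warped = 0.0
    remaining = float(t)
    for duration, factor in pieces:
        step = min(remaining, duration)
        warped += step * factor
        remaining -= step
        if remaining <= 0:
            return warped
    # Assumption: beyond the last piece, time advances at its original rate.
    return warped + remaining


DISTORTIONS = {
    "first": [(48, 2.0), (24, 0.5)],         # double for 48 h, then halve for 24 h
    "second": [(36, 0.5), (60, 2.0)],        # halve for 36 h, then double for 60 h
    "third": [(60, 3.0), (20, 1.0 / 3.0)],   # triple for 60 h, then a third for 20 h
}

# An observation at 72 hours under the first distortion: 48 * 2 + 24 * 0.5 hours.
warped_72 = distort_time(72, DISTORTIONS["first"])
```

Accumulating the warped time piece by piece supplies the offset mentioned above automatically, so each piece continues from where the previous one ended.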

Aside from the distortion, we perform the same experiment as before. We show the results in

One concern is that by adding distortion we could be changing the best classification of a given treatment. For example, maybe we would distort 10 µg/kg of TCDD in exactly the right way to make it look like 64 µg/kg. To address this concern, we have performed similar distortion experiments in which we align a distorted query series only to the database series that was used to generate it. The results of this experiment are qualitatively the same as those reported in

We conduct further experiments to evaluate the importance of the stretching and amplitude components of our model. First, we conduct an experiment in which we effectively remove the amplitude component of our model by fixing the value of _{i}_{a}_{i}_{a}_{i}_{s}

The panels show distortion not present (A) and present (B). The first model is the three-segment generative model as before. The second disallows any amplitude changes at all, while the third allows any amplitude coefficient with no penalty to the score. Likewise, the fourth disallows stretching and the fifth allows any stretching without penalty. Highlights indicate a significant difference from the unaltered three-segment model (χ² test).

Completely disallowing either stretching or amplitude changes has an overall deleterious effect on the accuracy of the alignments. However, there seems to be little negative effect in allowing stretching and amplitude changes without penalizing larger values. These results imply that the stretching and amplitude components of the model are valuable, but that the accuracy of the alignments is relatively insensitive to the actual penalties selected.

We next consider a set of experiments in which we assess the accuracy of computed alignments as a function of the amount of data in the query. We restrict our experiments to a single treatment (41 observations of 1 µg/kg TCDD at eight time points), although other treatments yielded qualitatively similar results. We randomly pick out

We expect the alignment error to generally decrease as we increase the query size. We also expect the one-segment method to perform slightly better when there is no distortion, and the three-segment method to be preferable when there is. However, this latter behavior could be confounded for small query sizes, where the three-segment model may not have enough data to determine the segment parameters.

The results when we interpolate with third-order splines are shown in

The results shown in (A) have no temporal distortion, while those shown in (B) do. The dotted line represents the one-segment model, and the solid line represents the three-segment model, using third-order smoothing splines. Cases in which the two have significantly different results (

We next consider the sensitivity of the accuracy of the multisegment method to the number of segments it is allowed to use in its alignments. We would like to know to what extent the alignment accuracy degrades as the method is allowed to use more segments than the optimal alignment requires. We conduct an experiment in which we vary the number of segments from one to five, with query sizes of only one, four, and eight. The results of this experiment are shown in

As before, the results in (A) have no temporal distortion while those in (B) do. From top to bottom, the lines of each panel show queries of size one, four, and eight, using third-order smoothing splines. Lines are highlighted in cases where adding a segment to the model makes a significant difference (

Again, we see that in the data-rich situation, the best models are those that closely approximate the number of segments needed to simulate the temporal distortion (or lack thereof) applied to the query. In data-poor situations, the alignments of the one-segment method are as accurate as multisegment alignments. Significantly, the accuracy of the multisegment method is quite robust when it is allowed to use more segments than necessary. This is important, as in practice we will not generally know the correct number of segments in order to find the best alignment of a query and its best matching series in the database.

Finally, we consider calculating the alignments for four treatments that we know are closely related.

The boxed numbers on each segment represent the amplitude coefficient by which the expression levels of the 10 µg/kg segment are best multiplied in order to obtain the corresponding expression levels for the other treatment.

These alignments illustrate several interesting phenomena. First, they indicate that the overall amplitude of the response increases along with the dose. This effect is illustrated by the boxed numbers on the segments in

We have presented an approach for answering similarity queries among gene expression time series, and aligning those queries in time. Our approach employs spline models to interpolate sparse time series, and a novel method for time warping. We have investigated our approach in the context of a toxicogenomics application in which we would like to know which treatments in a database of well characterized chemicals are most similar to a given query treatment.

The work we have presented features several novel aspects and contributions.

We have introduced a novel,

To account for the fact that we have sparse time series, we have investigated the use of a variant of B-splines we refer to as

We have empirically shown that our smoothing splines result in more accurate alignments than both conventional

We have empirically demonstrated that our generative alignment method generally produces more accurate alignments and treatment classifications than other commonly used alignment methods, including conventional dynamic time warping, linear parametric warping, and correlation optimized warping.

There are several avenues of future work we plan to pursue. One is to address the time complexity of our multisegment algorithm, which is ^{5}), where ^{2}). When the calculations are restricted to the so-called

In addition, we have made two independence assumptions that we plan to revisit in future research. First, we have assumed that each gene is independent of all the others given the model. We expect that representing some gene dependencies would lead to more accurate classifications and alignments. Second, we have assumed that the measurements at each time point are independent of those at every other time point. We plan to investigate a Markov-model-like approach that represents dependencies between neighboring time points.