DTW-MIC Coexpression Networks from Time-Course Data

When modeling coexpression networks from high-throughput time course data, Pearson Correlation Coefficient (PCC) is one of the most effective and popular similarity functions. However, its reliability is limited since it cannot capture non-linear interactions and time shifts. Here we propose to overcome these two issues by employing a novel similarity function, Dynamic Time Warping Maximal Information Coefficient (DTW-MIC), combining a measure taking care of functional interactions of signals (MIC) and a measure identifying time lag (DTW). By using the Hamming-Ipsen-Mikhailov (HIM) metric to quantify network differences, the effectiveness of the DTW-MIC approach is demonstrated on a set of four synthetic and one transcriptomic datasets, also in comparison to TimeDelay ARACNE and Transfer Entropy.


Introduction
Inferring a biological graph (e.g., a Gene Regulatory Network) from high-throughput logitudinal measurements of its nodes is nowadays one of the major challenges in computational biology, and several are the proposed solutions to this still unanswered question [De Smet and Marchal, 2010;Marbach et al., 2010].Although the presence of many indirect interactions set the problem in a strongly non-linear domain, simple basic approaches such as the coexpression networks via correlation measures proved to work reasonably well [Allen et al., 2012].However, Pearsons correlation lacks sensitivity in case of non-linear relations between signals and when signals are mutually delayed, thus the reliability of a coexepression network would benefit from being built by a measure taking care of these issues.A first example in this direction can be found in [ElBakry et al., 2010], where the authors envisioned an ad-hoc technique to cope with the time lag.
Here we follow a different approach introducing a novel similarity measure DTWMIC, whose components are the Dynamic Time Warping (DTW, [Sakoe and Chiba, 1978]) and the Maximal Information Coefficient (MIC, [Reshef et al., 2011]).While DTW can take care of the time-shifts, the MIC metric can detect functional non-linear interactions between the signals.Moreover, the presence of MIC alleviates the limitations affecting DTW when signals, although related, present different levels of expression.To support the claim of better reconstruction performances (in terms of smaller HIM distances from the gold standard), we use three synthetic datasets generated by GeneNetWeaver (GNW, [Schaffter et al., 2011]) following the guidelines of the DREAM Challenge [Prill et al., 2010], and a transcriptomic dataset on the expression response of human T cells to PMA and ionomicin treatment [Rangel et al., 2004].

Coexpression Networks
Correlation methods like WGCNA [Zhang and Horvath, 2005] represent the most direct approach to the exploration of the gene co-expression network.The adjacency matrix is defined by computing a correlation function (absolute Pearson for WGCNA) between all pairs of gene signals G i , G j , soft thresholded (for instance by a power function) to determine the biological meaningfulness of the connections: These co-expression-based methods have been used in several studies and have shown their usefulness in interpreting biological results and identifying important gene modules.Therefore, we will use WGCNA as a reference method for the correlation-based approach.
To take care of non-linear interactions and time shifts in the gene signals, we propose to employ the DTWMIC function: In all experiments we set β = 6 (as in the WGNCA approach) for both Pearson and the DTW-MIC measures.
Comparison with gold standard network is quantitatively assessed in terms of HIM network distance.Details on DTW, MIC and HIM are provided hereafter.

Maximal Information Coefficient
The Maximal Information Coefficient (MIC) measure is a component of the Maximal Information-based Nonparametric Exploration (MINE) family of statistics, introduced in [Reshef et al., 2011;Speed, 2011], for the exploration of two-variable relationships in multidimensional data sets.having the generality and equitability property.The MIC value is obtained by builiding several grids at different resolutions on the scatterplot of the two variables and computing the largest possible mutual information achievable by any grid applied to the data, and then normalizing to the [0, 1] range.
The two distinctive features of MIC are generality, i.e., the ability of capturing variable relationships of different nature and equitability, that is the property of penalizing similar levels of noise in the same way, regardless of the nature of the relation between the variables.MIC can be computed in R by using the minerva package [Albanese et al., 2012].

Dynamic Time Warping
The Dynamic Time Warping (DTW) [Keogh and Pazzani, 1998;Keogh and Pazzani, 2000] is a measure of distance between two sequences considering occurring time shifts between the series.Thus it proves more suitable than the Euclidean metric in curve comparison because it takes into account the shapes of the curves instead of just evaluating the pointwise distance of the vectors.
The DTW algorithm also finds an optimal match between the two given series by non-linearly warping them in the time dimension to determine a measure of their dissimilarity, stretching (or compressing) the time axis.As a comprehensive reference, the reader is referred to [Gusfield, 1997].
DTW elective application is the comparison of different speech patterns in automatic speech recognition [Sakoe and Chiba, 1978], but it has been also used in functional genomics [Aach and Church, 2001;Furlanello et al., 2006].To obtain a similarity measure, we use the function DTW s = 1/(1 + DTW d ), where DTW d is the normalized distance between two series, as computed in the R package dtw [Giorgino, 2009].

Hamming Ipsen Mikhailov Distance
The HIM distance for network comparison is defined as the product metric of the Hamming distance H [Tun et al., 2006;Dougherty, 2010] and the Ipsen-Mikhailov distance IM [Ipsen and Mikhailov, 2002], normalized by the factor √ 2 to set its upper bound to 1: for N 1 , N 2 two undirected (possibly weighted) networks.
The drawback of edit distances (such as H) is locality, as it focuses only on the network parts that are different in terms of presence or absence of matching links [Jurman et al., 2011].Spectral distances like IM are global, since they take into account the whole graph structure, but they cannot distinguish isomorphic or isospectral graphs, which can correspond to quite different conditions within the biological context.
Thus the HIM distance is a possible solution for both issues: details about HIM and its two components H and IM are given in [Jurman et al., 2012].

GeneNetWeaver data
GNW allows the construction of DREAM-like networks and corresponding simulated expressions (steady states and time course) generated as subnetworks of the E. coli [Gama-Castro et al., 2008] or the Yeast [Balaji et al., 2006] transcriptional regulatory network, with the possibility of choosing a number of parameters both for the subnetwork and the data structure.In particular, we generated three synthetic networks Yeast 20 , Ecoli 20 , Ecoli 50 , where the subscript indicates the number of nodes of the the net, extracted randomly from the whole set of nodes.In each case, half of the selected nodes are regulators.
For each datasets, we generated 10 replicates of logitudinal datasets, ccreated by a dynamic model mixing ordinary and stochastic differential equations, with 41 time points equally spaced between time 0 and time 1000 and affected by 0.5% (Yeast) or 1% noise (E.coli).In each dataset, the first half of the time series shows the response of the network to a perturbation (at t=0 is the wild-type steadystate), then the perturbation is removed and the second half of the time series shows how the gene expression levels go back from the perturbed to the wild-type state.(this is the DREAM4 setup for the expression time course in GNW).
For each experiment, we built the corresponding Pearson and DTWMIC coexpression networks, and we computed the HIM distance of both the inferred networks from the GNW gold standard.The results are reported in Table 1 and they show that the DTWMIC networks are consistently closer to the gold standard than the corresponding Pearson WGCNA graph.
For the Yeast 20 dataset, we further generated 4 time course datasets of the same length as above, but with a dual gene knockout.Also in this configuration, the DTWMIC inferred networks were closer to the gold standard: in the 4 experiments, their HIM distances were 0.23, 0.21, 0.24 and 0.21, while for the Pearson correlation networks the corresponding values were 0.30, 0.27, 0.31 and 0.47 respectively.

# exp
Yeast  The gold standard has unweighted links, and the node color is red for degree > 10, orange for 5 < degree < 10 and yellow for node degree < 5.For the Person correlation network we show only the links whose weight is larger than 0.05; nodes are red when degree > 20, orange for 10 < degree < 20 and yellow for degree < 10.For the DTWMIC network we show only the links whose weight is larger than 0.15; node colors are the same as in the Pearson net.

T-cell data
In the paper [Rangel et al., 2004], the authors investigated the response of a human T-cell line (Jirkat) to a treatment with PMA and ioconomin by measuring the expression of 58 genes across 10 time points (0,2,4,6,8,18,24,32,48, and 72 hours after treatment) with 34 replicates (data available in the R package longitudinal).Opgen-Rhein and Strimmer in [Opgen-Rhein and Strimmer, 2006b;Opgen-Rhein and Strimmer, 2006a] constructed the corresponding network by shrinkage estimation of the (partial) dynamical correlation, which we consider here as the ground truth network, displayed in Figure 1(a).We then built, starting from the same dataset, the correlation networks inferred by Pearson and DTWMIC, plotted respectively in Figure 1, panels (b) and (c) respectively.Again, the DTWMIC network results closer to the gold standard than the Pearson net, with a HIM value of 0.164 versus 0.214.
Moreover, it is worthwhile noting what happens considering separately the two components of the HIM distance in this case: while the Ipsen Mikhailov distance is still smaller for the DTWMIC network (IM=0.203vs. IM=0.296),the Hamming distance is larger (H=0.112 vs. H=0.06).This yields that there is a smaller number of links changing between the Pearson coexpression network and the gold standard, but these changing links induce a strongly different structure between the two graphs.

Conclusion
We introduced here DTWMIC, a novel measure for inferring coexpression networks from longitudinal data as an alternative to the absolute Pearson correlation used in the WGCNA approach.Due to the nature of its components Dynamic Time Warping and Maximal Information Coefficient, the DTWMIC similarity can overcome the well known limitations of the Pearson correlation when dealing with horizontally displaced signals and indirect interactions.Experiments on biologically inspired synthetic data and gene expression time course show the higher precision in the network inference achieved by DTWMIC with respect to the Pearson correlation in different conditions.

Figure 1 :
Figure 1: The T-cell network: gold standard (a), Pearson correlation network (b) and DTWMIC coexpression network (c).The gold standard has unweighted links, and the node color is red for degree > 10, orange for 5 < degree < 10 and yellow for node degree < 5.For the Person correlation network we show only the links whose weight is larger than 0.05; nodes are red when degree > 20, orange for 10 < degree < 20 and yellow for degree < 10.For the DTWMIC network we show only the links whose weight is larger than 0.15; node colors are the same as in the Pearson net.

Table 1 :
HIM distances of the DTWMIC (D) and the WGCNA Pearson (P) networks for all experiments on the GNW datasets Yeast 20 , Ecoli 20 , Ecoli 50