
The authors have declared that no competing interests exist.

Conceived and designed the experiments: JS ADG PCET BEH KALC. Analyzed the data: JS. Wrote the paper: JS BEH KALC. Designed and developed the statistical methodology: KALC BEH ADG JS. Implemented the statistical software: JS.

¶ The complete membership of the author group can be found in the Acknowledgments section.

Time course ‘omics’ experiments are becoming increasingly important to study system-wide dynamic regulation. Despite their high information content, analysis remains challenging. ‘Omics’ technologies capture quantitative measurements on tens of thousands of molecules. Therefore, in a time course ‘omics’ experiment molecules are measured for multiple subjects over multiple time points. This results in a large, high-dimensional dataset, which requires computationally efficient approaches for statistical analysis. Moreover, methods need to be able to handle missing values and various levels of noise. We present a novel, robust and powerful framework to analyze time course ‘omics’ data that consists of three stages: quality assessment and filtering, profile modelling, and analysis. The first step consists of removing molecules for which expression or abundance is highly variable over time. The second step models each molecular expression profile in a linear mixed model framework which takes into account subject-specific variability. The best model is selected through a serial model selection approach and results in dimension reduction of the time course data. The final step includes two types of analysis of the modelled trajectories, namely, clustering analysis to identify groups of correlated profiles over time, and differential expression analysis to identify profiles which differ over time and/or between treatment groups. Through simulation studies we demonstrate the high sensitivity and specificity of our approach for differential expression analysis. We then illustrate how our framework can bring novel insights on two time course ‘omics’ studies in breast cancer and kidney rejection. The methods are publicly available, implemented in the R CRAN package

Over the past decade, the use of ‘omics’ to take a snapshot of molecular behaviour has become ubiquitous. It has recently become possible to examine a series of such snapshots by measuring an ‘ome’ over time. This provides a powerful tool to study stressor-induced molecular behaviour [

Robust and powerful analysis tools are critical for capitalizing on the wealth of data to answer key questions about system response and function. In addition to addressing the high-dimensionality of the data, such tools must account for a high number of missing values, and also variability within and between studied subjects. Many methods are limited by scale, and are unable to handle either a large number of time points, a varying number of time points per subject [

The benefit of decreasing the number of profiles analyzed via filtering is evident when considering the scale of typical time course ‘omics’ experiments. Tens of thousands of molecules can be measured at different time points, requiring multiple hypothesis tests to determine differential expression. While the false positive rate can be controlled using multiple testing corrections (e.g., FDR; [

A popular modelling approach for time course data is smoothing splines, which use a piecewise polynomial function with a penalty term [

After the filtering and modelling steps, the resulting summarized profiles can be clustered to gain biological insight from their similarities. Indeed, clusters of correlated activity patterns may predict putative functions for molecules and reveal stage- and tissue-specific regulators [

Hypothesis testing can also be performed within the mixed effect model framework to gain biological insight from differences between groups and across time. Several methods have been proposed which can all handle missing data and different numbers of replicates per time point, but are often limited when only a few time points are observed, as is typically the case for costly high-throughput experiments. Approaches such as linear models for microarray data (LIMMA; [

In this paper we propose a novel framework for time course ‘omics’ studies which is summarized in

The proposed framework consists of three stages: quality control and filtering; serial modelling of profiles; and analysis with clustering to identify similarities between profiles or with hypothesis testing to identify differences over time, between groups, and/or in group and time interactions.

We first applied the filtering and modelling stages of our framework to two publicly available transcriptomics datasets, which are briefly described below. The main analyses and biological interpretations were then performed on two proteomics datasets from breast cancer and kidney rejection studies.

The evolutionary principles of modular gene regulation in yeast were investigated by [

The anti-tumour efficiency of a chemotherapeutic drug on bone marrow in mice was investigated by [

Proteomic changes in MCF-7 cells resulting from insulin-like growth factor 1 (IGF-1) stimulation were investigated by [. The $\log_2$ fold changes for time points 6, 12 and 24 h relative to baseline (0 h) were reported for 264 proteins with a minimum of two measured replicates. We applied our full data-driven modelling approach to this dataset, finishing with cluster analysis to explore patterns of protein response to IGF-1 stimulation.

The PROOF Centre of Excellence performed a longitudinal study to identify diagnostic biomarkers in blood plasma to predict acute renal allograft rejection [

For each of six different scenarios varying noise levels and fold changes, we simulated 100 datasets, each consisting of 140 profiles, 50 of which were differentially expressed. For each dataset, we applied our differential expression approach and LIMMA [

Filtering on the overall standard deviation of molecule expression is a common approach in static gene expression experiments to remove non-informative molecules prior to analysis [

Rather than the overall standard deviation, defined below as $s_M$, we considered two filter ratios based on the standard deviations across time and subjects. These estimates can be used to identify low quality and/or non-informative profiles. Let $y_i(t)$ be the expression of a given molecule for subject $i = 1, \ldots, n_I$ at time point $t = 1, \ldots, n_T$. We denote by $s_T$ the average of standard deviations (SD) computed per time point, with

$$s_T = \frac{1}{n_T} \sum_{t=1}^{n_T} \mathrm{SD}_i\{y_i(t)\},$$

$s_I$ is the average of SDs computed per subject, with

$$s_I = \frac{1}{n_I} \sum_{i=1}^{n_I} \mathrm{SD}_t\{y_i(t)\},$$

and $s_M$ is the SD for each molecule, over all subjects and time points:

$$s_M = \mathrm{SD}_{i,t}\{y_i(t)\}.$$

We define the filter ratios $R_T$ and $R_I$ as

$$R_T = \frac{s_T}{s_M}, \qquad R_I = \frac{s_I}{s_M}.$$

Consider first the ratio $R_T$ for quality control. The first type of profile consists purely of noise, resulting in $s_T \approx s_M$ and therefore $R_T \approx 1$. The second type of profile has a true signal over time, resulting in $s_M$ greater than $s_T$ and $R_T < 1$. Hence, $R_T$ provides one means of discriminating between non-informative and informative profiles. We generally expect subject-specific profiles to be close to the mean molecule profile, resulting in $s_I \approx s_M$ and $R_I \approx 1$, as would also be true for noisy profiles over time. Therefore, on its own, $R_I$ is only a good discriminator of unambiguously flat profiles, for which $s_I$ may often be smaller than $s_M$, resulting in $R_I < 1$. Nevertheless, the combination of both $R_T$ and $R_I$ can provide additional insights into the variance structure of the molecules and can guide the user to make more informed choices about filter ratio thresholds, as illustrated in our case studies.
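The two filter ratios can be illustrated with a minimal Python sketch. It assumes the ratios are the per-time-point and per-subject average SDs divided by the overall SD of the molecule; the function name `filter_ratios`, the nested-dictionary layout and the toy numbers are all hypothetical and are not taken from the authors' R package.

```python
from statistics import mean, stdev


def filter_ratios(profile):
    """Return (R_T, R_I) for one molecule.

    profile maps subject -> {time point: value}; None marks a missing value.
    """
    by_time, by_subject, observed = {}, {}, []
    for subject, series in profile.items():
        for time_point, value in series.items():
            if value is None:  # skip missing measurements
                continue
            by_time.setdefault(time_point, []).append(value)
            by_subject.setdefault(subject, []).append(value)
            observed.append(value)

    # An SD needs at least two observations, so sparser groups are skipped.
    s_t = mean(stdev(v) for v in by_time.values() if len(v) > 1)
    s_i = mean(stdev(v) for v in by_subject.values() if len(v) > 1)
    s_m = stdev(observed)
    return s_t / s_m, s_i / s_m


# Toy profiles (made-up numbers): one with a clear trend, one that is pure noise.
trend = {
    "a": {0: 0.0, 1: 1.1, 2: 1.9, 3: 3.0},
    "b": {0: 0.1, 1: 1.0, 2: 2.1, 3: 2.9},
    "c": {0: -0.1, 1: 0.9, 2: 2.0, 3: 3.1},
}
noise = {
    "a": {0: 0.3, 1: -0.5, 2: 0.8, 3: -0.2},
    "b": {0: -0.6, 1: 0.4, 2: -0.1, 3: 0.7},
    "c": {0: 0.5, 1: 0.1, 2: -0.7, 3: -0.3},
}
print(filter_ratios(trend))  # R_T well below 1
print(filter_ratios(noise))  # R_T close to or above 1
```

Skipping time points or subjects with fewer than two observations is one possible way to cope with missing values; other choices are equally defensible.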

Profiles changing over time (blue) have a mean of the standard deviations per time point ($s_T$) smaller than the standard deviation per molecule ($s_M$), while these quantities have similar values for noisy molecules (brown). In both cases the mean of the standard deviations per subject ($s_I$) is similar to $s_M$.

During our filtering stage, we first removed molecules with more than 50% missing data and then applied model-based clustering (using an R package) to the filter ratios $R_T$ and $R_I$, specifying two clusters. Based on the rationale described above, we expect the cluster of profiles with low $R_T$ and $R_I$ to be informative and propose to discard profiles in the cluster with high $R_T$ and $R_I$. In the specific case where a time course study includes the comparison of multiple conditions or treatments, it is important to avoid filtering profiles which may be non-informative within a condition but are differentially expressed between conditions. Therefore, we propose to apply the filtering approach to each condition separately, with the additional requirement that profiles must be found non-informative in all conditions in order to be removed.
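The multi-condition rule can be sketched as follows. Note the swapped-in technique: the text uses model-based clustering, whereas this sketch uses plain k-means with two clusters as a simple stand-in, so its cluster boundaries will generally differ. The function names and the toy ratios are hypothetical.

```python
from math import dist


def two_means(points, iterations=50):
    """Label 2-D points 0 (low-ratio cluster) or 1 (high-ratio cluster), k-means with k = 2."""
    # Deterministic start: the points with the smallest and largest coordinate sums.
    centres = [min(points, key=sum), max(points, key=sum)]
    labels = []
    for _ in range(iterations):
        labels = [0 if dist(p, centres[0]) <= dist(p, centres[1]) else 1 for p in points]
        for k in (0, 1):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centres[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    if sum(centres[0]) > sum(centres[1]):  # make sure label 1 is the high-ratio cluster
        labels = [1 - label for label in labels]
    return labels


def non_informative_everywhere(ratios_by_condition):
    """Return molecules that fall in the high-ratio cluster in every condition.

    ratios_by_condition maps condition -> {molecule: (R_T, R_I)}.
    """
    flagged = None
    for ratios in ratios_by_condition.values():
        names = list(ratios)
        labels = two_means([ratios[name] for name in names])
        high = {name for name, label in zip(names, labels) if label == 1}
        flagged = high if flagged is None else flagged & high
    return flagged or set()


# Made-up ratios for four molecules in two conditions.
example = {
    "treated": {"m1": (0.20, 0.30), "m2": (0.25, 0.35), "m3": (1.00, 1.00), "m4": (0.95, 1.05)},
    "control": {"m1": (0.90, 1.00), "m2": (0.20, 0.30), "m3": (1.00, 0.95), "m4": (0.30, 0.20)},
}
print(non_informative_everywhere(example))  # only m3 is non-informative in both conditions
```

Taking the intersection across conditions is what protects a molecule such as `m1`, which looks noisy in one condition but informative in the other.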

In high-throughput experiments, thousands of molecule profiles need to be modelled in an efficient manner. Biological variability both between and within subjects must be accounted for, and experimental procedures typically result in different numbers of replicated measurements per molecule and time point. The combination of all of these factors requires a flexible, robust model-fitting procedure which can easily accommodate different sources of variation.

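The first model assumes the response is a straight line and is not affected by subject variation. For each molecule, we denote by $y_{ij}(t_{ij})$ its expression for subject (or biological replicate) $i$ at time $t_{ij}$, where $i = 1, \ldots, n$, $j = 1, \ldots, n_i$, and $n_i$ is the number of observations for subject $i$. We fit a simple linear regression of expression $y_{ij}(t_{ij})$ on time $t_{ij}$, where the intercept $\beta_0$ and slope $\beta_1$ are estimated via ordinary least squares:

$$y_{ij}(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \varepsilon_{ij},$$

where $\varepsilon_{ij}$ is a random error term.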
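As a minimal sketch of this first model, the closed-form ordinary least squares estimates can be computed by pooling the observations of all subjects, since no subject term enters the model. The function name `fit_line` and the example values are hypothetical.

```python
def fit_line(times, values):
    """Ordinary least squares fit of values on times; returns (intercept, slope).

    Observations from all subjects are pooled; None marks a missing value.
    """
    pairs = [(t, y) for t, y in zip(times, values) if y is not None]
    n = len(pairs)
    t_bar = sum(t for t, _ in pairs) / n
    y_bar = sum(y for _, y in pairs) / n
    s_tt = sum((t - t_bar) ** 2 for t, _ in pairs)
    s_ty = sum((t - t_bar) * (y - y_bar) for t, y in pairs)
    slope = s_ty / s_tt
    return y_bar - slope * t_bar, slope


# Two subjects measured at 0, 6, 12 and 24 h (made-up values, one missing).
times = [0, 6, 12, 24, 0, 6, 12, 24]
values = [1.0, 4.0, 7.0, 13.0, 1.0, None, 7.0, 13.0]
intercept, slope = fit_line(times, values)
print(intercept, slope)
```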

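As nonlinear response patterns are commonly encountered in time course biological data, our second model represents the mean response by a spline with knots $\kappa_1, \ldots, \kappa_K$ in the range of $\{t_{ij}\}$, some unknown coefficients $u_k$ to be estimated, an intercept $\beta_0$ and a slope $\beta_1$. That is,

$$y_{ij}(t_{ij}) = f(t_{ij}) + \varepsilon_{ij}, \qquad f(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \sum_{k=1}^{K} u_k (t_{ij} - \kappa_k)_+,$$

where $(x)_+ = \max(0, x)$, and we place the knots $\kappa_1, \ldots, \kappa_K$ at quantiles of the time interval of interest.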
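The spline construction can be sketched in Python under two assumptions that the surviving text does not confirm: that the basis is a truncated line basis, and that the knots sit at equally spaced quantiles of the distinct time points. The helper names and the number of knots are hypothetical.

```python
from statistics import quantiles


def knots_at_quantiles(times, n_knots):
    """Place n_knots knots at equally spaced quantiles of the distinct time points."""
    distinct = sorted(set(times))
    return quantiles(distinct, n=n_knots + 1, method="inclusive")


def truncated_line_row(t, knots):
    """Design-matrix row [1, t, (t - k_1)+, ..., (t - k_K)+] for one observation."""
    return [1.0, t] + [max(0.0, t - k) for k in knots]


def spline_value(t, beta0, beta1, u, knots):
    """Evaluate beta0 + beta1 * t + sum_k u_k * (t - k_k)+ at time t."""
    row = truncated_line_row(t, knots)
    coefficients = [beta0, beta1] + list(u)
    return sum(c * x for c, x in zip(coefficients, row))


times = [0, 6, 12, 24, 0, 6, 12, 24]
knots = knots_at_quantiles(times, n_knots=1)
print(knots)
print(truncated_line_row(12, knots))
print(spline_value(12, beta0=1.0, beta1=0.5, u=[2.0], knots=knots))
```

Each truncated term is zero before its knot and linear after it, so every coefficient changes the slope of the fitted curve from that knot onwards.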

In order to account for subject variation, our third model adds a subject-specific random intercept $U_i$ to the mean response $f(t_{ij})$. Assuming $f(t_{ij})$ to be a fixed (yet unknown) population curve, $U_i$ is treated as a random realization from an underlying Gaussian distribution independent from the previously defined random error term $\varepsilon_{ij}$. Hence, the subject-specific curves are expected to be parallel to the mean curve as we assume the subject-specific random effects to be constant over time:

$$y_{ij}(t_{ij}) = f(t_{ij}) + U_i + \varepsilon_{ij}.$$

A simple extension to this model is to assume that the subject-specific deviations are straight lines. Our fourth model therefore fits subject-specific random intercepts $a_{i0}$ and slopes $a_{i1}$:

$$y_{ij}(t_{ij}) = f(t_{ij}) + a_{i0} + a_{i1} t_{ij} + \varepsilon_{ij}.$$

Clustering of time profiles allows insight into which molecules share similar patterns of response, which may in turn indicate a shared biological basis. Similarities between trajectories may be seen not only in terms of shape and magnitude, but also rates of change, or speed. However, detecting these similarities can be challenging due to noise and missing values in subject-specific measurements. Hence, the choice of modelling approach often has critical impact on the ability to identify clusters of biologically similar molecules.

We compared our modelling approaches LMMS and DLMMS to two single-step models using the workflow shown in

Trajectories derived from Linear Mixed Model Spline (LMMS) and Derivative Linear Mixed Model Spline (DLMMS) were compared to trajectories derived either from the mean or Smoothing Splines Mixed Effects (SME) models. Five clustering algorithms, namely hierarchical clustering (HC), k-means (KM), Self-Organizing Maps (SOM), model-based clustering (model) and Partitioning Around Medoids (PAM), were then applied to the modelled trajectories using a range of two to nine clusters. The performance of each algorithm was assessed using the Dunn index. Gene Ontology (GO) term enrichment analysis was performed on each of the obtained clusters.

For clustering, we compared the performance of five algorithms using the Dunn index [
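For reference, the Dunn index is the smallest distance between trajectories in different clusters divided by the largest distance between trajectories in the same cluster, so larger values indicate compact, well separated clusters. The sketch below assumes Euclidean distance between trajectories, which the text does not specify, and uses made-up trajectories.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Dunn index of a clustering given as a list of clusters of equal-length trajectories.

    Needs at least two clusters and at least one cluster with two members.
    """
    # Smallest distance between two trajectories assigned to different clusters.
    separation = min(
        dist(a, b) for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2
    )
    # Largest distance between two trajectories assigned to the same cluster.
    diameter = max(dist(a, b) for c in clusters for a, b in combinations(c, 2))
    return separation / diameter


good = [[(0, 0, 0), (0, 1, 0)], [(5, 5, 5), (5, 6, 5)]]
bad = [[(0, 0, 0), (5, 5, 5)], [(0, 1, 0), (5, 6, 5)]]
print(dunn_index(good), dunn_index(bad))  # the first value is the larger one
```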

We selected clustering algorithms for comparison based on representatives of different classes of standard techniques: a model-based algorithm (

A size-based Gene Ontology (GO) term enrichment analysis was then performed to validate the biological relevance of each cluster, using the hypergeometric distribution based on the number of molecules in the domain of interest [
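A hypergeometric enrichment test of this kind can be sketched as the upper-tail probability of seeing at least the observed number of annotated molecules in a cluster. The function name, the argument names and the counts are hypothetical, and the exact background definition used in the study is not recoverable from the text.

```python
from math import comb


def enrichment_p_value(n_background, n_annotated, cluster_size, n_hits):
    """P(X >= n_hits) when cluster_size molecules are drawn without replacement
    from n_background molecules, n_annotated of which carry the GO term."""
    upper = min(n_annotated, cluster_size)
    total = comb(n_background, cluster_size)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, cluster_size - k)
        for k in range(n_hits, upper + 1)
    )
    return tail / total


# 20 molecules in total, 5 with the term, a cluster of 4 containing 3 of them.
print(enrichment_p_value(20, 5, 4, 3))
```

The resulting p-values would still need a multiple testing correction across GO terms before they are compared with a threshold.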

While cluster analysis can provide valuable insight into behaviour patterns common to groups (clusters) of molecules, differential expression analysis in a time course experiment can highlight significant responses to perturbations of each molecule. Our LMMS framework enables assessment of the significant differences over time or between individual groups based on the whole molecular trajectory instead of analysing individual time points.

Let $h_i$ denote the group for each subject $i$, and define $h_{ir}$ to be the indicator for the $r^{th}$ group, that is, $h_{ir} = 1$ if $h_i = r$ and 0 otherwise. The mean curve $f_{h_i}$ in the full LMMSDE model is given by:

$$f_{h_i}(t_{ij}) = \beta_0 + \beta_1 t_{ij} + \sum_{k=1}^{K} u_k (t_{ij} - \kappa_k)_+ + \sum_{r=2}^{R} h_{ir} \Big( \alpha_{0r} + \alpha_{1r} t_{ij} + \sum_{k=1}^{K} v_{rk} (t_{ij} - \kappa_k)_+ \Big),$$

where $(x)_+ = \max(0, x)$, the knots $\kappa_k$ and spline coefficients $u_k$ are those of the spline model, $\alpha_0 = (\alpha_{0r})$ are the differences in intercept between each group and the first group; $\alpha_1 = (\alpha_{1r})$ are the differences in slope between each group and the first group; and $v_{rk}$ are the differences in spline coefficients between each group and the first group.

We can test different hypotheses depending on which parameters are equal to zero. Firstly, for a single group, $h_{ir} = 0$ for all $i$ and $r$, and time effects will be detected only if the goodness of fit of this model is better than that of the null model which fits only the intercept. Secondly, to detect differences between groups, we set $\beta_1 = 0$ and $\alpha_1 = 0$, and test the goodness of fit against the null model which also has $h_{ir} = 0$. Finally, if we include all parameters we can model the group * time interactions, by allowing different slopes and intercepts in the different groups. We compare this to the null model where the effects over time do not differ between groups. For each case we compared the fit of the expanded model from

We considered the performance of our filtering procedure in both proteomics and transcriptomics datasets. On the iTraq breast cancer dataset, model-based clustering identified one cluster with low $R_T$ and $R_I$ ratios, and a second cluster with high values for the two ratios. We therefore removed the molecules from that second cluster. Similar types of clusters were observed for all transcriptomics datasets.

Scatterplots of filter ratios $R_T$ on the x-axis against $R_I$ on the y-axis for

In total, between 35% and 76% of the data were removed (_{T} values large p-values. We can explain the large p-values for low _{T} in _{T} and _{I} values (

The filter ratios $R_T$ and $R_I$ were calculated for every molecule. Colors in

The number (proportion) of profiles modelled with each model selected by our proposed LMMS approach. Models are abbreviated as linear (LIN), spline (SPL), subject-specific intercept (SSI), and subject-specific intercept and slope (SSIS). Models were applied to the cell line breast cancer (Cell), yeast (Yeast), mouse (Mouse) and human kidney rejection (Human) datasets after filtering on $R_T$ and $R_I$.

Model | Cell | Yeast | Mouse | Human |
---|---|---|---|---|
LIN | 93 (.55) | 125 (.035) | 205 (.1) | 3 (.091) |
SPL | 75 (.45) | 3427 (.95) | 1769 (.87) | 3 (.091) |
SSI | | 30 (.008) | 56 (.028) | 10 (.3) |
SSIS | | 2 (.0005) | 3 (.002) | 17 (.51) |
# Modelled | 264 | 3586 | 2033 | 33 |
% Removed | 36 | 35 | 67 | 76 |

The power of our LMMS modelling lies in its ability to adaptively fit the complexity of the data. Since some molecules are more prone to subject-specific variations than others, we generally expect that a single model will be insufficient to appropriately model all types of trajectories. We illustrate our point through the application of LMMS to datasets with increasing organism complexity, from cell lines measured in a controlled environment to

We compared clustering of profiles from the iTraq breast cancer dataset which had been modelled with mean, SME, LMMS and DLMMS (

Clustering was performed on the summarized profiles obtained from _{2} transformed protein abundance.

This criterion resulted in different selections for these two quantities for the four modelling approaches (

We subsequently assessed the biological relevance of the proteins identified within each cluster with a GO term enrichment analysis. After removal of GO terms that contained only one molecule, we identified 62 unique enriched GO terms (adj. p-value ≤ 0.05,

Interestingly, among the enriched GO terms identified by LMMS or DLMMS we observed biological processes involved in glucose metabolic processes (GO:0006006), glycolysis (GO:0006096) and gluconeogenesis (GO:0006094). These processes play an important role in cancer progression [

We compared the proposed LMMSDE with LIMMA on the unfiltered simulated data with varying expression patterns and levels of noise. For each scenario, we recorded how many of 50 differentially expressed molecules were detected as significant after correction for multiple testing and calculated average sensitivity and specificity over all 100 replicates (

Averaged sensitivity for LMMSDE and LIMMA after 100 simulations. Differential expression between groups and/or time was tested with increasing noise and fold change (FC) levels.

Effect | Noise | FC | LMMSDE | LIMMA |
---|---|---|---|---|
time | 1 | 1.25 | 0.877 | 0.793 |
time | 1 | 1.5 | 0.981 | 0.963 |
time | 1 | 2 | 0.997 | 0.992 |
time | 3 | 1.25 | 0.044 | 0.019 |
time | 3 | 1.5 | 0.667 | 0.354 |
time | 3 | 2 | 0.939 | 0.838 |
group | 1 | 1.25 | 0.98 | 0.85 |
group | 1 | 1.5 | 0.997 | 0.976 |
group | 1 | 2 | 0.999 | 0.995 |
group | 3 | 1.25 | 0.66 | 0.053 |
group | 3 | 1.5 | 0.943 | 0.494 |
group | 3 | 2 | 0.986 | 0.881 |
group * time | 1 | 1.25 | 0.96 | 0.927 |
group * time | 1 | 1.5 | 0.993 | 0.987 |
group * time | 1 | 2 | 0.999 | 0.998 |
group * time | 3 | 1.25 | 0.347 | 0.124 |
group * time | 3 | 1.5 | 0.845 | 0.703 |
group * time | 3 | 2 | 0.965 | 0.938 |
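How sensitivity and specificity follow from one simulated dataset can be sketched as below, using the simulation layout of 140 profiles of which 50 are truly differentially expressed. The set of detected profiles is made up for illustration and does not reproduce any value in the table.

```python
def sensitivity_specificity(detected, truly_de, all_profiles):
    """Return (sensitivity, specificity) for one dataset; all arguments are sets."""
    true_pos = len(detected & truly_de)
    false_pos = len(detected - truly_de)
    false_neg = len(truly_de - detected)
    true_neg = len(all_profiles - detected - truly_de)
    return true_pos / (true_pos + false_neg), true_neg / (true_neg + false_pos)


all_profiles = set(range(140))
truly_de = set(range(50))                   # profiles 0..49 are differentially expressed
detected = set(range(45)) | {60, 61, 62}    # hypothetical calls: 45 true hits, 3 false alarms
sensitivity, specificity = sensitivity_specificity(detected, truly_de, all_profiles)
print(sensitivity, specificity)
```

Averaging these two numbers over the 100 simulated datasets of a scenario gives one cell per method in a table of this kind.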

We performed a differential expression analysis on the iTraq kidney rejection dataset to illustrate our LMMSDE analysis on complex and real data. In addition to applying the differential expression approaches LIMMA and LMMSDE on the full data set as in the simulated case study, we also applied our filtering approach for multiple conditions and removed profiles that were identified as non-informative in both conditions (64% of profiles were removed) before LMMSDE analysis. Filtering before differential expression analysis was only applied for LMMSDE, since removal of non-informative profiles should increase statistical power without biasing results. In contrast, filtering before LIMMA analysis affects posterior estimates and can bias p-values.

We compared LMMSDE and LIMMA in terms of the number of proteins declared as differentially expressed between the two groups and investigated their biological relevance with respect to the biological questions from the study. Two analyses were performed: to identify the molecules with significant differences between groups, and to identify molecules showing significant group*time interactions leading to different trends between the two groups over time. While no differentially expressed molecules were identified by LIMMA for either group or interaction effects, LMMSDE identified 35 differentially expressed proteins with a group effect and 12 proteins with a significant interaction effect (FDR adjusted p-value < 0.05). On the filtered dataset LMMSDE identified 13 molecules with a significant group effect and nine molecules with a significant interaction effect. Note that these differentially expressed proteins were also identified in the analysis of the full dataset. The effect size of differential proteins identified with both group and interaction effects tended to be small, with a magnitude of average fold change of < 1.5.
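The FDR adjustment behind a threshold such as "adjusted p-value < 0.05" can be sketched as follows. The surviving text does not say which procedure was used; this sketch assumes the Benjamini-Hochberg step-up adjustment, and the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Because the adjustment scales with the number of tests, removing non-informative profiles before testing reduces the penalty paid by the remaining ones.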

For the 13 (three not annotated) molecules that were declared as differentially expressed between groups, the top enriched biological process (

GO term enrichment analysis based on the proteins identified by LMMSDE as differentially expressed between Allograft Rejection (AR) and Non-Rejection (NR) patients after filtering using a 2-cluster model-based clustering based on $R_T$ and $R_I$. The top GO biological processes are listed along with their FDR adjusted p-value and log odds ratio (OR).

GO | GO Description | adj. p-value | log(OR) |
---|---|---|---|
GO:0010951 | negative regulation of endopeptidase activity | 4.30e-03 | 5.57 |
GO:0006956 | complement activation | 6.40e-03 | 6.45 |
GO:0006958 | complement activation, classical pathway | 6.40e-03 | 6.36 |
GO:0002576 | platelet degranulation | 8.30e-03 | 5.67 |
GO:0045471 | response to ethanol | 8.30e-03 | 5.55 |
GO:0042593 | glucose homeostasis | 8.80e-03 | 5.43 |
GO:0006935 | chemotaxis | 1.00e-02 | 5.14 |
GO:0007596 | blood coagulation | 1.00e-02 | 3.84 |
GO:0030168 | platelet activation | 1.70e-02 | 4.30 |

Out of the nine molecules (one not annotated) with a significant interaction between group and time, the most promising differentially expressed protein was IQ calmodulin-binding motif-containing protein 1 (IQCB1). This protein is particularly relevant to this study, as it is a nephrocystin protein localized to the primary cilia of renal epithelial cells. Mutations in the IQCB1 gene were shown to be strongly associated with Senior-Løken syndrome type 5, a disorder causing nephronophthisis and renal failure [

Thus far, very few methods have been developed to analyse high-throughput time course ‘omics’ data. Statistical analysis is challenging due to the high level of noise relative to signal in such data, and the time measurements add an extra dimension of variability both within and among subjects. Our data-driven approach focuses on magnifying the inherent signal, by removing non-informative profiles that potentially interfere in downstream analysis, and by using a linear mixed model spline framework to account for subject-specific variability. This procedure provides clearer signals in both clustering and differential expression analysis.

The filtering of non-informative profiles is an important first step in analysis, as such profiles otherwise introduce noise and reduce statistical power in downstream clustering and differential expression steps [_{T} and _{I} with the test statistics from differential expression analysis over time.

For multiple treatment groups, we filtered separately for each group, removing only molecules identified as non-informative in both groups. An alternative option would be to calculate the ratios for each group separately, but apply the model-based clustering to all ratios from all groups. We found very little difference between this alternative and filtering applied to each treatment group separately. With either approach, it is possible that molecules that vary between groups but show little change over time could be removed. However, these molecules, though differentially expressed, would be detected in a cross-sectional study, and are most likely not of primary interest in time course studies where the focus is on molecules changing expression over time.

In spite of the clear relationship between differential expression and filter ratios, we found the selection of thresholds to be challenging. Threshold choice can be affected by a variety of issues such as level of missing data and the number of replicates at each time point. In our analysis, we applied 2-cluster model-based clustering on the ratios to discriminate informative from non-informative profiles. However, we suggested guidelines to address these issues and our R package

Current modelling approaches for time course data fit the same statistical model to each molecule, allowing for either subject-specific intercepts [

In this study we clustered time course data based on their summarized profiles to identify groups of molecules representing relevant molecular processes. We did not consider here clustering of subjects to identify groups with similar sub-phenotypes. However, similar approaches can be applied to this alternate biologically interesting question [

Clustering analysis relies not only on the choice of algorithm, but also on the number of clusters and the distance metric. There are a variety of options available for all of these, but we have focused on common choices in this study, and expect that other options would produce similar results. We observed that application of different modelling approaches (e.g. mean, SME, LMMS) resulted in different input data structures for the clustering algorithms. As clustering outputs are highly dependent on the input data structure [

Differential expression analysis is often based on an underlying model of the data which attempts to explain changes over time, between groups, and through interactions while simultaneously accounting for noise in the data. We compared an approach based on linear models, LIMMA, with our approach, LMMSDE, which is based on our linear mixed model spline framework. An alternate spline-based approach is EDGE [

An additional benefit of LMMSDE was the ability to first perform filtering, which reduced the number of tests performed and increased our ability to detect truly differentially expressed molecules. The same type of analysis could not be performed with LIMMA, as its test statistic is based on an empirical Bayes approach using posterior estimators for the degrees of freedom and standard deviation. Therefore, filtering of low variance molecules would affect posterior estimates [

We proposed a novel framework for analysing time course ‘omics’ data, unifying quality control and filtering, modelling, and analysis in a linear mixed model spline framework. The first step ensures the reproducibility and interpretability of the data. The second step is a highly flexible data-driven approach aimed at modelling high-throughput data with potentially different noise levels and trajectories over time. It can handle missing values, has low computational burden, and avoids arbitrary input parameters. In the third step, similarities between profiles can be assessed through clustering, or differences over time and between groups can be assessed through LMMSDE. The unification of our modelling with clustering led to the identification of biologically relevant profile clusters. The unification of our modelling with differential expression analysis outperformed LIMMA in the situations of high noise levels and low fold changes. In application of LMMSDE to real data, this higher sensitivity resulted in novel identification of differentially expressed molecules biologically relevant to kidney rejection. The LMMS framework is implemented in the R package

The noise level is equal to that in the kidney rejection data and the groups of each individual are indicated in grey full lines (group 1) or black dashed lines (group 2). In

(PDF)

Filter ratios _{T} (x-axis) and _{I} (y-axis) are shown for: simulated data (_{10} p-values for Linear Mixed Model Spline for Differential Expression analysis (LMMSDE) test for differential expression over time (first column) and the proportion of missing values (second column).

(PDF)

using the mean (

(PDF)

Venn diagram of significantly enriched GO terms identified by clustering of the mean, Smoothing Splines Mixed Effects (SME), Linear Mixed Model Spline (LMMS) and Derivative LMMS (DLMMS) before (

(PDF)

Shown are the GO terms identified concordantly by clustering of at least two of the modelling approaches (Linear Mixed Model Spline (LMMS), Derivative LMMS (DLMMS), mean or Smoothing Splines Mixed Effects (SME)).

(PDF)

Enriched GO terms uniquely identified by clustering of the profiles modelled by the different approaches considered. For each enriched term, the cluster number (Cluster), the number of molecules with GO terms in that cluster (Counts), the number of molecules in the data with that GO term (NMol), the number of molecules in the cluster (Size), the GO description, ontology (Ont), false discovery rate adjusted p-value (adj. p), and log odds ratio (OR) are given. The table is sorted by p-value within each cluster. Linear Mixed Model Spline (LMMS), Derivative LMMS (DLMMS) and Smoothing Splines Mixed Effects (SME) use hierarchical clustering, while the mean uses PAM clustering. For LMMS three clusters were identified, while two clusters were identified for DLMMS, mean and SME.

(PDF)

The authors would like to thank the NCE CECR Prevention of Organ Failure (PROOF) Centre of Excellence team (Vancouver, British Columbia, Canada), and in particular the Principal Investigators Scott J. Tebbutt (Department of Medicine, Institute for HEART + LUNG Health, University of British Columbia), Bruce M. McManus (Department of Pathology and Laboratory Medicine, Institute for HEART + LUNG Health, University of British Columbia), Paul Keown (Department of Medicine, University of British Columbia), Rob McMaster (Department of Medical Genetics, University of British Columbia) and Raymond T. Ng (Department of Computer Science, University of British Columbia) for making the kidney data available to us.