## Abstract

Circadian oscillation in baseline gene expression plays an important role in the regulation of multiple cellular processes. Most of the knowledge of circadian gene expression is based on studies measuring gene expression over time. Our ability to dissect molecular events in time is determined by the sampling frequency of such experiments. However, the real peaks of gene activity can be at any time on or between the time points at which samples are collected. Thus, genes with a peak activity near an observation point have their phase of oscillation detected with better precision than those which peak between observation time points. Separating genes for which we can confidently identify peak activity from ambiguous genes can improve the analysis of time series gene expression. In this study we propose a new statistical method to quantify the phase confidence of circadian genes. The numerical performance of the proposed method has been tested using three real gene expression data sets.

**Citation:** El Anbari M, Fadda A, Ptitsyn A (2015) Confidence in Phase Definition for Periodicity in Genes Expression Time Series. PLoS ONE 10(7): e0131111. https://doi.org/10.1371/journal.pone.0131111

**Editor:** Ying Xu, University of Georgia, UNITED STATES

**Received:** January 13, 2015; **Accepted:** May 28, 2015; **Published:** July 10, 2015

**Copyright:** © 2015 El Anbari et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper and its Supporting Information files.

**Funding:** The authors have no support or funding to report.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Analysis of periodic patterns is an essential part of many studies of gene expression involving timeline sampling or targeting of rhythmically expressed genes. Recent publications report a large proportion of the entire transcriptome oscillating in a circadian (i.e. approximately daily) rhythm [1–3]. The number of genes for which circadian baseline can be identified as statistically significant over stochastic noise is traditionally thought to be under 10% [4–6], but more recently estimated as 43% [1] or even close to 100% [7], depending on the algorithms applied. Significance of the signal-to-noise ratio is the focus of most studies targeting rhythmic expression. The absolute amplitude and time of the peak (i.e. phase) of rhythmic gene expression are also analyzed and reported. However, we feel that one aspect of rhythmic gene expression requires additional consideration. It has been observed that low sampling frequency presents a significant challenge to all studies of periodic gene expression (see [7] for a review). Most gene expression studies only report 6 or 9 observation points per period and not more than two consecutive periods in the entire timeline. Some oscillating genes may have peak expression coinciding with, or near, an observation point (i.e. the time when the animal is sacrificed and tissue samples are taken for analysis). However, other genes may peak at any time between sparsely placed observations. Since our ability to differentiate events in time is restricted by the low sampling rate, how can we be sure that genes are expressed in the phase we identified? Would it be possible to make a quantitative estimation of confidence that a gene peaks at a certain time of the day? With such a metric we could separate a fraction of genes for which we know the true time of peak and analyze the function of genes at a given time with less noise (i.e. genes highly expressed, but peaking at a different time) mixed in.

To answer these questions and enable time-wise analysis of gene function and interactions, we propose a novel algorithm for estimating the confidence of phase assignment in the analysis of timeline expression profiles.

To answer the questions posed for this study we propose to use the bootstrap, which is a general technique for estimating unknown quantities associated with statistical models. Often the bootstrap is used to find

- standard errors for estimators,
- confidence intervals for unknown parameters,
- *p*-values for test statistics under a null hypothesis.

Here we use the maximum entropy bootstrap, which generates a large number of replicates (*R* = 999) that retain the basic shape, local peaks and troughs, and time dependence of the original time series, by being strongly dependent on it. The maximum entropy bootstrap is particularly useful for short time series.

## Materials and Methods

### Notations

- I: indicator function.
- *n*: the sample size.
- *p*: the number of genes.
- *χ* = {*X*_{1}, …, *X*_{n}}: random sample from population.
- $\chi^{*}$: resample obtained by sampling from *χ*.
- *α*: level of confidence.
- $\hat{\theta}$: estimate of *θ*, computed from *χ*.
- $\hat{\theta}^{*}$: bootstrap version of $\hat{\theta}$, computed from $\chi^{*}$.

### Phase estimation

We consider a gene expression time series {*x*_{1}, …, *x*_{n}}. Without loss of generality, suppose that the measurements are taken at time points *t* = 0, 4, 8, …, 44 h. We can then construct a collection of intervals named *phases* and labeled *G*_{0}, *G*_{1}, …, *G*_{5} such that
(1)
Let *θ* be the first *peak time* or the *phase* of the gene expression. We are interested in *estimating* and in a later step *constructing a confidence interval* for *θ*. More precisely, we want to construct an interval contained in one of the classes *G*_{i}, and that contains the estimated parameter with high probability.

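The expression profile of a gene exhibiting circadian rhythmicity approximates a cosine wave with a period *T* = 24 h. A significant correlation can therefore be found between a rhythmically expressed gene and a theoretical cosine wave cycling with an appropriate phase. The process of estimating *θ* consists of the following steps:

- Generate 6 cosine waves with the equation given below:

  $$\mathcal{C}_{\varphi}(t) = \cos\left(\frac{2\pi t}{24} - \varphi\right), \quad \varphi \in \left\{0, \frac{\pi}{3}, \frac{2\pi}{3}, \pi, \frac{4\pi}{3}, \frac{5\pi}{3}\right\}, \quad t = 0, 4, 8, \ldots, 44. \tag{2}$$

  The following properties apply: the period is 24 h, each wave is 48 h long (two cycles), and the interval between adjacent phases is 4 h. Fig 1 is a graphical representation of the cosine waves of Eq (2).
- Calculate the correlation coefficient between the gene expression profile and each of the 6 cosine waves $\mathcal{C}_{\varphi}$. Let $\hat{\rho} = (\hat{\rho}_{\varphi})$ denote the obtained vector of correlations. Let

  $$\hat{\rho}_{\max} = \max_{\varphi} \hat{\rho}_{\varphi}, \qquad \hat{\varphi} = \arg\max_{\varphi} \hat{\rho}_{\varphi} \tag{3}$$

  be the highest correlation and the phase of the corresponding cosine wave. The optimal $\hat{\varphi}$ is selected to be the representative of the circadian rhythmicity if the correlation is significant. Our parameter of interest *θ* is then estimated by the peak of the *best-correlated* cosine curve, and it is equal to

  $$\hat{\theta} = \frac{24\,\hat{\varphi}}{2\pi} = \frac{12\,\hat{\varphi}}{\pi}. \tag{4}$$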

**Fig 1.** Graph representing the cosine waves $\mathcal{C}_{\varphi}(t)$ for *t* = 0, 4, 8, …, 44 and *φ* ∈ {0, *π*/3, 2*π*/3, *π*, 4*π*/3, 5*π*/3}. The dotted vertical line shows the first *peak time*.
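The two estimation steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R implementation: it assumes the template waves have the form cos(2πt/24 − φ), uses the Pearson correlation, and applies it to a hypothetical noiseless profile peaking at 8 h. The names `pearson` and `estimate_phase` are ours.

```python
import math

PERIOD_H = 24.0   # circadian period T from the text
N_PHASES = 6      # six template waves, pi/3 apart

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def estimate_phase(times, expr):
    """Return (max correlation, best phase in radians, peak time in hours)."""
    best = None
    for k in range(N_PHASES):
        phi = k * math.pi / 3
        # Assumed template form: cos(2*pi*t/T - phi), which peaks at t = T*phi/(2*pi)
        wave = [math.cos(2 * math.pi * t / PERIOD_H - phi) for t in times]
        rho = pearson(expr, wave)
        if best is None or rho > best[0]:
            best = (rho, phi, PERIOD_H * phi / (2 * math.pi))
    return best

times = list(range(0, 48, 4))  # 0, 4, ..., 44 h
# Hypothetical noiseless expression profile peaking at 8 h
expr = [5.0 + 2.0 * math.cos(2 * math.pi * (t - 8) / PERIOD_H) for t in times]
rho_max, phi_hat, theta_hat = estimate_phase(times, expr)
print(rho_max, phi_hat, theta_hat)
```

Because correlation is invariant to the offset and amplitude of the profile, only the shape of the time series influences the selected phase.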

### Data resampling using Maximum Entropy Bootstrap Algorithm

Several bootstrap methods have been proposed for time series data. The most well-known is the *Moving Block Bootstrap*. This procedure works by dividing the observations into blocks of length *b* and then resampling the blocks (see Fig 2 for an illustration). The main problem with the block bootstrap is that the block length, *b*, which is a form of smoothing parameter, needs to be chosen. If the blocks are too short, the bootstrap samples cannot mimic the original sample, because the dependency is broken whenever we start a new block. If, on the other hand, the blocks are too long, we lose the randomness of the replicates. For these reasons, in this study we apply the maximum entropy bootstrap algorithm proposed by [8]. It does not impose strong assumptions, such as stationarity, on the distribution of the time series. A full description of the algorithm can be found in [9]. The replications are generated by the following steps:

1. Form the order statistics *x*_{(t)} by sorting the original data in increasing order, and keep the vector of ordering indices.
2. Using the order statistics obtained at step 1, compute the intermediate points *z*_{(t)} = (*x*_{(t)} + *x*_{(t + 1)})/2 for *t* = 1, …, *n* − 1.
3. For *t* = 2, …, *n*, construct the deviations *x*_{(t)} − *x*_{(t − 1)}, and calculate the trimmed mean *m*_{trm} of the obtained values. The lower limit for the left tail is *z*_{0} = *x*_{(1)} − *m*_{trm} and the upper limit for the right tail is *z*_{n} = *x*_{(n)} + *m*_{trm}. *z*_{0} and *z*_{n} are the new limiting intermediate points.
4. Compute the mean of the maximum entropy (ME) density within each interval while satisfying the *mean-preserving constraint*.
5. Generate uniformly distributed numbers on the [0, 1] interval, then calculate the sample quantiles of the ME density at the generated points and sort them.
6. Using the ordering indices of step 1, reorder the sorted sample. This preserves the dependence relationships among observations in the original data.
7. Repeat steps 2 to 6 many times; in our analysis we use *R* = 999.
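The steps above can be sketched as follows. This is a simplified reading of the algorithm, not the `meboot` implementation referenced in [9]: the 10% trimming fraction, the interval-mean weights (0.75/0.25 at the ends, 0.25/0.5/0.25 in the interior) and the function names are assumptions made for illustration, and the input series is hypothetical.

```python
import random

def trimmed_mean(values, trim=0.10):
    """Mean after dropping a fraction `trim` from each end of the sorted values."""
    v = sorted(values)
    k = int(len(v) * trim)
    core = v[k:len(v) - k] if len(v) > 2 * k else v
    return sum(core) / len(core)

def me_bootstrap_replicate(x, rng, trim=0.10):
    """One maximum-entropy-style replicate of the series x (sketch, see assumptions)."""
    n = len(x)
    # Step 1: order statistics and the ordering index
    order = sorted(range(n), key=lambda i: x[i])
    xs = [x[i] for i in order]
    # Step 2: intermediate points between consecutive order statistics
    z = [(xs[t] + xs[t + 1]) / 2 for t in range(n - 1)]
    # Step 3: tail limits from the trimmed mean of consecutive deviations
    m_trm = trimmed_mean([xs[t] - xs[t - 1] for t in range(1, n)], trim)
    z = [xs[0] - m_trm] + z + [xs[-1] + m_trm]  # n + 1 interval limits
    # Step 4: target interval means (assumed mean-preserving weights)
    means = [0.75 * xs[0] + 0.25 * xs[1]]
    means += [0.25 * xs[t - 1] + 0.5 * xs[t] + 0.25 * xs[t + 1]
              for t in range(1, n - 1)]
    means.append(0.25 * xs[-2] + 0.75 * xs[-1])
    # Step 5: quantiles of the piecewise-uniform density at uniform draws
    q = []
    for _ in range(n):
        p = rng.random()
        k = min(int(p * n), n - 1)                 # interval containing p
        value = z[k] + (p * n - k) * (z[k + 1] - z[k])
        value += means[k] - 0.5 * (z[k] + z[k + 1])  # shift to the target mean
        q.append(value)
    q.sort()
    # Step 6: put the sorted quantiles back in the original time order
    replicate = [0.0] * n
    for rank, idx in enumerate(order):
        replicate[idx] = q[rank]
    return replicate

# Hypothetical 12-point expression profile
x = [2.1, 3.4, 5.9, 7.2, 6.0, 4.1, 2.5, 3.0, 5.5, 7.8, 6.3, 3.9]
rng = random.Random(0)
replicates = [me_bootstrap_replicate(x, rng) for _ in range(999)]  # Step 7
```

By construction each replicate has the same rank ordering in time as the original series, which is why local peaks and troughs are retained.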

**Fig 2.** Graph showing the principle of the moving block bootstrap. The moving block bootstrap randomly selects blocks of the original data (top) and concatenates them (center) to form a resample (bottom).

A complete simulated example for illustration of each step of the algorithm can be found in [9]. Fig 3 shows one gene expression time series from the IWAT data, along with 24 different replicates of the series chosen randomly from 999 used in the analysis. Due to the fact that the maximum entropy algorithm tries to retain all the properties of the data, one can see that the replicates remain close to the original time series.

**Fig 3.** (Top left panel) A gene expression time series from the IWAT data. (Remaining panels) Set of 24 replications randomly chosen from the 999 maximum entropy bootstrap samples used in the analysis.

### The Bootstrap Approach for *p*-value

Let $\hat{\tau}$ denote the realized value of a test statistic *τ* computed for a particular sample. Then $p(\hat{\tau}) = \Pr(\tau \geq \hat{\tau} \mid H_{0})$ is the definition of the *p*-value in situations where large values of *τ* support the alternative hypothesis. The process of calculating the *p*-value consists of the following steps:

- Specify a way to generate bootstrap samples that resemble the real data while satisfying the null hypothesis *H*_{0}. In our case we will use the **Maximum Entropy Bootstrap Algorithm**.
- Let **MEBA** denote this **bootstrap data-generating process**.
- Using **MEBA**, generate *R* = 999 bootstrap samples indexed by *j*. From each of them, compute a bootstrap test statistic $\tau_{j}^{*}$. To estimate a bootstrap *p*-value, we use

  $$\hat{p}^{*}(\hat{\tau}) = \frac{1 + \sum_{j=1}^{R} I(\tau_{j}^{*} \geq \hat{\tau})}{R + 1}. \tag{5}$$

  Arguments in favor of this formula for calculating the *p*-value, instead of the classical formula $\frac{1}{R}\sum_{j=1}^{R} I(\tau_{j}^{*} \geq \hat{\tau})$, can be found in ([10], pp. 148, 161). For example, if 73 of the $\tau_{j}^{*}$ are greater than $\hat{\tau}$, then $\hat{p}^{*}(\hat{\tau}) = (1 + 73)/(999 + 1) = 0.074$.
- Reject the null hypothesis *H*_{0} if $\hat{p}^{*}(\hat{\tau}) \leq \alpha$, where *α* is a given constant satisfying 0 < *α* < 1. In general we take *α* = 0.05.

This algorithm will be used to assess the significance of the correlation between a gene expression time series and one of the cosine waves of Eq (2).
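As a small numerical check of the counting rule, the sketch below assumes the bootstrap *p*-value takes the form (1 + number of bootstrap statistics at least as large as the observed one)/(*R* + 1), as recommended in [10]. The statistic values are hypothetical and chosen only so that 73 of the 999 bootstrap statistics exceed the observed one.

```python
def bootstrap_p_value(tau_hat, tau_boot):
    """Assumed form: (1 + #{tau*_j >= tau_hat}) / (R + 1)."""
    exceed = sum(1 for t in tau_boot if t >= tau_hat)
    return (1 + exceed) / (len(tau_boot) + 1)

# Hypothetical values: 73 of R = 999 bootstrap statistics exceed the observed one
tau_hat = 0.80
tau_boot = [0.90] * 73 + [0.10] * 926
p_hat = bootstrap_p_value(tau_hat, tau_boot)
print(p_hat)                # 0.074
print(p_hat <= 0.05)        # False: H0 is not rejected at alpha = 0.05
```

Adding one to both numerator and denominator means the estimate can never be exactly zero, so the smallest attainable *p*-value with *R* = 999 is 0.001.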

### Bootstrap Percentile Confidence Interval

The main focus of this paper is to give an accurate approximate confidence interval for the *peak time* parameter *θ*. Computing such confidence intervals is very challenging when the underlying distributions are difficult to represent mathematically. The bootstrap is a general class of methods for constructing confidence intervals without making strong distributional assumptions about the data or the statistic being calculated. There are several ways to construct bootstrap confidence intervals. They vary in ease of calculation and accuracy. There have been three main lines of development: Efron’s original percentile method [11], the bootstrap *t* interval introduced in [12], and the double bootstrap interval introduced in [13]. In this work, due to its simplicity and good performance, we use the Bootstrap Percentile Confidence Interval.

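Let $\hat{\theta}$ be an estimator of *θ* on the measured data *X*_{1}, …, *X*_{n}, and $\hat{\theta}^{*}$ be its analog on a bootstrapped sample $X_{1}^{*}, \ldots, X_{n}^{*}$, then:

$$K_{boot}(x) = \frac{1}{R}\sum_{b=1}^{R} I(\hat{\theta}_{b}^{*} \leq x), \tag{6}$$

where *K*_{boot} is the empirical distribution function of the bootstrap values. Efron’s (1979) original 100(1 − 2*α*)% bootstrap *percentile interval* simply takes the empirical 100*α* and 100(1 − *α*) percentiles of the bootstrap values $\hat{\theta}_{1}^{*}, \ldots, \hat{\theta}_{R}^{*}$. The 100(1 − 2*α*)% percentile interval is then

$$\left[K_{boot}^{-1}(\alpha),\; K_{boot}^{-1}(1 - \alpha)\right], \tag{7}$$

where $K_{boot}^{-1}$ is the inverse, or generalized inverse, distribution function (the quantile function). The name percentile comes from the fact that $K_{boot}^{-1}(\alpha)$ and $K_{boot}^{-1}(1 - \alpha)$ are percentiles of the bootstrap distribution *K*_{boot} in Eq (6). In practice, we proceed as follows: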

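- Generate *R* bootstrap samples of size *n* using the maximum entropy algorithm.
- Estimate the parameter *θ* of interest for each bootstrap sample: $\hat{\theta}_{b}^{*}$ for *b* = 1, …, *R*.
- Order the bootstrap replications of $\hat{\theta}$ such that $\hat{\theta}_{(1)}^{*} \leq \ldots \leq \hat{\theta}_{(R)}^{*}$. The lower and upper confidence bounds are the (*R*·*α*)^{th} and (*R*·(1 − *α*))^{th} ordered elements, respectively. The estimated (1 − 2*α*) confidence interval of *θ* is

  $$\left[\hat{\theta}_{(R\alpha)}^{*},\; \hat{\theta}_{(R(1-\alpha))}^{*}\right]. \tag{8}$$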

Fig 4 summarizes the steps of the Bootstrap Percentile confidence interval principle.

**Fig 4.** Schematic of the bootstrap process. We want to estimate a confidence interval for the phase *θ*(*χ*). *R* training sets $\chi^{1*}, \ldots, \chi^{R*}$, each of size *n*, are generated using an appropriate resampling mechanism. The quantity of interest *θ*(*χ*) is computed from each bootstrap training set, and the values $\hat{\theta}_{1}^{*}, \ldots, \hat{\theta}_{R}^{*}$ are used to construct a confidence interval for the quantity *θ*(*χ*).

**Remark 1**. If *R*·*α* is not an integer, the following procedure can be used:

Let *k* = [(*R* + 1)*α*], the largest integer ≤ (*R* + 1)*α*. Then we define the empirical *α* and (1 − *α*) quantiles by the *k*^{th} and (*R* + 1 − *k*)^{th} ordered values of $\hat{\theta}^{*}$, respectively. So if *R* = 999 and *α* = 2.5% these are the 25^{th} and 975^{th} ordered elements.
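The percentile rule, including the integer adjustment of Remark 1, can be written compactly. The sketch below is illustrative only: the function name and the small tolerance used to guard against floating-point rounding in (*R* + 1)*α* are our choices, and the bootstrap values 1, …, 999 are hypothetical so that the selected order statistics are easy to read off.

```python
import math

def percentile_interval(theta_boot, alpha):
    """Return the k-th and (R + 1 - k)-th ordered bootstrap values,
    with k = floor((R + 1) * alpha)."""
    r = len(theta_boot)
    ordered = sorted(theta_boot)
    # Small tolerance so that e.g. 1000 * 0.025 is never floored to 24
    k = math.floor((r + 1) * alpha + 1e-9)
    if k < 1:
        raise ValueError("too few bootstrap replications for this alpha")
    return ordered[k - 1], ordered[r - k]  # 1-based positions k and R + 1 - k

# Hypothetical bootstrap values 1..999: position and value coincide
theta_boot = list(range(1, 1000))
lower, upper = percentile_interval(theta_boot, alpha=0.025)
print(lower, upper)  # 25 975
```

With *R* = 999 the quantity (*R* + 1)*α* is an integer for the usual choices of *α*, which is one practical reason for using 999 rather than 1000 replications.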

We now have all the pieces needed to carry out the phase confidence analysis. Algorithm 1 summarizes the details of the proposed approach.

**Algorithm 1**: Confidence in phase definition for periodicity in genes expression time series

**Data**: *χ* = {*x*_{1}, …, *x*_{n}}: *n* realizations of a gene expression time series, the number of replications *R*, and a confidence level *α*.

**Result**: Bootstrapped *p*-value $\hat{p}^{*}$, Bootstrap Percentile Confidence Interval $[\hat{\theta}_{(R\alpha)}^{*}, \hat{\theta}_{(R(1-\alpha))}^{*}]$.

1 **for** *b* ← 1 **to** *R* **do**

2 Using the maximum entropy bootstrap algorithm, generate a bootstrap sample *χ*^{b*};

3 Calculate the maximum correlation using formula Eq (3);

4 Estimate the peak time using formula Eq (4);

5 Calculate the bootstrapped *p*-value using Eq (5);

6 **if** $\hat{p}^{*} \leq \alpha$ **then**

7 the gene is considered as circadian.

8 Calculate the Bootstrap Percentile Confidence Interval using Eq (8).

9 **if** *there exists* *i* ∈ {0, …, 5} such that $[\hat{\theta}_{(R\alpha)}^{*}, \hat{\theta}_{(R(1-\alpha))}^{*}] \subset G_{i}$ **then**

10 the gene is assigned to the phase *G*_{i}, where the *G*_{i} are defined in Eq (1).
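The final assignment step (lines 9 and 10) amounts to an interval-containment check. The sketch below assumes, purely for illustration, that each phase *G*_{i} is the 4-hour window [4*i* − 2, 4*i* + 2) centred on the observation time 4*i* h; Eq (1) remains the authoritative definition, and the function name and example intervals are hypothetical.

```python
def assign_phase(ci_lower, ci_upper, step_h=4.0, n_phases=6):
    """Return the index i of the phase window containing the whole
    confidence interval, or None if no single window contains it.

    ASSUMPTION: G_i = [4i - 2, 4i + 2) hours; see Eq (1) for the
    definition actually used in the paper.
    """
    half = step_h / 2
    for i in range(n_phases):
        centre = i * step_h
        if centre - half <= ci_lower and ci_upper < centre + half:
            return i
    return None

# Hypothetical confidence intervals (hours)
print(assign_phase(8.0, 8.0))   # 2: every bootstrap peak falls at 8 h
print(assign_phase(4.0, 8.0))   # None: the interval spans two phases
```

Genes for which the function returns `None` are the ambiguous ones: they may well be rhythmic, but their peak time cannot be tied to a single phase at the chosen confidence level.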

## Results, Discussion, and Conclusions

We conducted experiments on three previously published real data sets. The data are derived from a microarray study of gene expression in three mouse tissues, referred to as Inguinal White Adipose Tissue (IWAT), Brown Adipose Tissue (BAT) and Liver. Each individual data set contains more than 22,000 gene expression profiles. Each profile consists of 12 time points sampled at 4-h intervals. See [14] for a detailed description. In the first step of our analysis, we estimated the phase of each gene using Eq (4), and we identified circadian gene expression using Algorithm 1. We note here that our aim is not to identify all the circadian genes; we are more interested in genes for which the peak time is near one of the time points at which the measurements are taken. Detection of circadian genes can be performed with more sophisticated approaches such as Fisher’s *g*-test, autocorrelation or a permutation test (see [15] for more details). This estimation revealed 646 oscillatory genes in the IWAT data, 680 in the BAT data, and 747 in the Liver data for which the bootstrapped *p*-value was ≤ 0.05, representing 6.9%, 7.15%, and 7.6% of the number of oscillatory genes obtained by applying a permutation test, respectively.

We used our proposed method to calculate a 95% confidence interval for the *peak time* of the oscillating genes, and then we assigned a phase to each of them using the following rule: a circadian gene is assigned to a Phase *G*_{i} if $[\hat{\theta}_{(R\alpha)}^{*}, \hat{\theta}_{(R(1-\alpha))}^{*}] \subset G_{i}$.

The results of phase classification are summarized in Table 1 and Fig 5. In the IWAT data, and with a confidence level of at least 95%, 28 genes peak at Phase *G*_{0}, 47 at Phase *G*_{1}, 223 at Phase *G*_{2}, 128 at Phase *G*_{3}, 117 at Phase *G*_{4}, and 103 at Phase *G*_{5}, representing 4.33%, 7.27%, 34.52%, 19.81%, 18.11%, and 15.94% of the oscillating genes, respectively. In the BAT data, 57 peak at Phase *G*_{0}, 110 at Phase *G*_{1}, 158 at Phase *G*_{2}, 145 at Phase *G*_{3}, 147 at Phase *G*_{4}, and 63 at Phase *G*_{5}, representing 8.38%, 16.17%, 23.23%, 21.32%, 21.61%, and 9.26% of the oscillating genes, respectively. For the Liver data set, 68 genes peak at Phase *G*_{0}, 136 at Phase *G*_{1}, 164 at Phase *G*_{2}, 176 at Phase *G*_{3}, 136 at Phase *G*_{4}, and 67 at Phase *G*_{5}, representing 9.10%, 18.20%, 21.95%, 23.56%, 18.20%, and 8.97% of the oscillating genes, respectively.

The method that we propose for estimating phase assignment confidence allows some useful observations even on the test data. For instance, we may ask how uniform gene expression is over time. For experiments collecting data on a circadian timeline we can formulate the null hypothesis that the same number of genes can be confidently assigned to each phase group. The alternative hypothesis is that at least one phase group has a significantly different number of genes. Both hypotheses are consistent with the overall number of rhythmically expressed genes and cannot be tested without a quantitative estimation of the confidence of phase assignment. In our test data we apply the same *p* = 0.05 threshold, but observe fewer genes peaking at one of the phases. In biological terms this means that in murine adipose tissue there is a period (morning hours) when the overall gene expression activity is lower compared to all other times of the day.

**Fig 5.** Bar plot summarizing the number of genes in each phase for the IWAT, BAT and Liver data sets from the results in Table 1.
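One simple way to examine the null hypothesis of equal phase-group sizes is a chi-square goodness-of-fit statistic computed from the counts reported above. This is an illustrative choice on our part, not an analysis performed in this study, and it treats genes as independent, which co-regulated genes are not; the result should therefore be read as a rough indication only. The critical value 11.0705 is the 0.95 quantile of the chi-square distribution with 5 degrees of freedom.

```python
CHI2_CRIT_DF5_05 = 11.0705  # 0.95 quantile, chi-square with 5 degrees of freedom

def chi_square_uniform(counts):
    """Goodness-of-fit statistic against equal expected counts per phase."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

# Genes confidently assigned to phases G_0..G_5, as reported in the text
phase_counts = {
    "IWAT": [28, 47, 223, 128, 117, 103],
    "BAT": [57, 110, 158, 145, 147, 63],
    "Liver": [68, 136, 164, 176, 136, 67],
}

results = {}
for tissue, counts in phase_counts.items():
    stat = chi_square_uniform(counts)
    results[tissue] = (stat, stat > CHI2_CRIT_DF5_05)
    print(tissue, round(stat, 1), "reject uniformity:", stat > CHI2_CRIT_DF5_05)
```

Under these simplifying assumptions the statistic exceeds the critical value in all three tissues, in line with the observation that one or more phase groups contain markedly fewer genes than the others.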

More importantly, our method can be applied to increase the precision of observation in many studies involving timeline observation of gene expression. The sampling frequency still limits our ability to separate molecular events (such as the peak of gene expression) in time. To know the time of peak expression more precisely, the experiment has to be repeated with a higher number of time points (for example, one sample every 2 hours rather than every 4 hours). However, with our method we can refine the existing data. For the groups peaking at a certain time we can be confident (at a selected confidence level) that certain genes peak at that time, and we can filter out genes peaking sometime between our observation time points. This confidence is essential for functional annotation of co-expressed genes and can be critical in analysis of permutation of gene activity in reaction to environment or medication.

## Strengths and boundaries

We compare the proposed method with some competing algorithms, namely Fisher’s *g*-test [16], the permutation test [15], and JTK-CYCLE [17]. All methods except the permutation test are implemented in R, and run on an Intel Core i7 at 3.40 GHz. The permutation test is implemented in C++. Tables 2, 3 and 4 show some results for the IWAT, BAT and Liver data sets.

In this paper, we are interested in genes that may have a peak expression coinciding with, or near, one of the observation points. We approximate their expression profiles by an ideal cosine wave of the form:

$$\mathcal{C}_{\varphi}(t) = \cos\left(\frac{2\pi t}{T} - \varphi\right). \tag{9}$$

We know that for circadian genes we have *T* = 24 h. For the data sets used in this paper, the measurement times are *t* ∈ {0, 4, 8, …, 44}. Since we are interested in the first peak expression time, the possible time points to be considered are *t* ∈ {0, 4, 8, …, 20}. If we require $\mathcal{C}_{\varphi}$ to peak at each *t* ∈ {0, 4, 8, …, 20}, i.e. 2*πt*/*T* − *φ* = 0, we obtain *φ* ∈ {0, *π*/3, 2*π*/3, *π*, 4*π*/3, 5*π*/3}; this explains the use of *π*/3 as the phase resolution in Eq (2). If we choose a different phase resolution, the peak times of the generated ideal cosine waves will not necessarily coincide with the time points at which the measurements were taken. Nevertheless, the method is general. It can work for periods other than 24 hours, for differently spaced time points, and with a larger number of cosine waves separated by smaller phase steps. For example, for any integer *k* we can generate 2*k* cosine waves using the equation:

$$\mathcal{C}_{\varphi}(t) = \cos\left(\frac{2\pi t}{T} - \varphi\right), \quad \varphi \in \left\{0, \frac{\pi}{k}, \frac{2\pi}{k}, \ldots, \frac{(2k-1)\pi}{k}\right\}. \tag{10}$$

Table 2 shows some timing results for *k* = 30, which generates 60 cosine waves. Results are given for *R* ∈ {9, 99, 999} bootstrap replications. Like any method based on resampling, the proposed method can be computationally expensive, because it involves fitting the same statistical method a large number of times using different replications of the original data. We can see that the average CPU timings increase with the number of generated cosine waves and the number of bootstrap replications.
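The relation between the number of template waves and the resolution of the estimated peak time can be illustrated with a short sketch. It assumes waves of the form cos(2πt/T − φ) with phases *jπ*/*k* for *j* = 0, …, 2*k* − 1, so that each wave peaks at *t* = *Tφ*/(2*π*); the function name is ours.

```python
import math

def phase_grid(k, period_h=24.0):
    """Return (phase in radians, first peak time in hours) for 2k template waves.

    ASSUMPTION: waves are cos(2*pi*t/T - phi) with phi = j*pi/k, so the first
    peak of wave j is at t = T*phi/(2*pi) = T*j/(2k).
    """
    return [(j * math.pi / k, period_h * j / (2 * k)) for j in range(2 * k)]

coarse = phase_grid(3)    # six waves, pi/3 apart
fine = phase_grid(30)     # sixty waves, pi/30 apart

print([t for _, t in coarse])      # [0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
print(len(fine), fine[1][1])       # 60 waves, peaks 0.4 h apart
```

With *k* = 3 the candidate peak times coincide exactly with the 4-hourly sampling times, whereas *k* = 30 gives candidate peaks every 0.4 h, most of which fall between observation points.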

The number of bootstrap replications *R* is in {9, 99, 999}.

Table 3 shows some timing results for the three different data sets; Fisher’s *g*-test is the fastest, followed by JTK-CYCLE and then the proposed method (one replication). We note here that the computing performance of the proposed method can be enhanced considerably (see **Remark 4**).

Table 4 shows the number of identified circadian genes. The permutation test identifies the highest number, followed by JTK-CYCLE and then Fisher’s *g*-test. Our method is not designed to detect all the circadian genes. Rather, it detects, with high confidence, the circadian genes for which the peak time (the phase) is near one of the time points, estimates this phase, and constructs a confidence interval for it. This explains the small number of circadian genes detected by our method compared to the competitors.

**Remark 2**. This experiment design is rather typical for circadian biology. Some experiments collect samples at different intervals, such as 3h or, rarely, every 2h. Higher sampling frequency improves resolution ability, but costs a lot more and is harder to implement.

**Remark 3**. Gene expression profiles are analyzed independently, thus it is possible that a researcher may find few or none of the genes confidently peaking at a given time. In fact, in the data set on which we tested the method, gene expression has a quiet period during which relatively few genes are active.

**Remark 4**. We note that the computational performance of the proposed method can be enhanced. In fact, if we avoid using *loops* in the R script that process one element per iteration, and instead use the *apply* family of functions that process whole rows, columns, or lists, the computing time is reduced significantly. In this case we need just 0.001 second to run the method for one replication using Eq (2), and 0.008 second to run the method with the higher number of cosine waves of Eq (10).

## Acknowledgments

We thank Christopher Leonard from QScience, Qatar Foundation, for improving the quality of the manuscript.

## Author Contributions

Analyzed the data: ME. Wrote the paper: ME AP. Developed the algorithm and implemented the code: ME. Discussed the results: AF AP. Formulated the problem: AP.

## References

- 1. Zhang R, Lahens NF, Ballance HI, Hughes ME, Hogenesch JB. A circadian gene expression atlas in mammals: implications for biology and medicine. Proceedings of the National Academy of Sciences. 2014;111(45):16219–16224.
- 2. Klevecz RR, Li CM, Marcus I, Frankel PH. Collective behavior in gene regulation: the cell is an oscillator, the cell cycle a developmental process. FEBS journal. 2008;275(10):2372–2384. pmid:18410382
- 3. Ptitsyn AA, Reyes-Solis G, Saavedra-Rodriguez K, Betz J, Suchman EL, Carlson JO, et al. Rhythms and synchronization patterns in gene expression in the Aedes aegypti mosquito. BMC genomics. 2011;12(1):153. pmid:21414217
- 4. Panda S, Antoch MP, Miller BH, Su AI, Schook AB, Straume M, et al. Coordinated transcription of key pathways in the mouse by the circadian clock. Cell. 2002;109(3):307–320. pmid:12015981
- 5. Storch KF, Lipan O, Leykin I, Viswanathan N, Davis FC, Wong WH, et al. Extensive and divergent circadian gene expression in liver and heart. Nature. 2002;417(6884):78–83. pmid:11967526
- 6. Bray MS, Shaw CA, Moore MW, Garcia RA, Zanquetta MM, Durgan DJ, et al. Disruption of the circadian clock within the cardiomyocyte influences myocardial contractile function, metabolism, and gene expression. American Journal of Physiology-Heart and Circulatory Physiology. 2008;294(2):H1036–H1047. pmid:18156197
- 7. Ptitsyn AA, Gimble JM. True or false: All genes are rhythmic. Annals of medicine. 2011;43(1):1–12. pmid:21142579
- 8. Vinod HD. Maximum entropy ensembles for time series inference in economics. Journal of Asian Economics. 2006;17(6):955–978.
- 9. Vinod HD, López-de Lacalle J. Maximum entropy bootstrap for time series: the meboot R package. Journal of Statistical Software. 2009;29(5):1–19.
- 10. Davison AC, Hinkley DV. Bootstrap methods and their application. vol. 1. Cambridge university press; 1997.
- 11. Efron B. Bootstrap methods: another look at the jackknife. The annals of Statistics. 1979;p. 1–26.
- 12. Efron B. The jackknife, the bootstrap and other resampling plans. vol. 38. SIAM; 1982.
- 13. Hall P. On the bootstrap and confidence intervals. The Annals of Statistics. 1986;p. 1431–1452.
- 14. Zvonic S, Ptitsyn AA, Conrad SA, Scott LK, Floyd ZE, Kilroy G, et al. Characterization of peripheral circadian clocks in adipose tissues. Diabetes. 2006;55(4):962–970. pmid:16567517
- 15. Ptitsyn AA, Zvonic S, Conrad SA, Scott LK, Mynatt RL, Gimble JM. Circadian clocks are resounding in peripheral tissues. PLoS Comput Biol. 2006;2(3):e16. pmid:16532060
- 16. Wichert S, Fokianos K, Strimmer K. Identifying periodically expressed transcripts in microarray time series data. Bioinformatics. 2004;20(1):5–20. pmid:14693803
- 17. Hughes ME, Hogenesch JB, Kornacker K. JTK CYCLE: an efficient nonparametric algorithm for detecting rhythmic components in genome-scale data sets. Journal of biological rhythms. 2010;25(5):372–380. pmid:20876817