A novel mathematical method for disclosing oscillations in gene transcription: A comparative study

Circadian rhythmicity, the 24-hour cycle responsive to light and dark, is determined by periodic oscillations in gene transcription. This phenomenon has broad ramifications in physiologic function. Recent work has disclosed more cycles in gene transcription, and to the uncovering of these we apply a novel signal processing methodology known as the pencil method and compare it to conventional parametric, nonparametric, and statistical methods. Methods: In order to assess periodicity of gene expression over time, we analyzed a database derived from livers of mice entrained to a 12-hour light/12-hour dark cycle. We also analyzed artificially generated signals to identify differences between the pencil decomposition and other alternative methods. Results: The pencil decomposition revealed hitherto-unsuspected oscillations in gene transcription with 12-hour periodicity. The pencil method was robust in detecting the 24-hour circadian cycle that was known to exist, as well as confirming the existence of shorter-period oscillations. A key consequence of this approach is that orthogonality of the different oscillatory components can be demonstrated. thus indicating a biological independence of these oscillations, that has been subsequently confirmed empirically by knocking out the gene responsible for the 24-hour clock. Conclusion: System identification techniques can be applied to biological systems and can uncover important characteristics that may elude visual inspection of the data. Significance: The pencil method provides new insights on the essence of gene expression and discloses a wide variety of oscillations in addition to the well-studied circadian pattern. This insight opens the door to the study of novel mechanisms by which oscillatory gene expression signals exert their regulatory effect on cells to influence human diseases.


Introduction
Gene transcription is the process by which the genetic code residing in DNA is transferred to RNA in the nucleus as the inauguration of protein synthesis. The latter process is called translation and occurs in the cytoplasm of the cell. Circadian rhythm, the 24-hour cycle that governs many functions of the cell, is the result of a complex interaction of transcriptional and translational processes. The importance of circadian rhythm to physiologic processes has been underscored in 2017 by the awarding of the Nobel Prize in Physiology or Medicine to the investigators who described the molecular mechanisms controlling it. However, in addition to the circadian oscillation driven by light and dark, other so-called infradian and ultradian rhythms have clear biologic import. Blood pressure, some circulating hormones, and some physiological functions appear to have 12-hour periodicity whereas other processes such as the menstrual cycle more closely follow a lunar cycle.
Accordingly, we sought to uncover novel 12-hour oscillations in gene expression. In many cases, the 12-hour gene oscillation is superimposed on the 24-hour cycle; thus it is hidden in conventional analysis. Additionally, experiments designed to elucidate the 24-hour circadian often do not have the granularity required to reveal an interval of less than 24 hours as they are constrained by the Shannon-Nyquist Sampling Theorem [1].
To reveal periodicities in gene expression other than the 24-hour circadian cycle, we applied digital signal processing methodology to this biologic phenomenon. Although this approach is, to our knowledge, less commonly used in the biological field, it is justified because the transcription of DNA to RNA is indeed a signal, packed with information for making the enormous repertoire of proteins.
To extract the fundamental oscillations (amplitude and period) present in the data, we utilized publicly available time-series microarray datasets on circadian gene expression in mouse liver (under constant darkness) [2] and analyzed over 18,000 genes spanning a variety of cellular process ranging from core clock control, metabolism, and cell cycle to the unfolded protein responses (UPR), a measure of cell stress. In addition, one set of measurements of RER (respiratory exchange ratio) from wild-type mice (generated by us) was also performed. We constructed linear, discrete-time, time-invariant models of low order, driven by initial conditions, which approximately fit the data and thus reveal the fundamental oscillations present in each data set. In addition to the 24-hour (circadian) cycle known to be present, other fundamental oscillations have been revealed using our approach.

Methods
We searched for 12-hour oscillations in several biological systems. Systems were chosen that represented not only gene transcription but also phenotype; they represent the way in which these biological systems are expressed in the whole organism. The reasoning was that if the 12-hour oscillation in transcription was biologically significant, it would be represented in some measurable function of the cell.
Initially, we analyzed a set of transcription data [2] that was collected in mouse liver obtained from animals in constant darkness after being entrained in a 12-hour light/12-hour dark environment. Mice were sacrificed at 1-hour intervals for 48 hours, thus providing enough data points to analyze the signal. The dataset thus obtained contains RNA values for all coding genes. The RNA data were generated using a standard microarray methodology. In addition, RER (respiratory exchange ratio) measurements in mice were also measured and analyzed. The novelty in our analysis consists in using the so-called matrix-pencil method [3]. This is a data-driven system-identification method. It constructs dynamical systems based on time-series data and finds the dominant oscillations present in the ultradian or infradian rhythms. Our purpose here is to compare this method with other established strategies for spectral estimation, including both parametric spectrum estimation methods like MUSIC (MUltiple Signal Classification), ESPRIT (Estimation of Signal Parameters via Rotational Invariance Techniques), and Prony's (least squares) as well as classical nonparametric models like wavelet transforms and statistical methods like RAIN. These are compared with each other using both artificial and measured data.

Basic signal processing methods
• The data. We consider finite records of data resulting as described above. Generically they are denoted by y i , i = 1, Á Á Á, N.
• Basic model: sum of exponentials. We seek to approximate the data by means of linear combinations of exponentials plus noise. Thus we seek k pairs of complex numbers is the noiseless part of the signal and w(t) is the noise. The requirement is: y(m) % y m , m = 1, 2, Á Á Á, N. Existing approaches to address this problem are MUSIC, ESPRIT, Prony's (least squares) method, wavelet transform and statistical methods described later.
• Second model: descriptor representation. The equivalent descriptor model uses an associated internal variable xðtÞ 2 R k of the system. The resulting equations are: Exðt þ 1Þ ¼ AxðtÞ; yðtÞ ¼ CxðtÞ þ wðtÞ; x 2 R k ; ð2Þ • Third model: AR (Auto Regressive) representation. The above model can also be expressed as an AR model driven by an initial condition. As above we let y(t) = y Ã (t) + w(t), (where y Ã (t) is the noiseless term and w(t) the noise). It follows that (1) can be rewritten as: with initial conditions y Ã (ℓ), ℓ = 0, 1, Á Á Á, k − 1.
Goal. Discover the fundamental oscillations inherent in the gene data, using these models and reduced versions thereof. Processing of the data with the pencil method. The data y 1 , y 2 , Á Á Á, y N , are used to form the Hankel matrix: This quadruple constitutes the raw model of the data. This model is linear, time-invariant and discrete-time with a non-zero initial condition: Reduced models and fundamental oscillations. The dominant part of the raw system is determined using a model reduction approach [4], [5], [6], [3]. The procedure is as follows.
Pencil procedure for obtaining dominant sub-models.
• Compute the SVDs: ; ½u 2 ; s 2 ; v 2 ¼ svdð½E; AÞ: • Choose the dimension r of the reduced system (e.g r = 3, r = 5, r = 7 etc.). Then are used to project the raw system to the dominant subsystem of order r: The associated reduced model of size r is then: Assuming (as is usually the case) that E r is invertible, the approximated data can be expressed as:ŷ Estimating r. Important byproducts of the pencil method are the singular values s 1 and s 2 mentioned above. The accuracy of the approximation is determined by the first neglected singular singular value σ r+1 , as the resulting approximation error is proportional to this singular value. This implies the following rule.
Rule: choose r so that s r s 1 < , where is a tolerance which depends on the data at hand. For instance = 0.01, implies roughly speaking that data contributing less than 1% to the overall result are discarded. In this regard the following remark is in order. The data considered in this paper are rather short-duration and therefore in many cases we have not truncated the data.
Partial fraction expansion of the associated transfer function. H r (z) = C r (z E r − A r ) −1 B r . This involves the eigenvalue decomposition (EVD) of the matrix pencil (A r , E r ), or equivalently of E À 1 r A r ; let where the columns of V r = [v 1 , Á Á Á, v r ] are the eigenvectors, Λ r = diag[λ 1 , Á Á Á, λ r ] are the eigenvalues of the reduced system (poles of H r (z)), and ½v 1 ; Á Á Á ;v r are the rows of V À 1 r . The approximate data can be expressed as: where P i ¼ ½Cv i ½v i B, is the complex amplitude of the i th , oscillation; expressing this in polar form P i ¼ a i e jy i , α i is the real amplitude and θ i the phase. Finally, if we express the eigenvalues as l i ¼ e s i þjo i , σ i is the decay (growth) rate, and ω i the frequency, of the i th oscillation. Poles and oscillations. Often in (digital) system theory, the quantity l i 2 C is referred to as pole of the associated system. Oscillatory signals result when σ i = 0, which it turn implies that the magnitude of the pole λ i is equal to one: jλ i j = 1, and the period of oscillation is For instance a signal with λ i = 1, represents a constant (step), while signals with l i ¼ e j p 12 , l i ¼ e j p 6 (which are both on the unit circle with angles 15˚, 30˚degrees) represent pure oscillatory signals with periods 24, 12 hours respectively.
Angle between signals and orthogonality. In the sequel we will make use of angles between signals. Here we briefly define these concepts. Given discrete-time finite duration signals (vectors) their inner product is defined as where (Á) Ã denotes complex conjugation and transposition; the angle between these signals is defined as where kÁk denotes the Euclidean 2-norm. Orthogonality means that the angle between the two signals is p 2 , or equivalently that their inner product is zero; this is sometimes denoted by a ? b. In the sequel we also make use of the symbol to indicate approximate orthogonality, i.e. an angle between signals close to p 2 radians or 90˚degrees.

Other methods
To complete the picture, we briefly list other methods which can be used to analyze the gene data.
MUSIC. The MUSIC algorithm [7], [8], is a parametric spectral estimation method based on eigenvalue analysis of a correlation matrix. It uses the orthogonality of the signal subspace and the noise subspace to estimate the frequency of each oscillation. It assumes that a set of data can be modeled as is the transpose of a Vandermonde matrix, K is the number of dominant frequencies, and tains the amplitudes of the dominant K frequencies, n $ N ð0; s 2 n IÞ, is white noise. The autocorrelation matrix is where Λ = diag(λ i ) and M is the number of columns in the Hankel matrix. We can see that the rank of matrix ΓΛ 2 Γ H equals K where the nonzero eigenvalues are fl m g K m¼1 . Then the sorted eigenvalues of the autocorrelation matrix R xx can be expressed as l n ¼l n þ s 2 n ; n K; and s 2 n ; K < n N: It follows that the noise subspace contains the eigenvectors of the autocorrelation matrix R xx corresponding to the N − K smallest eigenvalues. Then The MUSIC algorithm can only provide the frequency information of the signal. To obtain the amplitude of each oscillation, we need to apply least squares fitting, where the amplitudes of dominant oscillations satisfy a = (Γ H Γ) −1 Γ H x. It should mentioned that in contrast with the pencil method, MUSIC cannot provide the decay (growth) rate of the oscillations.
ESPRIT. This is another parametric spectral estimation algorithm [7], [8]. It analyzes the subspaces of the correlation matrix. It estimates the poles relying on rotational transformation. As in MUSIC: where z j are the poles. We can construct Γ 1 = Γ(1: N − 1,:), and Γ 2 = Γ(2: N,:). The relationship between these two quantities is , is the phase shift matrix that represents a rotation. Now we construct a similar structure applying on signal subspace S that contains the eigenvectors of the autocorrelation matrix R xx corresponding to the K largest eigenvalues. Let Note that the relationship between S 1 and S 2 is S 2 = S 1 C. Because Γ and S have the same column space (see [7,8]), we have that Γ = ST, where T is an invertible subspace rotation matrix. So we have C = T -1 FT. Therefore the poles are the eigenvalues of C. Finally least square (LS) to obtain The eigenvalues of C, are the poles z i ¼ e jo i þs i . Thus ESPRIT can estimate both the frequency and the decay (growth) rate of the oscillations. However, as with MUSIC, we need to use LS to obtain the amplitude of each oscillation.
Wavelet transform. Wavelet transforms can be divided into two categories, the continuous (CWT) and the discrete (DWT) versions. CWT is more suitable for analyzing biologic rhythms because of the associated heat maps are two-dimensional.
In CWT a time signal x(t) is convolved with a wavelet function. This leads to a time-frequency representation which provides spectrum information in a local time window. This transform can be expressed as W c ðt; sÞ ¼ where s is the frequency scale, ψ Ã (t) is the wavelet function. Since the signal data is obtained by sampling, we can approximately rewrite the equation as W c ðt; sÞ ¼ P 1 n¼À 1 c Ã nÀ t s À Á xðnÞ. It follows that the integral or sum is applied on the range −1 to 1 that means the domain of signal x(t) or x(n) should be the range from −1 to 1. But the signals considered have finite length, in which case the edge effects become obvious, especially in the low-frequencies.
In practice, there are many wavelet functions that can be chosen, both real-and complexvalued. Real-valued wavelets are useful for treating peaks and discontinuities of signals while complex-valued wavelets yield the information of amplitude and phase simultaneously [9].
Statistical methods. In this section three statistical methods, namely ARSER, JTK_CYCLE and RAIN, will be investigated and their ability to detect biological rhythms evaluated. Those methods focus on the (one) most dominant oscillation in the data, especial JTK_CYCLE and RAIN. These constitute statistical tests that calculate the p-value to determine whether a certain rhythm exists in the data [10][11][12].
ARSER. ARSER uses the autoregressive (AR) model to obtain the period of oscillation. It then uses linear regression (harmonic) to determine the amplitude and the phase of the oscillation. Finally applying the F-test to pre-processed data and regressive data determines whether an oscillation exists.
Pre-processing the data. Because the data may not be stable, ARSER applies linear detrending to the raw data. It then uses linear regression to fit the data as a straight line. Subsequently ARSER uses a fourth-order Savizky-Golay algorithm to smooth the data. This low-pass filter removes the pseudo-peaks in the spectrum.
Finding the period. ARSER uses an autoregressive model to get the period of the oscillation. Given a pre-processed dataset fy t g N t¼1 with period interval Δ.
where t is white noise, α i are AR coefficients, n is the order of model (we choose n = lengthof-data/Δ). To calculate the coefficients, ARSER uses the Yule-Walker method, maximum likelihood estimation and the Burg algorithm. After AR modeling, ARSER can calculate the spectrum: where s 2 is the variance of white noise. ARSER finds the peaks in time window t 2 [20, 28] as the periods {T i } the oscillation (the optimal periods are determined by Akaike's information criterion).
Harmonic Regression. Now we can express the pre-processed data as: where β i1 and β i2 are the amplitudes. ARSER calculates those amplitude through linear regression. F-test. Using the F-test compares the approximation data fx t g and pre-processed data {x t }. The null and the alternative hypotheses are respectively where A i are the amplitudes which are calculated using linear regression, and r is the number of coefficients obtained by linear regression. We can calculate the F coefficient by: Then we can calculate the p value using the F-distribution p = P(F, r − 1, N − r), where P(Á) is the probability function used to calculate the p value based on F-distribution. JTK_CYCLE and RAIN. JTK_CYCLE and RAIN use statistical method to detect the trend in data. The former can find the increasing or decreasing trend in data and RAIN is a development of JTK_CYCLE which can combine these two.
A periodic waveform should start from the trough and increase to the peak following a decreasing part to a new trough. Because our data is sampling from the waveform, we can regard every time sampling data point as a variable. Thus we can get n variables fF i g n i¼1 for the waveform such that T = nΔ (T is the period of the waveform, Δ is the time interval of sampling point). We assume the variances of those variables are the same. And they have the same mean value only when the data only have noise without periodic oscillation. So the null and the alternative hypotheses are The alternative hypotheses for RAIN is Calculating the statistical coefficient of trend. Every variable F i , corresponds to a sampling dataset fX ij g m i j¼1 , where m i is the number of sampling data point of the i th variable ( P n c¼1 m c ¼ N). Let q i k ;j l ¼ 1 if X ik X jl , and 0 otherwise; and U ij ¼ P m i k¼1 P m j l¼1 q i k ;j l , which is the Mann-Whitney Ustatistic for comparison of two variables. For JTK_CYCLE, the statistical coefficient of trend is Thus G(z) for JTK_CYCLE and RAIN are both polynomials. We can get the distribution f(i) by calculating the coefficients of G(z), which can be used in the p-value equation.

Experimental results: Artificial data
In this section we test the performance of different methods using artificially generated signals. For the continuous wavelet transform, we chose the complex morlet wavelet because it allows changes to the resolution in frequency and time domain. For simulation data, we assume the data has the form where w is white noise with zero mean and variance σ 2 and f i is the i th oscillation, where: where A i is the amplitude, σ i is the decay (growth) rate, θ i is the phase and T i is the period. At first we assume that the samples are collected in unit time intervals. The parameters are defined in the table below; the first oscillation is almost constant with small decay; the other three oscillations have a period of approximately 24-12-and 8-hours (see Table 1). The experiment has the following parts. First, the sensitivity to noise is investigated. Here, the variance of noise is changed and the performance of each of the different methods is examined. Second, the impact of the length of the data is investigated. Finally, the frequency of data collection (can be referred to as sampling frequency) will be examined.
Recall that the Nyquist sampling theorem provides the lower bound for the sampling frequency in order to prevent aliasing. This can be used to determine appropriate sampling frequencies for continuous-time signals.
Sensitivity to noise. To test the sensitivity of these various methods to noise, we set the standard deviation of w as σ = [0, 0.03, 0.1, 0.3].  ESPRIT and MUSIC methods respectively. This figure shows that the pencil and ESPRIT methods yield a perfect fit in all situations. The MUSIC algorithm gives a good fit only for small amounts of noise. In Table 2, we display the poles obtained by using each method. In Fig 2, the heat map of the wavelet transform is shown. It follows that yellow region is such that we cannot distinguish two oscillations with close periods. We can recognize 12h and 8h oscillations when the noise is weak. However when the noise is strong (σ = 0.3), only the strongest oscillation can be determined. The edge effect is obvious and there are ghost lines e.g. around 15h, that may lead to false estimation.
From these considerations, we conclude that the pencil and ESPRIT methods are robust to noise. This is not the case for MUSIC and CWT.  Analysis of oscillations in gene transcription magenta are the estimated poles using the pencil, ESPRIT and MUSIC algorithm, respectively. For more accuracy, the poles are also listed in Table 3.

Rate of data collection (sampling frequency).
To investigate the impact of sampling of the underlying continuous-time signal, we generate artificial data with L = 50. Then we apply all methods to the original dataset, the half-data set (time collection interval I = 2) and thirddata set (that is 1, 4, 7, 10 Á Á Á with time collection interval I = 3). In Fig 4, the left-hand side plot below shows heat maps (Y-axis is frequency domain, X-axis is time domain) of simulation data (noise standard deviation 0.05) with duration L = [30, 50, 100, 200]. The right-hand side plot shows data fit for the various methods.
Conclusion. From the above considerations it follows that decreasing the sampling frequency does not affect the estimation significantly. This means that the data rate collection (sampling frequency) is not an important factor. In contrast, the data length is a crucial factor for all methods.

Experimental results: The pencil method applied to gene data
In this section we analyze a small part of the measured data in order to validata some of the aspects of the pencil method and its comparison with the other methods.

Batch consisting of 171 measurements every 40min
The results in this case are summarized in Table 4 and Fig 5 (S1 File. DATA 171 is a 10 x 171 matrix; the first row contains time; the remaining rows contain the measurements taken from 9 mice.) Batch consisting of RER for restrictively fed mice (218 meas. every 40min) (see Table 5 and S2 File. DATA 218 is a 10 x 218 matrix; the first row contains time; the remaining rows the measurements taken from 9 mice.). Fig 6 shows the approximation by 1, 2 and 3 oscillations (upper pane) and the first four fundamental oscillations (lower pane). Table 6 shows the error and the angles (S3 File. DATA 15 is a 15 x 48 matrix; each row corresponds to a different gene; time runs from 1 to 48 hours).
We analyze the relationship among the decomposed oscillations, by calculating the angle among these oscillations for 10 different genes. We set r = 9, i.e. the gene signals contain four Table 7 (S4 File. DATA 10 is a 10 x 48 matrix; each row corresponds to a different gene; time runs from 1 to 48 hours.) From the above tables, we can see that the angle between oscillations is around 90˚in most situations. So oscillations are nearly orthogonal: It has actually been shown in [13] that these oscillations are independent of each other.
Batch consisting of various measurements using mice-38 min intervals (see Table 8 (S4) and Table 9 as well as Fig 7 (S5 File. DATA 186 is a 6 x 186 matrix; the first row contains time; the rest represent: food intake, ambulatory activity, total activity, ZTOT and heat.)

Analysis of oscillations in gene transcription
Variation of data collection rate. We compare the oscillations using all data (AD), the first half of the data (FHD), the second half of the data (SHD), odd-position data (OD), and even-position data (ED). This is done for a particular set of measurements, but the results are indicative of what happens in general. Table 10 shows the estimated periods using different part of the data. It follows that the estimation of periods is consistent using AD, FHD, SHD.
Discussion and comments 1. Orthogonality. Recall the definition of angle between signals defined by (6), and let the original vector of measurements for one gene be denoted by y 2 R N ; let also f i , i = 0, 1, 2, 3,    2. Interpretation of orthogonality. Orthogonality means that once an oscillation (e.g. the circadian or the 12h rythm) has been determined, further computations will not affect these oscillations. In other words the fundamental oscillations are independent of each other.
3. Manifestation of orthogonality. As we determine higher-order approximants, i.e. as we add oscillations to the model, the existing ones remain mostly unchanged. Considering the case of the para probe1 gene, we apply the ESPRIT, LS (Prony's) and pencil methods. The statistical methods (e.g. ARSER) are not used because being non-parametric they do not allow the choice of the order of fit. ESPRIT and LS are not reliable for large orders of fit, Analysis of oscillations in gene transcription therefore the results for the 24-fit model is not shown. The poles of these three methods are depicted in Tables 11, 12 and 13. 4. Connection with the Fourier transform. The above method provides an almost orthogonal decomposition of a discrete-time signal. The question arises therefore as to whether the same or improved results can be obtained using the Fourier transform and in  . Therefore unless the frequencies of the underlying oscillations are exactly among the ones above, the results of the DFT are not useful.

The least squares (Prony's) method.
This method is not appropriate for cases where the poles are on or close to the unit circle (pure or almost pure oscillations). Fig 8 depicts this fact in the case of the RER data. The conclusion is that while the matrix pencil method (red dots) gives oscillatory poles, this is by far not the case with the LS (prony's) method (green dots).

Final result
We considered a dataset consisting of 18484 genes; transcription is analyzed using the pencil method [3], the ESPRIT method, Prony's method and the three statistical methods. The distribution of the poles follow; recall that the poles of ideal oscillations have magnitude equal to 1. Furthermore the DFT and wavelet methods are also not competitive. Fig 9 shows that the pencil method has uncovered real oscillations, since the mean of the magnitude of all poles is 1.0058 and the standard deviation is 0.0010. The ESPRIT method follows in terms of discovering oscillations, while the Prony or LS (least squares) method and the three statistical methods give weak results. As explained above the main drawback of the ESPRIT method is that it has nothing to say about the orthogonality of the oscillations, which proves to be a key outcome of the pencil method.
A key consequence of the matrix-pencil approach is the demonstration of orthogonality of the different oscillatory components, in particular the 24-hour and the 12-hour cycles. This points to an independence of these oscillations. This assertion has been subsequently confirmed in the laboratory experiments reported in [13].
This analysis demonstrates the applicability of signal processing methodologies to biological systems and further shows the ability of the matrix pencil decomposition to demonstrate independence of biological rhythms.