How one might miss early warning signals of critical transitions in time series data: A systematic study of two major currency pairs

There is growing interest in the use of critical slowing down and critical fluctuations as early warning signals for critical transitions in different complex systems. However, while some studies found them effective, others found the opposite. In this paper, we investigated why this might be so, by testing three commonly used indicators, the lag-1 autocorrelation, the variance, and the low-frequency power spectrum, for anticipating critical transitions in the very-high-frequency time series data of the Australian Dollar-Japanese Yen and Swiss Franc-Japanese Yen exchange rates. Besides testing rising trends in these indicators at a strict level of confidence using the Kendall-tau test, we also required statistically significant early warning signals to be concurrent in all three indicators, which must also rise to appreciable values. We then found, for our data set, the optimum parameters for discovering critical transitions, and showed that the set of critical transitions found is generally insensitive to variations in the parameters. Suspecting that negative results in the literature are the result of low data frequencies, we created time series with time intervals spanning three orders of magnitude from the raw data, and tested them for early warning signals. Early warning signals can be reliably found only if the time interval of the data is shorter than the time scale of critical transitions in the complex system of interest. Finally, we compared the set of time windows with statistically significant early warning signals against the set of time windows followed by large movements, to conclude that the early warning signals indeed provide reliable information on impending critical transitions. This reliability becomes more compelling statistically the more events we test.


Introduction
Since Scheffer et al. published their 2009 [1] and 2012 [2] reviews on early warning signals (EWSs) preceding regime shifts, there has been an explosion in the number of papers on this topic for various complex systems. In Table 1, we summarized EWS papers published between 2014 and 2017, showing the types of complex systems they were dealing with, and whether they could observe the various EWSs. While most of these papers successfully detected EWSs, some could not.

Pre-processing
We first imported the raw data from the Tick History comma-separated value (CSV) files into Matlab data structures. We then read through the ticks, and extracted exchange rates at fixed time intervals T0 that we can specify (see Text A in S1 Protocol for the Matlab script), to obtain our time series data. From the theory of critical transitions [1,2], we know that as we approach a tipping point, not only will we observe long-term trends in the slow variables, we will also detect a slowing down in the fluctuations of the fast variables. When both effects are present, it is difficult to reliably interpret the EWIs. It is therefore important to first remove the long-term trends from the time series data. The simplest way to de-trend is to use a rolling window. However, the local trends obtained this way do not change smoothly from one rolling window to the next. Therefore, we used a Gaussian kernel to smooth the data [4,23,27] (see Text B in S1 Protocol for the Matlab script). It is also possible to use the LOESS method of non-parametric local regression [14], or more sophisticated methods such as the de-trending algorithms used in detrended fluctuation analysis (DFA) [35] and empirical mode decomposition [36]. A systematic comparison of the performance of different de-trending methods is outside the scope of this paper.
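Gaussian-kernel de-trending can be sketched in a few lines. The snippet below is a minimal Python illustration of the smoothing step, not the Matlab script of S1 Protocol; the synthetic series and the bandwidth value are ours.

```python
import numpy as np

def gaussian_detrend(x, bandwidth):
    """Smooth a series with a Gaussian kernel and return (trend, residue).

    `bandwidth` is the kernel width sigma, in units of the sampling
    interval T0. Each trend value is a Gaussian-weighted average of the
    whole series, centred on the current time index.
    """
    n = len(x)
    t = np.arange(n)
    w = np.exp(-0.5 * ((t[:, None] - t[None, :]) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)   # normalise each row of weights to 1
    trend = w @ x
    return trend, x - trend

# Toy example: a slow oscillation plus fast noise
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 400)
series = np.sin(2 * np.pi * t) + 0.05 * rng.standard_normal(400)
trend, residue = gaussian_detrend(series, bandwidth=40)
```

In practice the bandwidth is chosen relative to T0, e.g. σ = 100 T0 as used for Fig 1(A).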
In Fig 1(A), we show the T0 = 15 s time series for AUD-JPY between 11:15 and 17:00 on Oct 6, 2008 as a red solid curve, and the trend obtained after smoothing with a Gaussian kernel of bandwidth σ = 100 T0 as a blue dashed curve. We then show in Fig 1(B) the blue residue time series obtained by subtracting the Gaussian-smoothed trend from the exchange rate. Our EWS analysis in the rest of the paper is based on this residue time series.

Early Warning Indicators (EWI)
After removing the long-term trends, we tested the residue time series (see Fig 1(B)) for critical slowing down. This was done by calculating three EWIs: (1) the lag-1 autocorrelation

AC(1) = Σ_{i=1}^{N−1} (x_i − x̄)(x_{i+1} − x̄) / Σ_{i=1}^{N} (x_i − x̄)²,

where x̄ is the mean of the sequence (x_n), and N is the total number of elements in the sequence; (2) the variance (Var) of the sequence (x_n); and (3) the low-frequency power spectrum (LFPS). Given a sequence (x_n), we define its discrete Fourier transform as

X_k = Σ_{n=0}^{N−1} x_n e^{−2πikn/N},

where k is an integer from 0 to N − 1. The power spectrum P_k = |X_k|², k = 0, . . ., N − 1, is then normalized so that its sum is 1. Finally, the LFPS is calculated as the power residing in the first 6% of the elements of the sequence P_k. The LFPS of a sequence (x_n) thus measures the weight of the low-frequency part of the power spectrum.
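The three EWIs can be computed directly from their definitions. The sketch below is in Python rather than the Matlab of S1 Protocol; it assumes the 6% low-frequency cutoff from the text, and the AR(1) usage example is ours.

```python
import numpy as np

def ews_indicators(x, low_frac=0.06):
    """Return (AC(1), Var, LFPS) for a residue window."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x - x.mean()
    # (1) lag-1 autocorrelation
    ac1 = np.sum(d[:-1] * d[1:]) / np.sum(d ** 2)
    # (2) variance
    var = np.mean(d ** 2)
    # (3) fraction of normalised power in the first 6% of DFT components
    p = np.abs(np.fft.fft(x)) ** 2
    p /= p.sum()
    k = max(1, int(np.floor(low_frac * n)))
    lfps = p[:k].sum()
    return ac1, var, lfps

# White noise vs. a strongly autocorrelated AR(1) process
rng = np.random.default_rng(1)
noise = rng.standard_normal(1000)
ar1 = np.empty(1000)
ar1[0] = noise[0]
for i in range(1, 1000):
    ar1[i] = 0.9 * ar1[i - 1] + noise[i]
ac1_white, _, _ = ews_indicators(noise)
ac1_ar, var_ar, lfps_ar = ews_indicators(ar1)
```

As expected from critical slowing down theory, the AR(1) series shows a much larger AC(1) and a heavier low-frequency weight than the white noise.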

Testing for significant EWSs
Increasing trends. When we slide a rolling window of length R_win (corresponding to a window duration of T1 = R_win · T0) over the entire residue time series with rolling step R_step (set to one third of R_win), we create the time series of indicators. For all three indicators, an EWS corresponds to an increasing trend in the indicator values. Therefore, we test EWSs for statistical significance within rolling windows of indicators of length R_ind (corresponding to a window duration of T2 = (R_ind · (R_step − 1) + R_win) · T0), sliding with rolling step 1 along the time series of each of the three indicators.
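The rolling-window construction of the indicator time series can be sketched as follows; the helper name is ours, and np.var stands in for any of the three indicators. R_step = R_win/3 follows the text.

```python
import numpy as np

def rolling_indicator_series(residue, r_win, indicator):
    """Slide a window of length r_win, with step r_win // 3, over the
    residue series and apply `indicator` to each window."""
    r_step = max(1, r_win // 3)
    starts = range(0, len(residue) - r_win + 1, r_step)
    return np.array([indicator(residue[s:s + r_win]) for s in starts])

rng = np.random.default_rng(2)
res = rng.standard_normal(300)
# 300 residue points, R_win = 60, R_step = 20 -> 13 indicator values
var_series = rolling_indicator_series(res, r_win=60, indicator=np.var)
```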
Within each rolling window of length N = R_ind, we calculate the Kendall-tau value, also known as the Kendall rank correlation coefficient [37],

τ = (N_concordant − N_discordant) / (N(N − 1)/2),

where N_concordant is the total number of concordant pairs and N_discordant is the number of discordant pairs. For t1 < t2, the pair (x_{t1}, t1) and (x_{t2}, t2) is said to be concordant if x_{t1} < x_{t2}, and discordant if x_{t1} > x_{t2}. To see how this works, let us consider the ordered series (2, 4, 3, 8). Except for 4 coming before 3, the series has an increasing trend. To see this using the Kendall-tau coefficient, we note that according to the definition, there are 5 concordant pairs, (2,4), (2,3), (2,8), (4,8), (3,8), and 1 discordant pair, (4,3). In total there are 4(4 − 1)/2 = 6 pairs. In this example, the Kendall-tau coefficient is (5 − 1)/6 = 2/3, which is fairly large. In general, a time series with a strong increasing trend will have a high Kendall-tau coefficient.
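The worked example can be checked with a direct implementation of the definition (a sketch; the function name is ours):

```python
def kendall_tau(x):
    """Kendall-tau of a sequence against time:
    (concordant - discordant) / (N(N - 1)/2)."""
    n = len(x)
    concordant = sum(1 for i in range(n) for j in range(i + 1, n) if x[j] > x[i])
    discordant = sum(1 for i in range(n) for j in range(i + 1, n) if x[j] < x[i])
    return (concordant - discordant) / (n * (n - 1) / 2)

# The series from the text: 5 concordant pairs, 1 discordant pair -> 2/3
tau = kendall_tau([2, 4, 3, 8])
```

A strictly increasing series gives τ = 1, and a strictly decreasing series gives τ = −1.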
To determine the statistical significance of the Kendall-tau value of a given rolling window of indicators, which has R_ind indicators corresponding to R_ind · (R_step − 1) + R_win data points in the residue time series, we first reshuffle these residue data points to create a null-model residue time series that has the same mean and variance as the subject time series, but whose time ordering is completely destroyed. Let us point out that, normally, one tests the Kendall-tau of the indicator time series for statistical significance by reshuffling the indicator time series. By reshuffling the residue time series instead, we are making the significance test stricter. We repeat this procedure 1000 times to create a histogram of 1000 Kendall-tau values for the null model. The p value of the subject Kendall-tau is then the fraction of null-model Kendall-tau values that are greater than the subject Kendall-tau value. For the purpose of this paper, if p ≤ 0.05, we regard the EWS in this time interval as significant.
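The reshuffling test can be sketched as below. This is an illustration, not the S1 Protocol script: the function names and toy data are ours, and we use 200 surrogates instead of 1000 to keep the example fast.

```python
import numpy as np

def kendall_tau(x):
    """(concordant - discordant) / (N(N - 1)/2), x against time."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[j] - x[i])
    return s / (n * (n - 1) / 2)

def surrogate_p_value(residue, indicator_fn, r_win, r_step, n_surr=200, seed=0):
    """Shuffle the residue series (not the indicator series) to build the
    null distribution of Kendall-tau values, as the text prescribes."""
    def tau_of(series):
        starts = range(0, len(series) - r_win + 1, r_step)
        return kendall_tau([indicator_fn(series[s:s + r_win]) for s in starts])

    observed = tau_of(residue)
    rng = np.random.default_rng(seed)
    null_taus = [tau_of(rng.permutation(residue)) for _ in range(n_surr)]
    # p value: fraction of null-model taus exceeding the observed tau
    return observed, np.mean([t > observed for t in null_taus])

# Residue with steadily rising variance: a clear EWS in the Var indicator
rng = np.random.default_rng(3)
res = rng.standard_normal(200) * np.linspace(0.5, 2.0, 200)
obs_tau, p = surrogate_p_value(res, np.var, r_win=40, r_step=13)
```

Here the rising variance produces a large observed tau and a small p value, whereas the shuffled surrogates show no trend.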
Concurrence. A period with one statistically significant EWI points to an impending critical transition. However, the other EWIs may not be statistically significant over the same period, or they may be statistically significant over slightly different periods. Since it is possible for a statistically significant EWS to be a false positive, we can reduce the false-positive rate by requiring all three EWIs to be statistically significant over the same overlapping period. With this concurrent set of EWIs, the probability of the overlapping period being a statistical false positive should be significantly reduced.
Endpoint. Sometimes we encounter situations where the rising trend of the indicators is statistically significant, but the indicator values remain small at the end of the T2 time window. We show in Fig 2 the magnitude of the last indicator value in a T2 time window, which we call the endpoint of the indicator. If the endpoint is small, we do not expect to find a critical transition shortly after the EWS, even if the rising trend is significant. We expect a critical transition only if the rising trend is statistically significant and the endpoint is large.
To decide whether an endpoint is large or small, we build the histogram, shown in Fig 3, of the endpoints of T2 rolling windows over the entire time series. The 'historical p value' of the endpoint of an EWS candidate is the percentage of endpoints in the histogram that are larger than it. Only EWS candidates whose endpoints have the lowest historical p values are retained. A more careful reliability analysis will be presented at the end of the Results and Discussion section.
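The historical p value is simply an empirical tail fraction. A sketch, with a stand-in endpoint record in place of the real histogram of Fig 3:

```python
import numpy as np

def historical_p_value(endpoint, all_endpoints):
    """Fraction of endpoints in the historical record that exceed the
    given endpoint: small values mean an unusually large endpoint."""
    return np.mean(np.asarray(all_endpoints) > endpoint)

hist = np.linspace(0.0, 1.0, 101)          # illustrative endpoint record
p_large = historical_p_value(0.98, hist)   # near the top of the record
p_small = historical_p_value(0.02, hist)   # near the bottom
```

An endpoint near the top of the record gets a small historical p value and survives the cut; one near the bottom does not.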

Choice of parameters and sensitivity analyses
Choice of parameters. In this study, the parameters we have freedom to adjust are summarized in Table 3.
Sensitivity analyses. We performed two sets of sensitivity analyses in this paper. In the first, we determined the optimal combination of parameters to detect EWSs of large movements in the FOREX market.
To do this, we must identify the events we sought to forecast. In order to quantitatively pick out sudden shifts in the exchange rate, we consider a time period Y starting from the end of the T2 rolling window to half a day afterwards, and define the maximum spread as

y_ms = max(E_0 − E_min, E_max − E_0) / E_0.

Here, E_0 is the exchange rate at the beginning of time period Y, E_min is the minimum exchange rate within Y, and E_max is the maximum exchange rate within Y. Basically, y_ms measures the most extreme exchange rate variation within Y, relative to its starting value, allowing variations in either direction: the numerator is E_0 − E_min if this is larger than E_max − E_0, and E_max − E_0 otherwise. Higher values of y_ms correspond to more extreme exchange rate variations within Y. To examine the performance of a given combination of parameters, we first created the sets A, B1, B2, and C shown in Fig 4. The corresponding 90th-percentile and 95th-percentile values of y_ms are denoted y_ms10 and y_ms5 respectively (shown as the vertical blue line and the vertical red line in Fig 5(A)). Based on the intersections C ∩ B1 = {y_ms ∈ C | y_ms > y_ms10} and C ∩ B2 = {y_ms ∈ C | y_ms > y_ms5} shown in Fig 5(B), we defined the 10% and 5% discovery rates of Set C as

DR_10 = card(C ∩ B1) / card(B1),    DR_5 = card(C ∩ B2) / card(B2),

where card() stands for cardinality, the number of elements in a set. We also defined the 10% and 5% specificities of Set C as

SP_10 = card(C ∩ B1) / card(C),    SP_5 = card(C ∩ B2) / card(C).

In this analysis, our objective was to choose parameters that maximize the discovery rates and specificities.
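Under our reading that 'relative to its starting value' means normalization by E_0, the maximum spread can be computed as follows (a sketch; the rate sequences are illustrative):

```python
import numpy as np

def maximum_spread(rates):
    """y_ms over a window Y: the larger of E0 - Emin and Emax - E0,
    relative to the starting rate E0."""
    rates = np.asarray(rates, dtype=float)
    e0 = rates[0]
    return max(e0 - rates.min(), rates.max() - e0) / e0

y_up = maximum_spread([100.0, 101.0, 104.0, 102.0])   # upward move dominates
y_down = maximum_spread([100.0, 96.0, 99.0, 98.0])    # downward move dominates
```

Both a 4% rally and a 4% crash give the same y_ms = 0.04, since variations in either direction count.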
In our survey of the literature on EWSs, we noticed that most studies confirmed EWSs preceding critical transitions, while other studies could not detect statistically significant EWSs. However, the quality of the data used in these analyses is highly uneven, in the sense that some studies used very-high-frequency data, whereas in others the data frequency was low. Because we had the good fortune of working with FOREX data at the highest frequency, we could create data samples spanning many orders of magnitude in data frequency. Therefore, in this second sensitivity analysis, we systematically tested the effect of data frequency (determined by T0) on the discoverability of a subset of very obvious true positives. A true positive is discoverable at a given data frequency if there are statistically significant EWSs preceding it. In this analysis, we fixed the largest window size T2 and kept the window duration T1 = R_win · T0 constant, so that the number of indicator values used for significance testing (R_ind) is fixed at 10, as we increased T0 from the optimum (15 s or 30 s) to the very large value of 6 hr.

Results
In Figs 6-8, we show the statistically significant EWSs obtained from the individual EWIs (see Texts C, D, and E in S1 Protocol for the Matlab scripts), with the historical p values of their endpoints set to p ≤ 0.025 (see Text F in S1 Protocol for the Matlab script), compared to the concurrent EWSs (see Text G in S1 Protocol for the Matlab script) at the same historical p value, for the three different data sets. In these three figures, we also show the concurrent EWSs for a historical p value of the endpoints set to p ≤ 0.06, to illustrate how more statistically significant EWSs can be included. The parameters used to detect the EWSs are the optimal combinations for the three data sets. We will explain shortly how these optimal parameter combinations were obtained.
In Fig 7, the numbers of statistically significant EWSs predicted by the three indicators are roughly equal. Also, the statistically significant EWSs predicted by Var mostly occur at similar times to those predicted by AC(1) and LFPS. However, in Figs 6 and 8, even though the numbers of statistically significant EWSs predicted by Var are roughly the same as those predicted by AC(1) and LFPS, those predicted by Var are concentrated in a small number of time periods. We believe this is because the variations of AC(1) and LFPS lie within a narrow band of values, whereas the variations of Var can span many orders of magnitude. Therefore, a strict historical p value for the endpoint of Var restricts the discovery of statistically significant EWSs to only those periods with very high variance. From Figs 6-8, we see that the effect of relaxing the historical p value for the endpoints is that the additional concurrent EWSs included are mostly close to those already included at the stricter historical p value. This gives us confidence that the EWSs are indeed consistent precursors of actual critical transitions. In fact, the bunching up of EWSs seen in the figures is consistent with the general pattern of flickering critical transitions being preceded by foreshocks and followed by aftershocks. More importantly, the sharpest decline in the AUD-JPY exchange rate on 6 Oct 2008 in

Optimal combination of parameters
The sets of EWSs discovered depend on the parameter combinations used. Therefore, we performed sensitivity analyses in which the parameters were sequentially optimized for high discovery rates and specificities, as shown in Tables A, B, and C in S1 Appendix. From these tables, we obtained the optimal parameter combinations for the three data sets. Varying the parameters about the optimal combinations in Tables A, B, and C in S1 Appendix produces less than 1% change in the discovery rate (DR5) and specificity (SP5) (see Table D in S1 Appendix). The discovery rate and specificity are most sensitive to changes in R_ind and R_win, although the percentage changes are still small.

Effects of increasing time interval
Following this, we turned our attention to the key question of this paper: whether EWSs can always be detected in lower-frequency data. In this analysis, we focused on increasing the time interval from 15 s to 6 hr (see Table 4), checking whether the EWSs discovered at the optimal T0 (15 s and 30 s) were also discovered at longer time intervals.
For the FOREX market, whose dynamical time scale is of the order of 1 to 5 seconds, and where the largest crash is over in a matter of 10 to 15 minutes (see Fig 1(A)), it is surprising that we could even have semi-reliable EWSs with data intervals up to 2 minutes! When we zoom in to Fig 7 for a closer look, we find that there were up to 3 days of EWSs before the largest crash on 6 Oct 2008. Going through the parameter combinations in Table 4, we inferred that these signals were fully captured by the last rolling window, and partially captured by the second-to-last rolling window. This means that of the ten indicator values that went into the Kendall-tau test, only the last two contained contributions from the actual EWSs. For time intervals beyond 2 minutes, the 3 days' worth of EWSs were likewise captured only by the last two rolling windows. But because fewer data points containing early warning information were sampled during these 3 days, the signal-to-noise ratio becomes smaller. Therefore, we deduced that the deterioration of EWSs at larger time intervals is the result of undersampling of residue data points within the EWS periods. In other words, low data frequency can significantly compromise the performance of EWSs. As a caveat, let us note that in this test, we used unusually large 600-hr rolling windows for all time intervals T0. This was to accommodate the largest time interval of 6 hr included in the test. Technically, the EWSs presented here are not the most reliable, because they are obtained in a way that is far from ideal, resulting in only a few of them, sparsely distributed in time. This is unlike the robust consecutive EWSs within proper time periods obtained with the optimal combination of parameters.
Moreover, the 600-hr rolling window is much larger than the 3-day period of actual EWSs, which must therefore stand out against more noise from the rest of the rolling window. Additionally, to make comparisons, we also had to relax the criterion on the historical p value in order to detect a decent number of EWSs. Because of this, even the reliability of EWSs from residue time series at time intervals below 2 minutes is not as high as that of EWSs obtained with the optimal combinations. Nevertheless, the test convincingly shows that large time intervals (beyond 2 minutes) cannot produce reliable EWSs. The reliability of EWSs at time intervals within 2 minutes was not verified in this section; however, we do know from the optimal combinations in the previous section that the optimal time interval ranges from 15 to 30 seconds.

Reliability analysis
Finally, to quantify the performance of our EWSs, we examined the conditional probability for a large maximum spread to occur after an EWS, as well as that for a large maximum spread to occur without an EWS. To do so, we examined the maximum spreads y_ms at the end of every rolling window used for computing EWSs (defined by the optimal R_win and R_step) across the whole time period, and checked: (1) whether the maximum spread is within the top 5 percentiles, and (2) whether such a large maximum spread is preceded by at least one recent EWS, with p < 0.05 for the Kendall-tau and historical p < 0.04 for the endpoints. By 'recent', we mean that the EWS ended within the last 0.9 days (excluding weekends), even though it may have started much earlier. The maximum spreads y_ms are computed within a time window of 0.1 day starting from the end of every R_win rolling window. We chose the time between the end of the EWS and the end of the maximum-spread time window to be one day, as one day is a reasonable time to make a decision in this highly liquid FOREX market. To be fair, we used the same 0.1-day time window for large maximum spreads that are not preceded by a recent EWS.
From the pool of all R_win rolling windows, we estimate

P1 = (number of windows with a recent EWS that are followed by a top-5-percentile maximum spread) / (number of windows with a recent EWS),

P2 = (number of windows without a recent EWS that are followed by a top-5-percentile maximum spread) / (number of windows without a recent EWS).

If P1 = 1, all EWSs are followed by large (top 5 percentile) maximum spreads. This means that the EWSs provide very precise predictions of subsequent exchange rate movements. If P1 < 1, then some EWSs are not followed by large maximum spreads, so overall the EWSs are less precise. If we act on them to short the exchange rate in question, we may lose the opportunity to make a killing shortly afterwards, but we will not sustain unexpectedly large losses. On the other hand, there can also be large maximum spreads that occur in the absence of EWSs. We can incur large losses if we believe wholeheartedly that no EWSs mean no large maximum spreads afterwards. The proportion of such events, out of the set of cases with no recent EWSs, is given by P2. Across all indicators in all data sets, we found that P2 is at most 0.05. For the EWSs to provide reliable predictions, it is necessary to have P1 > P2. In fact, the larger the ratio P1/P2, the more confident we are of avoiding losses when we act upon the EWSs. Indeed, as can be seen from Figs 12-14 (see Text I in S1 Protocol for the Matlab script), the pool ratio P1/P2 averaged over all times is greater than 1 for all indicators in all data sets. In the worst case, for AC(1) of CHF-JPY, this ratio is 1.73, whereas in the best case, for Var of AUD-JPY (2005-2010), the ratio is 11.93. These performances indicate what we would have gained from acting on the EWSs all the time for the three data sets.
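Given per-window booleans for 'has a recent EWS' and 'is followed by a top-5-percentile maximum spread', P1 and P2 are simple conditional frequencies. A sketch with illustrative toy data (the array names and values are ours):

```python
import numpy as np

def conditional_rates(has_recent_ews, large_spread_follows):
    """Estimate P1 = P(large spread | recent EWS) and
    P2 = P(large spread | no recent EWS) from per-window booleans."""
    ews = np.asarray(has_recent_ews, dtype=bool)
    big = np.asarray(large_spread_follows, dtype=bool)
    p1 = big[ews].mean()     # precision of the EWSs
    p2 = big[~ews].mean()    # miss rate in the absence of EWSs
    return p1, p2

# Toy pool of 10 windows: 4 with a recent EWS, 6 without
ews = np.array([True, True, True, True, False, False, False, False, False, False])
big = np.array([True, True, False, False, False, True, False, False, False, False])
p1, p2 = conditional_rates(ews, big)
```

Here P1 = 2/4 and P2 = 1/6, so P1/P2 = 3: the EWSs carry information about impending large spreads.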
Since it is customary for traders to test a new strategy over a finite time period before adopting it, we also tested the reliability of the EWSs over various shorter time periods. For example, to test the reliability of the EWSs on the scale of 250 trading days, we created a statistical ensemble of 250-trading-day time periods with 100,000 random starting times. We then computed the histograms of P1/P2 and P1 over this ensemble, as shown in Figs 12-14 for the three data sets. The rates of P1/P2 exceeding 1 are high in most cases, the lowest being that of Fig 13(C), which is 0.66. This implies that the EWSs are informative most of the time, and perform better at predicting large maximum spreads than pure guessing. The expectation values of P1/P2 over the ensembles are close to their pool values, marked as black vertical lines, in the limit of large sample size (100,000). These expectation values are, moreover, larger than 1, meaning that on average the EWSs carry significant information for predicting large maximum spreads. The histograms of P1 ((b), (d), and (f) of Figs 12-14) show the distributions of the precisions of the EWSs, with their pool values marked as black vertical lines. From these, we can see that most pool values are larger than 0.1, with only two exceptions, in Figs 12(F) and 14(B). Note that in Fig 14, the bands are highly concentrated. This is because the CHF-JPY data set contains only 382 trading days, which is not enough to sample many 250-trading-day windows with random starting times. In comparison, AUD-JPY (1996-2004) and AUD-JPY (2005-2010) include 2608 and 1311 trading days respectively, which is large enough for this test.
To see how the performance of the EWSs, measured by the rate of P1/P2 exceeding 1, changes with the trial time period, we repeated the same sampling procedure with a growing time period, starting from 10 trading days up to 400 trading days in steps of 10 trading days. The results are shown in Fig 15 (see Text J in S1 Protocol for the Matlab script). From Fig 15(A) and 15(B), we see that the rates of exceeding 1 increase monotonically with the trial time period, and gradually approach the upper bound of 1, except for Var in Fig 15(B), which grows very slowly. This implies that in practice, the overall performance of the EWSs is expected to improve the longer they are tested. In Fig 15(C), we tested up to 280 trading days for CHF-JPY, since it contains only 382 trading days' worth of data. From Fig 15(C), we see an odd trend in the AC(1) curve from 150 trading days onwards, which might be a result of the small size of the CHF-JPY data set, limiting the variation of the starting times of the long-time-period samples.
Conditions for EWSs. We have shown in this paper that the statistically significant detection of EWSs is very sensitive to (1) the intrinsic early warning period of each extreme event, (2) the frequency of data points in the time series, and (3) the choice of test statistic by which the EWSs are judged statistically significant. If the intrinsic early warning period is too short, or the data frequency too low, we might end up with an insignificant value of the Kendall-tau, even if we have independent and reliable validation of the critical transition tested.
Working with the stringent Kendall-tau statistic reflects a desire by the early-warnings community to be strict about which events they can claim as critical transitions. The data frequency is frequently within our control: if the experimental method and cost permit, we can always collect more data points per unit time. However, the intrinsic early warning period, which is the period of time over which the complex system we study re-organizes and moves endogenously towards the critical transition, is something we may have little control over. Moreover, we have no theoretical justification that critical transitions of the same scale have similar intrinsic early warning periods. A large critical transition may thus be accompanied by a short early warning period, and we would then simply miss its early warnings.
The impact of accidental noise sequences. In this final subsection, we discuss how robust our conclusions are when there is noise in the time series data. The first question we would ask is how likely it is for us to observe a statistically significant EWS that is due entirely to random noise. In some sense, this is also the easiest question to answer: the probability of a purely random noise sequence producing a statistically significant EWS is given by the p value of our statistical test. In all our tests, which involve reshuffling the time series data to obtain a statistical ensemble of artificial data with no serial correlation in time, this probability is below 0.05, i.e. no more than 5% of the EWSs that we have identified can be due entirely to random noise.
The next question we might ask is how we can separate an accidental sequence of noise, which is meaningless, from an intrinsic trend, which is meaningful. One might worry that the two cannot be disentangled when we use high-frequency data, especially when the time scale over which the critical transitions occur is short. We made clear in the Effects of increasing time interval subsection that (1) the typical time scale over which extreme movements of the FOREX market occur is around 15 minutes, which is already 2 orders of magnitude longer than the time interval T0 we used in our analyses, and (2) the intrinsic trends that preceded large exchange rate variations, which we call the early warning periods, lasted up to 3 days. Again, this time scale is very much larger than T0. There is thus no worry that fluctuations at the scale of T0 will affect the conclusions we arrived at, because for this to happen, we would need the fluctuations to be accidentally correlated over thousands to tens of thousands of time steps, which is extremely unlikely.
The last question concerns more the correct identification of booms/crashes. This is a fair question to ask for a paper like ours, but is one that is extremely difficult to answer. In the stock market, there have been many attempts to define market crashes, but none of these definitions are universally accepted because they are not based on a mechanistic understanding of the market. In place of a rigorous definition, researchers have resorted to studying market crashes that are reported in the popular press. These are frequently the most pronounced crashes, and therefore are the least controversial. Many smaller crashes are likely to have been missed, because they are not picked up by financial news reporters.
In particular, with the advent of high-frequency algorithmic trading, flash crashes of the order of 10% in market value but lasting several minutes are not uncommon in major exchanges of the world. These are assumed to be due to glitches in the trading algorithms, but are poorly documented and studied. A similar problem plagues the FOREX market. Because of the shorter time scale on the FOREX market, one naturally expects many more booms and crashes in a given period of time. These events are rarely picked up by financial news reporters, so we do not even have a curated list of the most uncontroversial events to work with. This is why in this paper we used the 95 th percentile set of the maximum spread as a proxy for booms and crashes, because there is no ground truth we can obtain by alternative means.