A simulation and comparison of dynamic functional connectivity methods

There is a current interest in quantifying brain dynamic functional connectivity (DFC) based on neuroimaging data such as fMRI. Many methods have been proposed, and are being applied, revealing new insight into the brain’s dynamics. However, given that the ground truth for DFC in the brain is unknown, many concerns remain regarding the accuracy of proposed estimates. Since there exists many DFC methods it is difficult to assess differences in dynamic brain connectivity between studies. Here, we evaluate five different methods that together represent a wide spectrum of current approaches to estimating DFC (sliding window, tapered sliding window, temporal derivative, spatial distance and jackknife correlation). In particular, we were interested in each methods’ ability to track changes in covariance over time, which is a key property in DFC analysis. We found that all tested methods correlated positively with each other, but there were large differences in the strength of the correlations between methods. To facilitate comparisons with future DFC methods, we propose that the described simulations can act as benchmark tests for evaluation of methods. In this paper, we present dfcbenchmarker, which is a Python package where researchers can easily submit and compare their own DFC methods to evaluate its performance.

While these methods and their applications may offer new insights into the functions of the brain and cognition, it becomes difficult to compare results when different studies use different methods to estimate brain dynamics. Each method is often introduced and evaluated by the authors' own simulations, empirical demonstrations, and/or theoretical argumentations. However, apparent differences in dynamic connectivity in different studies may have been influenced, or even caused, by differences in the underlying methodology used to derive connectivity estimates.
In order to maximize reproducibility of reported findings, it is important that comparisons of proposed DFC methods can be made with a common set of simulations. To this end, we have developed four simulations that aim to show how well results from different DFC methods correlate with each other and evaluate their performance of tracking time varying covariance. The proposed methods and simulations are packaged in the Python package dfcbenchmarker, (available at github.com/wiheto/dfcbenchmarker). Researchers can evaluate their own DFC methods in dfcbenchmarker. The software also allows for new methods to be submitted to us for inclusion in future reports. Here we demonstrate the functionality and results obtained by dfcbenchmarker by evaluating the performance of the following five methods: sliding window (SW), tapered sliding window (TSW), spatial distance (SD), jackknife correlation (JC), and temporal derivatives (TD).

Dynamic functional connectivity methods
As discussed in the introduction, the list of published DFC methods that are designed to be applied to fMRI imaging data is long. In an ideal world all methods will be contrasted under the same conditions so that an evaluation of which methods that give appropriate results can be performed. However, it was not our intention to provide a complete comparison of all published methods. Instead we have made all simulation tools freely available so that researchers can evaluate their own DFC methods. Before describing the simulations and the results, we provide a brief overview of the five methods that are evaluated in this article.

Sliding window (SW)
The SW method is one of the most commonly used methods to estimate DFC. The sliding window method uses a continuous subsection of the data, estimates the degree of correlation (Pearson correlation), slides the window one step in the time series, and repeats. This creates a smooth connectivity time series as neighbouring estimates of connectivity share all but two data points. The SW method is based on the assumption that nearby temporal points are helpful to estimate the covariance. In our simulations, the window length was set to 63, which is reasonable for a sliding window analysis. Given the common choice of a time resolution (TR) of 2 seconds in fMRI, this results in a window length of 126 seconds which is reasonable, given rule-of-thumb choices for window length estimates that has recently been proposed (35).

Tapered sliding window (TSW)
The TSW method can be described as a weighted Pearson correlation where the weights are set to zero except for the data points residing inside the window. This procedure is identical to the SW method except that a larger weight is placed on time points closer to the centre of the window (t). Often, the weights are distributed according to a Gaussian distribution centred at t. In our simulations using the TSW method, we used a Gaussian distribution with a variance of 10.
The window length was the same as for the SW method (centred at t). See also (15) for an example of usage of the TSW method.

Spatial Distance (SD)
An alternative to using temporally adjacent data points is to estimate the covariance using data points that have similar spatial profiles. In this case, the weights are based on distances in spatial dimensions of data acquired at different time points. In functional neuroimaging data each spatial dimension corresponds to a region of interest or voxels. One way to accomplish this is by first computing weights for each time point (t) to all other time points. This can be done by taking the inverse of their Euclidean distance to each other time point (u): where D is the Euclidean distance. Hence, the closer in space the data points are located, the larger their individual weights will be. Each time point gets a weight vector (w t ) which is the length of the time series. The weight vector for each time point are subsequently scaled between 0 and 1. The "self weights" (w t t ) are set to 1. The connectivity estimate at t is the weighted Pearson correlation where each time point is weighted by w t . For more details of the SD method, see (8).

Jackknife Correlation (JC)
The JC method has previously been shown on electrocorticographic data for signal trial coherence and Granger causality (36). To the best of our knowledge, the jackknife correlation method has not yet been utilized in the DFC literature. Thus, we provide a more detailed description of its logic and workings. The jackknife correlation method is outlined in more detail and contrasted to a binning approach (which is akin to the sliding window method) in (36). The JC method, when applied to single time point covariance estimates of signals x and y at t computes the Pearson correlation between the two signals using all time points in x and y with the exception of x t and y t : Of note, the inclusion of the minus sign in the equation above is to correct for the inversion caused by the leave-n-out process (see below). Thex t andȳ t are the expected values, excluding data at time point t: To demonstrate the JC method, 10,000 time points were drawn from a multivariate Gaussian distribution with a mean of 0 and a variance of 1 to generate the two time series shown in Figure 1A. Additionally, the time series were constructed so that the covariance between the two varied as a function of time. For the first 2,000 time points, the covariance was set to 0.8 and then further decreased in steps of 0.2 for every 2,000th time point ( Figure 1B).
The relative connectivity time series are similar (but inverted) for a leave-nout compared to a window-length-n methods ( Figure 1C). The JC method corresponds to the case when n=1, i.e. a leave-1-out approach after correcting for the inversion (all leave-n-out estimates are multiplied by -1 to correct for the inversion in Figure 1). All possible choices of n were computed for both methods (i.e. a leave-1-out to a leave-9,999-out and a window-of-length-2 to window-length-10,000 (see Figures 1DE). As shown in Figure 1G, the Spearman correlation between the two methods is close to 1 for various choice of n. However, their correspondence in covariance estimates between the two methods departs at the tails ( Figures 1F and 1H). These deviations occur for two reasons: (1) when n is very small it implies there is little data to work with for the window-length-n method. On the other hand, low values of n do not hamper the performance of the leave-n-out method. (2) Large values of n will result in few estimates for covariance for each time point, which makes the correlation between the two methods less stable. In sum, while it is impossible to create estimates for window-of-length-1, it is however possible to use a leave-1-out method as an approximation for a window-length-1 due to the symmetry between the two methods.
When estimating the DFC, the two major aims are to accurately measure the covariance and to be sensitive to changes in the covariance. In the case of the leave-1-out (i.e. the jackknife correlation) approach, we achieve a unique connectivity estimate per time point that is more reliable than using a smaller window size (due to the fact that more data is used). Usually, the SW method has to find a balance between the two aims. In this respect, the JC method is an optimal sliding window method as it does not have to compromise between sensitivity on one hand and accuracy on the other.
The time point based DFC estimate obtained with the JC method should be interpreted as the relative difference in connectivity at any particular data point compared to all other data points in the time series. This is because the covariance for each data point is estimated based on its relationship to all other data points. To illustrate this effect, consider the 49 data points randomly sampled from a Gaussian distribution with a mean of 0 and a covariance of 0.5 as shown in Figure 1I. If we assume that the 49 time points are used to compute the JC estimate for the covariance for a 50th time point, the value of this new  Figure 1J. In sum, individual JC estimates have little meaning by themselves but they do show which time points have an increased connectivity and which that have a decreased connectivity relative to all the other points in the time series.
When using the JC method to estimate DFC, it is important to keep in mind that it leads to a compression of the variance. Furthermore, the amount of compression is proportional to the length of the time series. It is often helpful to scale or standardize the connectivity time series derived by the JC method before any subsequent analysis. Finally, while a Pearson correlation was used in this study for the JC, it is possible to use other correlation methods such as the Spearman Rank instead.

Temporal Derivative (TD)
The TD approach to estimate DFC was first introduced in (16). In brief, the temporal derivative method first computes the temporal derivative of a time series as: Next, the coupling between the signal sources i and j is defined as the product of the two derivatives df i and df j for each time point t, divided by the product of the standard deviation for df i and df j : The TD method is often used together with a smoothing function in the form of a window function. In our simulations, a window length of 7 was chosen, since this was considered optimal in (16).

Post-processing for DFC estimates
After each of the DFC methods were applied to the simulated data a Fisher transform was applied to the connectivity time series (except for the TD method).
To illustrate the variance compression that results from the JC method, the DFC for the JC method was not standardized in Simulation 1.
The SW, TSW and TD methods are all effected by any autocorrelation existing in the BOLD signal, whereas the spatial distance and jackknife correlation method are not effected by any autocorrelation in the time series.

Simulations
To compare accuracy and performance for the five DFC methods, we performed four different simulations. The first simulation investigated the similarity of the five methods by correlating their respective connectivity estimates. The second simulation targeted how well the different methods were able to track a fluctuating covariance parameter. The third simulation tested how robust the estimated fluctuating covariance is when the mean of the time series fluctuates, mimicking the haemodynamic response function. The forth simulation considered whether DFC methods can accurately track abrupt changes in covariance.
Simulations 2, 3, and 4 all consisted of fluctuating covariance parameter (r t ) that was used to generate the time series. This is the covariance between the time series that the data was sampled from at t. The parameter choice for the difference simulations were considered relevant for the fMRI BOLD signal. A full account of all model assumptions made as well as a justification for our model parameter settings for the four simulations models used in the present study is given in the Supplementary Material.

Statistics
In principle, it is possible to simply correlate the results from the different DFC methods with the r t values to statistically evaluate their performance. However given the inherent, but known, uncertainty in r t , we deemed it was appropriate to create a statistical model which accounts for this uncertainty. Thus, for each DFC method, a Bayesian statistical model was created to evaluate the relationship between the DFC estimate and the signal covariance.
The Bayesian model aims to predict y, which is the vector of the known sampled covariances (i.e. r t ) with x, which is the connectivity estimate for each DFC model.
All DFC estimates and the values of r t were standardized prior to calculating the models, to facilitate the interpretation of the posterior distribution parameter β.
The different DFC methods vary in length of number of time points estimated (e.g. the beginning and end of the time series cannot be estimated with the sliding window method). In order to facilitate model comparison between methods, we restrained the simulations to include only the time points that had estimates from all DFC methods (i.e the limit was set by the SW and TSW methods which can estimate the covariance for 9,938 out of 10,000 time points).
The statistical model were estimated through 5,500 draws from a Markov Chain Monte Carlo (MCMC) with a No-U-Turn Sampler (37)

Simulation 1
The first simulation aimed to quantify the similarity of the different DFC time series estimates. If two DFC methods are strongly correlated, this is a positive sign as they are estimating similar aspects of the evolving relationship between time series. A negative correlation between two methods would suggest that they do not capture the same dynamics of the signal.
In this simulation we created two time series (X), each consisting of 10,000 time points in length. The time series had an autocorrelation and covariance given by: The autocorrelation with lag of 1 is determined by αX t−1 and the covariance at t is determined by . was sampled from a multivariate Gaussian distribution (N ): where µ is the mean and Σ being the covariance matrix of the multivariate Gaussian distribution. Both time series were set to have a mean of 0, variance of 1 and a covariance of 0.5. In summary: The autoregressive parameter α controls the size of the autocorrelation in relation to the preceding time point (i.e. the proportion of the previous time point that is kept). Here, it was set to 0.8 which was deemed to be an appropriate degree of autocorrelation for BOLD time series. A portion of the two simulated time series is found in Figure 2A together with the plots of their respective autocorrelation (Figures 2BC) and a plot of the correlation between the two time series ( Figure  2D). The resulting connectivity time series for the five different DFC methods when applied to the simulated data is shown in Figure 3. From Figure 3, several qualitative observations can be made about the methods. Firstly, there was a very strong similarity between the SD and JC methods, despite the fact that they consist of quite different assumptions. Further, the SD, JC, and TD methods were all able to capture considerably quicker transitions than the SW and TSW methods. Finally, the variance of the JC method was considerably smaller than all other methods, illustrating the variance compression as previously discussed. To assess the degree of similarity of the estimates of functional connectivity time series obtained from all five DFC methods, a Spearman correlations was used ( Figure 4). The connectivity time series estimates from all methods correlated positively with each other (Figure 4). The SD and JC methods showed a strikingly strong relationship (ρ = 0.976). The two sliding window methods also display a strong correlation (ρ = 0.735). The lowest correlation was found between the SW and TD methods (ρ = 0.086).
The results from Simulation 1 showed that the connectivity estimates provided by the tested methods are, to a varying extent, correlated positively with each other. It also illustrated how the different methods differ in their resulting smoothness of the connectivity time series. The results from this simulation cannot validate whether any DFC method is superior to any other, it merely highlights which methods produce similar connectivity time series.

Simulation 2
In Simulation 1, it was not possible to evaluate how well the different DFC methods perform. To evaluate the performance, the simulated data must change its covariance over time and how this change must be known beforehand. The aim of this simulation was to see how well the derived DFC estimates can infer the covariance that the data was sampled from when the covariance is fluctuating.
Two time series were generated (X). Each time point t is sampled from a multivariate Gaussian distribution: where the covariance matrix was defined as: and where the variance, σ = 1, was set to 1. At each time point, r t was sampled from another Gaussian distribution: The mean of the time series (µ) was set to 0, the mean of the covariance (µ r ) was set to 0.2. The variance of the fluctuating covariance (σ r ) was set to vary with the values of {0.08, 0.1, 0.12}.
The covariance at time (r t ) was sampled from a Gaussian distribution. Each time point received a new value of r t . This allowed us to compare each DFC method's connectivity estimate in relation to the time varying covariance parameter r t . Note, r t is not a definitive measure of the covariance at t, it is only a parameter of the covariance used to sample from a Gaussian distribution. Thus, we should not expect the connectivity estimate from any method to correlate perfectly with r t . However, it is possible to compare which method correlate best with r t to evaluate the overall performance.
The above model will have a temporally fluctuating covariance. It fails to include any autocorrelation in the time series. Not accounting for this may bias the results for some of the tested methods that utilize nearby temporal points to assist estimating the covariance. Merely adding an autocorrelation, like in Simulation 1, will also increase the covariance between the two time series and this will not be tracked by r t . To account for this, we placed a 1-lag autoregressive model for the fluctuating covariance at r t : Where α is the autocorrelation parameter. The values for µ r and σ r were the same as above. When t = 1, was set to 0.
This revised formulation of our simulation model allowed for the covariance to fluctuate, but with an added autocorrelation on the covariance parameter. In simulation 2, three different settings of the parameter α were used (α = 0, 0.25, 0.5). When α = 0 it is equivalent to the original model outlined above with no autocorrelation. With an increased α it entails a greater influence of the covariance from t − 1 in sampling the covariance at t. α = 0.5 is reasonable given highly correlated BOLD time series. An α = 0 is more to be expected when time series are less correlated. 10,000 time points were sampled for each of the three different settings of the autocorrelation parameter. See also the Supplementary Materials for a justification of the parameter settings chosen here.
Simulation 2 was run with 9 different simulation hyperparameter combinations: three different values of α and three different values of σ r .
A sample of time series generated with the model using different settings for the autocorrelation parameter α is shown in Figure 5A. Due to the varying degree of autocorrelation, the mean covariance for time series changes as a function of α, but all still create a Gaussian distribution of r t (Figures 5B-D). The degree of crosscorrelation between the two time series followed the specified α parameter for the autocorrelation of the covariances (Figures 5E-G).
The results from Simulation 2 are shown in Tables 1-9. The JC method had the lowest WAIC score for all settings of α, followed by the SD method. The SW, TSW and TD methods performed nearly equally well for α = 0. The TD method came in third place for all but one parameter configuration. All WAIC values, their standard error and ∆ WAIC scores are shown in Tables 1-9.
The posterior distribution of the β parameter for each of the DFC methods for all parameter choices are shown in Figure 6. Larger values in the β distribution for a method (i.e. correlating more with r t ) conforms with the best fitting models (i.e. lower WAIC score). The SW method performed the worst, followed by the TSW method. The TD method came in third place. SD and JC showed the best performance, with similar posterior distributions of β, although the JC was always slight higher.
There was little difference between the methods when changing the variance of the fluctuating covariance (σ r ). The β values do however scale when σ r changes. When σ r is smaller, β values decrease.
In sum, the JC method, followed closely by the SD method, showed the best performance in terms of tracking a fluctuating covariance between two time series as performed in Simulation 2. The TD method ranked in third place when there is a higher crosscorrelation between the time series present. The SW and TSW methods showed the worst performance, both in the WAIC score (Tables  1-9) and posterior distributions of β ( Figure 6).

Simulation 3
The aim of Simulation 3 was to examine the behaviour of different DFC methods when there were non-stationarities present in the data. A typical scenario when this will occur is in DFC is in task fMRI. Simulation 3 is identical in structure to Simulation 2 apart from the following two changes: (1) A non-stationarity, aimed to mimic the occurrence of an event related haemodynamic response function (HRF). Specifically µ, which was set to 0 for both time series in Simulation 2, received a different value at each t (see next paragraph). (2) σ r was set to 0.1 instead of varying across multiple values. This is because Simulation 2 showed no large differences when varying σ r .
µ t was set, for both time series, according to the value of a simulated HRF, that was twenty time points in length and repeated throughout the simulation. The HRF was simulated, with a TR of 2, using the canonical HRF function as implemented in SPM12 using the default parameters (38). This HRF, which has a length of 17 time points, was padded with an additional 3 zeros. The amplitude of the normalized HRF was multiplied by 10 to have a high amplitude fluctuations compared to the rest of the data. µ t is thus the padded HRF repeated throughout the entire simulated time series. This represents a time series that includes 250 "trials" that each lasts 40 seconds. This simulation helps illustrate how well DFC methods could be implemented in task based fMRI. Examples of the time series generated using different autocorrelation are shown in Figure 7.
The results from Simulation 3 are shown in Figure 8 (Posterior distributions of β) and Tables 10-12 (model fit) which evaluated each DFC's method performance at tracking the fluctuating covariance (r t ). Results were similar with Simulation 2. In the case when the autocorrelation of the covariance was 0, the SW, TSW and TD methods performed quite poorly, but all improved to varying degrees as this increased. The SW method was generally the worst method, followed by TSW. The TD method came in third place. The JC method has the best performance, followed closely by the SD method, in all parameter conditions.
In sum, the results from Simulations 2 and 3 suggests that the JC method has the best performance in terms of detecting fluctuations in covariance compared to the other four DFC methods. This result also holds when a non-stationary event related haemodynamic response was added to the mean of the time series.

Simulation 4
Simulation 4 aimed to test how sensitive different DFC methods are to large and sudden changes in covariance (i.e. changes in "brain state") that previously have been postulated to exist in fMRI data (e.g. (11,15,17)). We here start in a similar fashion as we did in Simulation 2 where samples for the two time series are drawn from a multivariate Gaussian distribution . For each parameter configuration, a model was created for each DFC method. The DFC estimate was the independent variable estimating the fluctuating covariance (r t ) between the two time series.
Similar to simulation 2, we set µ t = 0 and σ = 1. The covariance parameter r t was sampled from a Gaussian distribution where the mean was shifted These values correspond to the number of time points a "state" lasts. Beginning at t = 1, µ statet to µ state t+l was randomly sampled from M where l was sampled from L. This procedure was continued until X t was 10,000 samples long.
These choices for brain state changes provide time scales of state transitions between 40-120 seconds (slow condition) or 4-12 seconds (fast condition) in simulated fMRI data with a TR of 2 ( Figure 9A). The statistical model for evaluating the different DFC methods performance was the same as Simulation 2 and 3. A summary of data generated in Simulation 4 is shown in Figure 9A-F.
The results from Simulation 4 are shown in Figure 10 and Tables 13-14. In the quick transition condition, the JC and the SD showed the best performance for both the WAIC scores and the posterior distribution of β ( Figure 10A; Table 7). In the slow transition condition the two sliding window methods outperformed the other methods ( Figure 10B; Table 8). The JC and SD methods perform similarly for both conditions. Thus, when there are shifts in covariance that occur relatively slowly, the sliding window methods are sensitive at tracking these changes.

Discussion
In this study we have developed four simulations to test the performance of different proposed dynamic functional connectivity methods. The first simulation showed which methods yield similar connectivity time series. Notably, all methods correlated positively with each other, but to a varying degree. The second simulation generated data in which the autocorrelated covariance between simulated time series varied in time. In this case, the JC method, followed closely by the SD method, showed the best performance. In the third simulation, the generated time series contained a non-stationary mean related to haemodynamic  responses. Again, our simulations suggested that the JC method performed best. The forth simulation included nonlinear shifts in covariance (in an attempt to simulate brain state shifts). When the states changes were quick, the JC method performed best. When the state changes were slow, the TSW (followed by the SW) performed best.
In a previous simulation that evaluated the sliding window method, the sensitivity of SW and TSW method was found to be good at detecting state shifts (39).
Here, at least when the transitions are slow, we found similar results. The sliding window methods is optimal if there are slow state changes. However it is unclear if "state changes" are the best yardstick for dynamic brain connectivity.
In particular, non-stationarities in dynamic connectivity have been attributed to spurious sources such as movement (12). Given the unknowns of the "true" connectivity, methods which are robust over conditions are more likely the safer options -in this case the JC or SD method performed similarly in both conditions.
Overall the jackknife correlation method performed the best across all simulations.
We have shown it to be robust to numerous changes in parameters. However, the JC method is not without some considerations. First, it introduces variance compression that reduces the absolute variance, while preserving the relative variance within the time series. This variance compression also scales with the length of the time series. The consequence of this is that direct comparisons of the DFC variance between cohorts/conditions becomes hard to interpret as dynamic fluctuations, especially when the length of the data varies. However, this is the case for most methods and it should be remembered that the variance is proportional to the static functional connectivity (7,9,10). Simply put, the JC method (like all other methods) should not be used for a direct contrast of the variance of DFC time series. Second, the JC method sensitivity means that noise will be carried over per time point instead of being smeared out over multiple time points. This is actually beneficial as it allows for further processing steps to be applied that aims to remove any remaining noise (e.g. motion) which cannot be done when the noise has been smeared across the connectivity time series (e.g. in windowed methods).
The simulations and results presented in this study should not be taken as an exhaustive and complete assessment of all aspects of a given method to conduct DFC. Rather, the four simulations described here represents a subset of possible scenarios in terms of different methodological characteristics that might be of interest. The current four simulations are marked dfcbenchmarker simulation routine V1.0. If modifications or additional scenarios are considered to be improvements to the current simulations, these will get an updated version number. Many additional simulations could be conceived of on top of this original routine. For example, one could include multiple time series, adding movement type artifacts, adding frequency relevant characteristics, a stationary global signal etc. These have not been included here, as the focus in these simulations was to primarily assess tracking of a fluctuating covariance. Input 21 from researchers about appropriate additions to the simulations is welcome.
We encourage researchers designing DFC methods to benchmark their own results with dfcbenchmarker (www.github.com/wiheto/dfcbenchmarker). Researchers need only to write a Python function for their method and use it as an input for dfcbenchmarker.run_simulations() and their method will be compared to the five methods presented in this paper (see online documentation). Functions can then be submitted through the function dfcbenchmarker.send_method(). All valid methods submitted will be released in summaries of the submitted benchmarked results so that researchers can contrast the performance of different methodologies.                      In simulation routine V1.0 of dfcbenchmarker, we wanted the simulations to either be justified based on the properties of fMRI data or, in the case of Simulation 4, reflect what dynamics are hypothesized to exist in fMRI data.

Tables
Since the properties of the simulations are determined by their parameters, we justify these parameters in three different ways. First, we show that some parameters have no substantial effect on the result. Second, some are justified from previous findings/hypothesis in the literature. Third, by use resting state fMRI data to justify the assumptions.

fMRI data used
The midnight scanning club (MSC) data (1) is used to justify the assumptions. The dataset contains ten subjects with ten sessions of resting state fMRI. This data was obtained from the OpenfMRI database (accession number: ds000224). The preprocessed volumetric data was used. To reduce the dimensions of the data, 278 spherical regions of interest were defined (5mm radius) with their the centre of mass defined by the Shen et al 2013 atlas (2) (using the Talairach coordinates). Subject MSC08 is often reported as an outlier, which has previously been noted in Gordon et al where they noted this subject had poor data quality due to subject head movement and sleep (1). When a single subject and and session is shown (for illustration purposes), subject MSC10 and session 7 was used. This subject and session was selected at random. When specific ROIs were selected to be shown, these were randomly generated (from subject MSC10, session 7).
See github.com/wiheto/dfcbenchmarker_assumptions/1.0/assumptions.ipynb for the Jupyter notebook to run all the analysis from start to finish. µ and σ for all simulations were set to 0 and 1, with the exception of µ in simulation 3. As long as the time series have a stationary mean and variance, then these parameters can be set to anything and have no effect on the overall result. Simulation 3 tests how methods deal with the non-stationartity of µ.

Autocorrelation of the time series (Simulation 1).
The auto-correlation (α) is an important parameter for simulation 1. In Figure  S1, we can see the average auto-correlation for a single subject between 0 and 10 lags. The average and standard deviation for each subjects and session is shown in Figure S2. Excluding the sessions from the subject known to be noisy (MSC08), there were only a few sessions with average auto-correlation values outside the range of 0.70-0.85. Most subject averages were close to 0.80. Accordingly, α in simulation 1 was set to 0.80.

Autocorrelation of r t (Simulation 2 and 3)
The autocorrelation of r t is an important parameter for simulation 2 and 3, where the parameter α now refers to the autocorrelation of the covariance parameter r t . The expected crosscorrelation will be different for different values of static functional connectivity. Thus, we binned each edge based on its static connectivity bins (each bin was placed between -1 and 1 in spaces of 0.1). Figure  S3A shows the average crosscorrelation of the example subject and session for 10 lags for each bin. The general pattern is that at lag 1, the average correlation tends towards zero compared to lag 0 (which is the static functional connectivity). The frequency of the number of edges across the different correlation values is at lag 0 ( Figure S3B) and lag 1 ( Figure S3C) shows that the majority of edges are within the range -0.5 and 0.5.
The average crosscorrelation at lag 1 looks very similar across different subjects and sessions ( Figure S4). When pooling all static functional connectivity values across subjects and sessions, the histogram ( Figure S5) shows that the pattern at a group level is similar to that shown in Figure S3B. Thus, more edges correspond to a degree of connectivity that would be modeled with a lower α parameter. However, for edges with larger degree functional connectivity, a higher α is to be expected.     Covariance (µ r , σ r , r t ) σ r dictates how wide r t is going to fluctuate over time. What the true value of σ r is, is one of the key questions dynamic functional connectivity wants to answer. Making σ r higher, entails that there is a larger parameter space that r t gets sampled from which decreases the noise when estimating r t . When the distribution that r t is sampled from widens, there is a lower probability that the time series will be sampled similarly, meaning the inherent uncertainty of estimating single samples of r t decreases when the variance of r t increases. The effect of changing σ r is that the β values will scale with it, but the relative difference of the β values between methods will remain similar. The σ r parameter in is set to 0.08, 0.1 and 0.12 to illustrate this in simulation 2 to show that this has little effect other than a scaling effect of β ( Figure 6). In simulation 3, only 0.1 was used for σ r .
The average connectivity (µ r ) was set to 0.2 in Simulation 2 and 3. It can be seen in Figure S5 that this is a reasonable positive connectivity value. The µ r parameter is unchanged throughout simulation 2 and 3, however due to the fluctuating α parameter the effective covariance between the two time series increases as a function of α ( Figure 5). It can be seen that this effective increase in covariance had little effect on the different methods ( Figure 6, Figure 8).
The r t in simulation 4 switched between a very high connectivity "state" of 0.6 and lower, but positive, connectivity "state" of 0.2. These are plausible connectivity values ( Figure S5). See the section below on state change assumptions for more details.

Gaussian assumption
All simulated time series were sampled from a (multivariate) Gaussian distribution. The simulations can be improved by considering frequency specific information (see caveats below). The amplitude of the time series for ROIs in the example subject and session have a unimodal shape and relatively unskewed distributions ( Figure S6A). To illustrate this more clearly, Figure S6B shows three randomly selected ROIs. When looking at all groups the average skewness was calculated ( Figure S7). Given the distributions are unimodal they too can at least be approximated with a Gaussian distribution. While not perfect, and still a simplification, a Gaussian distribution is a reasonable distribution to sample from.
The Gaussian distribution assumption for sampling r t can also be motivated by (3) where the dynamic functional connectivity was generally unimodal and, with

Trial length and HRF scale
The scale of the HRF amounts to how much the standardized HRF has been scaled so that it creates a larger non-stationarity in mean of the simulated data in Simulation 3. This parameter was set to a high value to illustrate each method's ability to estimate connectivity when a non-stationarity mean is present in the data. There will be little effect when changing this, but if it was reduced it will eventually be identical to simulation 2. The non-stationarity introduced by the adding the HRF function to the data is shown in Figure 7.
The HRF lasting 20 seconds with an assumed TR of 2 means that there are 40 seconds between trials. This is quite long for an event related fMRI experiment. If this was reduced, it would just entail that the HRF stacks due to overlap. This would have no effect on what the simulation is testing which is just how well the different methods can act upon non-stationarities. The TR was considered to be 2 seconds because many studies still use this TR despite lower TRs being available. However, this has no implications on the results.

Assumptions regarding shift in "brain state"
The duration of a state in the quick condition are on the approximate time scale found in (4), although slightly quicker (where the average state transition was 7 seconds). The longer transitions are based on the time scale explored in (5). Note however that these two different time scales for states originate from different dynamic functional connectivity methods. Detecting brain states has, to our knowledge, not been justified without using dynamic functional connectivity which makes the reasoning to justify this parameter somewhat circular. Simulation 4 was added primarily to evaluate how the different methods can capture dynamics that many researchers hypothesize dynamic functional connectivity to be, not necessarily how the data is.

Caveats
The current simulations do not take into account any frequency information of fMRI brain connectivity which is known to play a role in resting state fMRI. This means that any methods that utilize this aspect of the signal (e.g. phase) cannot be evaluated at the moment. The Gaussian assumption of the distributions can be improved upon by adding more fMRI specific noise. Moreover, negative correlations were not considered in the simulations.
The autocorrelation of r t in simulations 2 and 3 do not necessarily entail an autocorrelation of X. While this could in principle be added to the simulation