Exploring the reproducibility of functional connectivity alterations in Parkinson’s disease

Since anatomic MRI is presently not able to directly discern neuronal loss in Parkinson’s Disease (PD), studying the associated functional connectivity (FC) changes seems a promising approach toward developing non-invasive and non-radioactive neuroimaging markers for this disease. While several groups have reported such FC changes in PD, there are also significant discrepancies between studies. Investigating the reproducibility of PD-related FC changes on independent datasets is therefore of crucial importance. We acquired resting-state fMRI scans for 43 subjects (27 patients and 16 normal controls, with 2 replicate scans per subject) and compared the observed FC changes with those obtained in two independent datasets, one made available by the PPMI consortium (91 patients, 18 controls) and a second one by the group of Tao Wu (20 patients, 20 controls). Unfortunately, PD-related functional connectivity changes turned out to be non-reproducible across datasets. This could be due to disease heterogeneity, but also to technical differences. To distinguish between the two, we devised a method to directly check for disease heterogeneity using random splits of a single dataset. Since we still observe non-reproducibility in a large fraction of random splits of the same dataset, we conclude that functional heterogeneity may be a dominating factor behind the lack of reproducibility of FC alterations in different rs-fMRI studies of PD. While global PD-related functional connectivity changes were non-reproducible across datasets, we identified a few individual brain region pairs with marginally consistent FC changes across all three datasets. However, training classifiers on each one of the three datasets to discriminate PD scans from controls produced only low accuracies on the remaining two test datasets. Moreover, classifiers trained and tested on random splits of the same dataset (which are technically homogeneous) also had low test accuracies, directly substantiating disease heterogeneity.

Introduction Although Parkinson's disease (PD) is the second most common neurodegenerative disease after Alzheimer's disease, its diagnosis is still difficult, especially in the early premotor stages, as it is mainly based on clinical evidence. To date, there is still no unique standard diagnostic test for PD, despite the intense research efforts to develop accurate biomarkers based on blood tests or imaging scans. The best current objective tests for PD evaluate dopaminergic function in the basal ganglia by using various PET or SPECT radiotracers (e.g. DaTSCAN). But these tests make use of radioactive substances, are performed only in specialized imaging centers and can also be very expensive [1]. Moreover, the loss of dopaminergic nigro-striatal neurons is a delayed pathological event in the evolution of the disease, corresponding to Braak stages III-IV.
On the other hand, conventional (CT or MRI) brain scans of PD patients usually appear normal or with minor non-specific changes, so that conventional imaging techniques are only useful for ruling out other diseases that can be secondary causes of parkinsonism.
Therefore, since anatomic MRI is presently not able to directly discern (dopaminergic) neuronal loss in PD [2], studying the associated functional connectivity (FC) changes seems to be a promising approach toward developing non-invasive and non-radioactive neuroimaging markers for this disease.
While many groups have reported such FC changes in PD (see Table A in S1 File for a list of such studies), an in-depth analysis of existing literature revealed significant discrepancies between studies. Investigating the reproducibility of PD FC changes on independent datasets is therefore of crucial importance.
A comprehensive review and analysis of the literature related to resting-state fMRI studies of Parkinson's disease is out of the scope of the present paper   (Table A in S1 File; see also the review by Tahmasian et al. [26]). We only mention some important inconsistencies of reported functional connectivity changes in PD. Due to the crucial importance of the striatum in PD, we first discuss some inconsistencies involving striatal seeds [26]: • Contrary to Hacker et al. [14], Helmich et al. [10] observed no significant difference in caudate functional connectivity in PD.
• On the other hand, contrary to the study Helmich et al. [10], Luo et al. [12] did not observe increased FC of the anterior putamen.
• In contrast to Hacker et al. [14], Luo et al. [12] did not find a FC decrease between the striatal seeds and the brainstem.
There are also inconsistencies involving non-striatal seeds. For example, Wu et al. [7] found disrupted FC between the pre-SMA and the left putamen, as opposed to Helmich et al. [10], who did not find a decreased FC between the putamen and pre-SMA in PD.
Since the motor symptoms are the most striking clinical manifestations in PD, many rs-fMRI studies of PD concentrate on the sensorimotor system, including the basal ganglia, while disregarding any other FC changes. On the other hand, other more unbiased studies tried to determine a more global picture of the FC changes in PD. Some even tried to develop classifiers for the disease based on rs-fMRI data [4,5,18,23], but most studies were not validated on independent datasets.
There are also some gross discrepancies involving even the sign of the main FC changes in PD. For example, Luo et al. [12] found only decreased FC in early stage PD, whereas most studies also find FC increases.
The general picture one gets from the literature is complex and at times somewhat confusing due to the numerous inconsistencies. Of course, these inconsistencies could be due to the different disease stages analyzed, to the inherent functional heterogeneity of the disease, but also to technical differences, or to the differences in the complex data (pre)processing workflows. Therefore, it is crucial to use the same data processing workflow to check the reproducibility of PD-related FC changes on as many independent datasets as possible.
In this work, we report a comparison between three different datasets obtained by completely independent research groups. More precisely, we acquired resting-state scans for 43 Romanian subjects (27 patients and 16 normal controls, with 2 replicate scans per subject) and compared the observed functional connectivity changes with those obtained in two independent datasets, one made available by the PPMI consortium in the US (91 patients, 18 controls) and a second one by the group of Tao Wu in China (20 patients and 20 normal controls). This is the first study investigating the reproducibility of functional connectivity changes in Parkinson's disease on more than two datasets. Given the paucity of publicly available rs-fMRI PD datasets, we advocate the critical importance of data sharing for enabling the discovery of reproducible rs-fMRI biomarkers of PD. All the data from the present study are publicly available at FCP/INDI (http://fcon_1000.projects.nitrc.org/indi/retro/parkinsons.html).

Materials and methods
We briefly describe our approach in Fig 1A to better guide the reader through the remainder of the paper. We started by comparing the PD-related global functional connectivity changes in the 3 datasets and found them to be non-reproducible. Of course, this could be due to disease heterogeneity, but also to technical differences. To better distinguish between these two possibilities, we present a simple method to check for disease heterogeneity using random splits of a single dataset. On the other hand, we search for individual brain region pairs with consistent connectivity changes across all three datasets. Finally, to more directly discriminate PD scans from controls, we train multivariate machine learning classifiers on one dataset and test them on the remaining two. We also train and test classifiers on technically homogeneous random splits of the same dataset, to more directly check for disease heterogeneity.

Datasets
Three resting-state fMRI datasets of Parkinson's disease were compared in this study (see also The datasets are somewhat similar, except for PPMI, which involved patients with a diagnosis of PD for two years or less and who are not taking PD medications, while most patients from the other two studies have been under treatment (most under levodopa). Also, PPMI patients were scanned in the 'eyes open' condition. Still, we argue that our findings were not affected by these differences. Since the datasets were compared in a pairwise manner, any putative discrepancies due to the shorter disease durations in the PPMI dataset would only show up in the NEUROCON-PPMI and Tao Wu-PPMI comparisons, but not in the NEUROCON-Tao Wu comparison. This was not observed in reality.
The NEUROCON study has been approved by the University Emergency Hospital Bucharest ethics committee in accordance with the ethical standards of the 1964 Declaration of Using random splits of a dataset with replicate scans to check for disease (group) heterogeneity: (right) by placing different subjects (with all their replicate scans) in the two splits ("split subjects") and respectively (left) by splitting the replicates of the same subjects in the two splits ("split replicates"). (a,b,c,. . . correspond to subjects, while, for instance, a' and a" are replicate scans for subject a).
https://doi.org/10.1371/journal.pone.0188196.g001 Helsinki and its later amendments. All patients gave their written informed consent to participate in the study.
Preprocessing. All datasets were preprocessed in a uniform manner. The raw scanner data in DICOM format was converted to NIfTI using dcm2nii (https://www.nitrc.org/ projects/dcm2nii/) and further preprocessed using FSL (FMRIB Software Library v5 http://fsl. fmrib.ox.ac.uk/fsl/) as follows: motion correction using MCFLIRT, brain extraction with BET, spatial smoothing (Gaussian kernel FWHM 5mm) and denoising using nonlinear filtering (SUSAN), temporal high-pass filtering (with a cutoff frequency of 0.01 Hz), registration to the standard Montreal Neurological Institute MNI152 template via the anatomical T1 image (more precisely, BBR registration of the BOLD image to the T1 image, followed by 12 DOF lin-ear+nonlinear registration of the latter to the 2mm MNI template). Nonlinear registration was performed at a resampling resolution of 4mm.
Besides the above 'standard' preprocessing workflow, we also considered alternative workflows involving global signal regression (GS) and respectively a temporal bandpass filter (0.01-0.1Hz-an ideal low-pass 0.1Hz filter was used in addition to the default FSL 0.01Hz highpass filter).
Since subject motion in the scanner has been observed to have significant influence on the functional connectivity computed from rs-fMRI data, despite motion correction (e.g. [27]), we also considered subsets of scans with low in-scanner motion (marked by the suffix '0', e.g. 'NC0' and 'PD0'-see also Table B in S1 File).

PD-related functional connectivity changes
Functional connectivity [28] is a rather loosely defined term, which encompasses many different methods used to reveal temporal correlations of BOLD activity across the brain. The simplest method consists in computing the correlations between all pairs of regions of a given brain parcellation, but more sophisticated data decomposition methods, such as Independent Component Analysis (ICA) are also widely used. Such data decomposition methods do not assume a given brain parcellation, but instead construct spatial maps grouping voxels with highly correlated timecourses. (Still, instead of being given a parcellation, such methods need a target number of components.) In our study of the reproducibility of functional connectivity changes in PD, we used brain parcellations constructed independently of the datasets under comparison, rather than applying data decomposition methods, such as group-ICA, since the latter would be inherently biased towards the "training dataset". Group-ICA may obtain a better functional parcellation for the training dataset, but that parcellation would be less appropriate for any other independent dataset ("overfitting"), thereby introducing a bias in the analysis. To avoid these problems, we have chosen to use brain parcellations constructed independently of the datasets under comparison, including functional brain parcellations obtained by group-ICA on completely independent sets of subjects (such as the 'Stanford' functional parcellation [29]).
Moreover, to compensate for potential biases of any specific parcellation, we extended our analyses to a number of 13 different parcellations employed in other rs-fMRI studies, two anatomical (AAL, Talairach) and 11 functional (see Table 2 for more details).
For each parcellation, we computed average timecourses for each region of interest (ROI) and the resting state functional connectivity between each pair of ROIs (ROI 1 ,ROI 2 ) as the Fisher z-transform of the temporal correlation between the corresponding ROI timecourses: For each dataset, we determined significant PD-related FC changes by applying two-sample t-tests (with unequal sample sizes and unequal variances) to the functional connectivities of all ROI pairs. ROI-pairs with significant group differences (NC versus PD) represent regions whose functional connectivity was found to be significantly different in PD patients in that particular dataset. The main aim of this study is to determine whether these changes are reproducible across datasets, to enable the development of functional imaging biomarkers for PD.

Reproducibility of global functional connectivity changes in PD
Comparison of 3 different PD datasets. We first compared the global PD-related functional connectivity changes across the three independent datasets NEUROCON, Tao Wu and PPMI to check to what extent these changes are reproducible. More precisely, we performed pairwise comparisons for all dataset pairs as follows.
For each pair of datasets (i,j) and a fixed parcellation, we checked the extent to which the PD-related FC changes in one dataset are correlated to the changes in the second dataset.
PD-related FC changes were quantified using t-values t(ROI k ,ROI l ) from group comparisons (unpaired two-sample t-tests between NC and PD) of the functional connectivities between pairs of regions of interest FC(ROI k ,ROI l ).
Then the reproducibility R ij across the two datasets i and j was determined as the correlation between the corresponding t-values (viewed as a vector over all ROI pairs) for the two datasets:) where with t i (ROI k ,ROI l ) the t-value corresponding to PD-related FC changes between ROI k and ROI l with respect to dataset i (and similarly T j for dataset j).
For a more intuitive graphical depiction of reproducibility across two datasets, we also constructed the scatter-plot of ROI-pair t-values corresponding to group comparisons in the two datasets (see Fig 2 for an example of such a scatter-plot). Comparing PD-related FC changes (t-values) in the two datasets amounts to plotting for each ROI-pair the t-value in dataset 1 against the t-value in dataset 2. We thereby obtain a scatter-plot with a point for each ROI pair. The comparison of the FC changes in the two datasets thus involves analyzing the distribution of points in the scatter-plot: ideally, perfect reproducibility would entail a diagonal distribution of points in the scatter-plot, corresponding to perfectly correlated t-values in the two datasets.   The correlation of t-values for the two datasets R ij = corr(T i ,T j ), as introduced above in (1), can be viewed as an aggregate measure of the reproducibility across the two datasets i and j.
To obtain a more quantitative measure of the statistical significance of such a correlation R ij between datasets, we performed permutations of the group labels (NC and PD) independently for the two datasets and computed the p-value of the R ij value as the fraction of permutations σ for which the dataset correlation w.r.t. the permuted data R ij (σ) exceeds the real one (R ij ): where N is the total number of permutations. All our permutation tests involved N = 1000 permutations.
Various factors have been mentioned in the literature to affect functional connectivity measures: • subject motion in the scanner [27], • global signal regression (with or without) [37,38], • the choice of the parcellation.
To study the influence of these factors on our reproducibility results, we also considered subsets of scans with low in-scanner motion (marked by the suffix '0', e.g. 'NC0' and 'PD0'), repeated our analyses with global signal regression, bandpass filtering and performed the comparisons using all 13 brain parcellations previously mentioned. Since the PPMI data has been acquired in several different imaging centers, we also considered a potentially more homogeneous subset of scans acquired in a single center (center number 32, with the largest number of PD patient and normal control scans), referred in the following by the suffix 'center32'.
Comparison of random splits of the same PD dataset. As already mentioned in the Introduction, the observed lack of reproducibility of global FC changes across datasets could be due to disease heterogeneity, but also to technical differences. To distinguish between these two possibilities, we devised a method to directly check for disease heterogeneity using random splits of a single dataset with replicate scans. Technical differences can then be excluded since all the scans have been acquired under identical technical conditions. More precisely, since in the NEUROCON study we acquired two replicate scans for each subject, we constructed two homogeneous dataset splits simply by using (different) scans of the same subjects. Additionally, two heterogeneous dataset splits can be obtained by placing different subjects (with all their scans) in each split. In other words, instead of comparing two distinct datasets, we compared two random splits of the same dataset, either: (a) by placing different subjects in the two splits, with all the replicate scans of a subject in the same split ("split subjects", heterogeneous split), or (b) by placing each replicate scan of the same subject in a different split, so that the two splits contain (different) scans of the same subjects ("split replicates", homogeneous split).
Dataset splits (b) are homogeneous since they contain scans of the same subjects, while splits (a) are heterogeneous since they contain scans of different subjects. Therefore, consistent reproducibility across all random heterogeneous splits would indicate disease homogeneity, while non-reproducibility in a large fraction of random heterogeneous splits would imply disease heterogeneity. (In both cases, we expect to observe consistent reproducibility across the homogeneous splits, at least as long as the technical noise is not dominating the biological signal.) A diagram of our method is shown in Fig 1B. As in the pairwise comparisons between different datasets, we used permutation tests and formula (2) to compute p-values of the reproducibility across split datasets, for both the heterogeneous ("split subjects") and the homogeneous ("split replicates") datasets. Due to the random nature of the splits, we repeated the analysis for N s > 1000 different random splits of the original data. To assess the fraction of (non-)reproducible splits, we determined the empirical cumulative distribution function (CDF) of the reproducibility p-values for the N s random splits.
The analysis was also repeated for the data with global signal regression.

Influence of technical factors, preprocessing and parcellation.
We also studied the influence on reproducibility of certain key technical factors and preprocessing steps, such as: • the repetition time (TR), • linear vs. nonlinear registration, • global signal regression, • the specific brain parcellation used for evaluating functional connectivity.
The AAL parcellation (which is typical) was used whenever not specified otherwise. The repetition time might, in principle, influence the measured low-frequency rs-fMRI fluctuations and indirectly the functional connectivities (which are temporal correlations). To study the influence of the repetition time on reproducibility, we constructed a synthetic dataset with a double TR by leaving out every second time-point from the NEUROCON timeseries data (for each scan and each voxel). We then analyzed with our method the reproducibility of group changes in functional connectivity between the original NEUROCON dataset and the synthetic one with a double TR.
To study the impact of registration on reproducibility, we preprocessed the NEUROCON data both with linear and nonlinear registration to the MNI 152 template and determined the reproducibility of group changes in functional connectivity between the two resulting datasets.
Since global signal regression (GSR) has been observed to be very effective at removing scanning artifacts [38], including motion artifacts, we also studied the reproducibility of FC changes between the NEUROCON dataset processed with GSR and the same dataset processed without GSR.
To avoid potential biases of any specific parcellation, we repeated our pairwise comparisons between the 3 PD datasets using all 13 different parcellations mentioned above, including functional and anatomic parcellations, with a wide range of numbers of regions of interest (90 to 950).

Reproducibility of the individual differentiating FC changes
The reproducibility analysis performed above involves global functional connectivity changes, i.e changes in the FC of all ROI pairs, not just the ones that differentiate PD from normal controls. Even with non-reproducible global FC changes, it might be in principle possible that only a very few brain region pairs might still reproducibly differentiate PD patients from controls. To this end, we also studied individual brain region pairs with FC changes that are significant w.r.t. all datasets.
More precisely, for each ROI-pair, we compute max(p), the largest (least significant) of the three p-values obtained in the three datasets (separately for the FC increases and respectively decreases) and sort the ROI-pairs in increasing order of this max(p). The most significant min (max(p)) of these max(p) corresponds to the ROI-pair with the best overall significance with respect to the 3 datasets, as all other ROI-pairs have larger (less significant) p-values with respect to at least one dataset.
Finally, to assess the statistical significance of such a best ROI-pair, we use a permutation test (of the disease labels in each dataset) to check the fraction of random permutations with a more significant (smaller) min(max(p)) than the real data (we performed N = 1000 random permutations).
where min(max(p + )) corresponds to FC increases in NC versus PD. A similar relation holds for the FC decreases min(max(p − )).
The analysis was repeated for all 13 parcellations considered in this study (Table 2).

Learning classifiers for discriminating PD-related FC changes
We used machine learning techniques to learn classifiers that discriminate PD from controls using functional connectivities between ROI pairs as features.
First, we trained classifiers on each one of the 3 datasets (NEUROCON, Tao Wu, PPMI) and tested them on the other two datasets. Both Linear Support Vector Machines (SVM) and Gaussian Naïve Bayes (GNB) classifiers were tested, with progressively increasing numbers of features: N = 10,50,100,500,5000 (functional connectivities between ROI pairs). The N features selected were the best discriminating ROI pair functional connectivities, based on unpaired t-tests between normal and PD scans. As the two classes (NC-Normal Controls and PD-Parkinson's Disease) are not balanced in all 3 datasets, we employ the average accuracy Aacc = (acc(NC)+acc(PD))/2 for assessing the performance of the classifiers (a random classifier is expected to have an average accuracy of 0.5).
Since the different datasets are not technically homogeneous, we also trained and tested classifiers on random splits of the same dataset, to check to what extent the low accuracies are due to technical differences, or to disease heterogeneity. More precisely, we performed 10,000 random splits in half of each dataset, trained a classifier on one half and tested it on the other.

Results
ROI-pairs with significant group differences (NC versus PD) in functional connectivity were found in all three PD datasets: NEUROCON, Tao Wu and PPMI (Tables C and D in S1 File). However, these changes seemed at first sight to be distinct in each dataset. Our main aim in this paper has been to systematically investigate the reproducibility of the PD-related FC changes across independent validation datasets.

PD-related FC changes are non-reproducible across 3 datasets
The reproducibility of global PD-related functional connectivity changes was determined by pairwise comparisons between three independent datasets: NEUROCON, Tao Wu and respectively PPMI. Fig 2 shows the scatter-plots of ROI-pair t-values (corresponding to the group comparison NC-PD) for the three dataset pairs, indicating a lack of reproducibility of global FC changes in PD. (Perfect reproducibility would correspond to a diagonal distribution of points corresponding to ROI pairs with perfectly correlated t-values with respect to both datasets.) Moreover, discriminating ROI-pairs situated in the upper right and respectively lower left corners of one plot are not discriminating in the other plots.
For a more quantitative measure of the reproducibility of FC changes between two datasets, we computed the Pearson correlation between t-values (viewed as vectors of over all ROI pairs) with respect to each dataset (R values shown in Fig 2.) We also estimated the statistical significance (p-values) of these reproducibility measures by permutation tests of the group labels independently for the two datasets- Table 3 shows the reproducibility measure and associated p-value for various pairwise comparisons between the three datasets, with standard highpass preprocessing (>0.01Hz), global signal regression (GS) and respectively bandpass filter (0.01-0.1Hz). Since in-scanner motion may influence FC measures, we present not only a comparison between the full patient and normal control cohort, but also that corresponding to a subset of scans with low in-scanner motion (denoted by the suffix '0'). Moreover, since PPMI data were acquired at many different centers, we also considered the restriction of the PPMI data to the scans from a single center (suffix 'center32'). The AAL parcellation was used in this case, but we also study the influence of the parcellation later on.
A clear lack of reproducibility of global PD-related FC changes is observed in all the three dataset pairs. This is the first study comparing three independent rs-fMRI datasets of PD. The fact that we compare 3 datasets is very important, as it lowers the probability that the lack of reproducibility is due to a dataset that may be "faulty" in some sense-in that case, with 3 datasets we might still observe reproducibility with respect to the remaining dataset pair (which we do not see in reality).

Inconsistent reproducibility of FC changes in heterogeneous dataset splits indicates disease heterogeneity
The non-reproducibility across 3 datasets mentioned above seems to be due to disease heterogeneity, but it could also be due to technical differences. To exclude the latter Table 3 (a) We first constructed random splits by placing different subjects in the two splits, with all the replicate scans of a subject in the same split ("split subjects", heterogeneous splits). As can be seen in Fig 3A (blue curve), a large fraction of these random splits display non-reproducible functional connectivity changes. More precisely, 88% of the random heterogeneous splits show non-reproducibility at the p>0.01 level and 42% at the p>0.05 level. Fig 3B shows the complementary cumulative distribution function (1-CDF) for the corresponding reproducibility measure R (blue curve), while Fig 3C presents a typical scatter-plot of ROI-pair t-values for a random heterogeneous split (one with R equal to the median).

. Reproducibility measure R and associated p-value p for various pairwise comparisons between datasets with standard highpass preprocessing ('standard'), global signal regression (GS) and respectively bandpass filter (BP) ('0'-'low in scanner motion', 'center32'-scans performed at a single PPMI center
(b) Next, we constructed random splits by placing each replicate scan of the same subject in a different split, so that the two splits contain (different) scans of the same subjects ("split replicates", homogeneous splits). In contrast to (a), all homogeneous splits showed reproducibility at the p<10 −3 level (Fig 3A, red curve). Fig 3B shows the complementary CDF for the reproducibility measure (red curve)-note the significantly higher reproducibility (R) values for the homogeneous splits (red curve, median R = 0.71) as compared to the heterogeneous splits (blue curve, median R = 0.21). Fig 3D displays a typical scatter-plot of ROI-pair t-values for a random homogeneous split.
The observed non-reproducibility in a large fraction of heterogeneous dataset splits indicates disease heterogeneity, in line with the comparison between the 3 independent PD datasets. As a control, we did indeed observe consistent reproducibility with respect to all homogeneous dataset splits, demonstrating that the technical noise could not have been the dominating factor behind the erratic non-reproducibility in heterogeneous splits.
The fact that the well-known clinical heterogeneity of Parkinson's disease is also accompanied by heterogeneity in resting state functional connectivity may not retrospectively be a big surprise to an experienced neurologist, although its exact extent could not have been estimated a priori, before analyzing the data. However, does this FC heterogeneity in PD also imply the lack of practical usefulness of rs-fMRI functional connectivity? Are there any other conditions that can be reliably differentiated using resting state functional connectivity? To answer these questions, we applied our approach to a different, potentially more homogeneous contrast, namely that between eyes open and eyes closed resting state conditions in healthy volunteers. Repeating our analysis of reproducibility of FC group changes on random splits of the Beijing eyes open-eyes closed dataset [39] (see Supporting Information) revealed reproducibility (p<0.05) not just in the homogeneous dataset splits, but also in the heterogeneous ones (Fig A  in S1 File-only 6% of the heterogeneous and just 0.8% of the homogeneous random splits were non-reproducible at the p>0.05 level).
Summing up our findings, from the point of view of global FC changes, Parkinson's disease is heterogeneous, as opposed to the eyes open-eyes closed contrast, which is much more homogeneous (Table 4). Exploring the reproducibility of functional connectivity alterations in Parkinson's disease

Influence of technical factors and preprocessing on reproducibility
We found good reproducibility when changing various technical factors or processing options of the NEUROCON data, such as (see Fig 4): • doubling the repetition time (TR), • registration (linear versus nonlinear), • global signal regression (with versus without).
This is in line with our conclusion that functional heterogeneity, rather than these technical factors, is the dominating factor behind the lack of reproducibility of FC changes in different rs-fMRI studies of Parkinson's disease.
We also tested the influence of various rs-fMRI denoising methods on the reproducibility of PD-related FC changes, such as ICA-FIX [40,41], or regression of the mean white matter and/or cerebrospinal fluid signal-none of these denoising methods changed the observed non-reproducibility (data not shown).
We also observed no improvement in reproducibility across random splits of the NEURO-CON dataset after regressing out potential confounders, such as age, gender, or disease duration (data not shown).

Influence of parcellation on reproducibility
We have argued that functional connectivity must be computed with respect to an unbiased parcellation (i.e. one that hasn't been constructed from any of the analyzed datasets). However, any given parcellation has also specific biases that may in principle affect the capacity to discriminate between PD and normal controls-especially relevant factors are the average ROI size and the number of ROIs. Testing the reproducibility of the PD-related global FC changes using 13 different parcellations, with varying numbers of ROIs (between 90 and 950, see Table 2), revealed a lack of reproducibility regardless of parcellation, or dataset pair (Table 5). (The NEUROCON-PPMI comparison was marginally significant (p = 0.05) for the NC-PD contrast, but this significance didn't survive perturbations such as selecting just the 'center32' Exploring the reproducibility of functional connectivity alterations in Parkinson's disease scans from PPMI (p = 0.259), or restriction to the low motion scans NC0-PD0 (p = 0.26), or NC0-PD0_center32 (p = 0.555).)

Marginally significant individual differentiating FC changes in PD
Despite non-reproducibility of PD-related global FC changes across different datasets, a small number of ROI-pairs that distinguish PD from controls may nevertheless, in principle, show reproducible changes across datasets. To check for this possibility, we concentrated on individual brain region pairs with FC changes that are significant w.r.t. all datasets, by sorting the ROI-pairs according to their least significance max(p) with respect to all datasets.
For example, Table 6 shows the ROI-pairs with FC decreases in PD (i.e. positive t-values, corresponding to NC>PD) and max(p) < 0.05 for the Power264 parcellation, without global signal regression. The best ROI-pair has max(p + ) = 0.0125, so min(max(p + )) = 0.0125.
To check whether this min(max(p + )) is statistically significant, we performed permutation tests as described. Table 7 lists these min(max(p ± )) values as well as their associated significance p(min(max(p ± ))) for all 13 parcellations. Only two out of the 13 parcellations yielded significant ROI-pairs at the p<0.05 significance level ('Power264' and 'Talairach'), while a third parcellation produced only marginally significant ROI-pairs ('Shen100', p = 0.055)-see Table 8 and Fig 5. These (marginally) significant ROI-pairs involve visual-sensorimotor, respectively visual-parietal association areas. Whether these ROI-pair changes are more widely reproducible or not will have to await the release of more publicly-available PD rs-fMRI datasets.

Testing classifiers for discriminating PD-related FC changes
Training classifiers on functional connectivity data for each one of the 3 datasets (NEURO-CON, Tao Wu, PPMI) and testing them on the other two datasets produced average accuracies on test data in the range 0.225-0.7 (mean 0.497, standard deviation 0.073), while a random classifier is expected to have an average accuracy of 0.5. Fig 6 shows the corresponding average accuracies Aacc = (acc(NC)+acc(PD))/2 for standard preprocessing (with the default FSL highpass filter at 0.01Hz), global signal regression and respectively bandpass filtering (0.01-0.1Hz) for both linear SVM and Gaussian Naïve Bayes (GNB) classifiers with N = 5000 features (out  Table E in S1 File for the accuracies of classifiers with N = 10,50,100,500,5000 features. Since for each training dataset (for example NEUROCON), we have two different test datasets (PPMI and TaoWu in our example), we also computed an aggregated average accuracy for each dataset by taking the mean of the two average accuracies corresponding to the two remaining test datasets (Aacc(dataset-dataset1)+Aacc(dataset-dataset2))/2. The resulting aggregated average accuracies were low, in the range 0.336-0.591 (mean 0.497, standard deviation 0.0522, compared to 0.5 for a random classifier; see also Fig 7).
Since the three datasets are not technically homogeneous, we also trained and tested classifiers on random splits of the same dataset, to check to what extent the low accuracies are due to technical differences, or to disease heterogeneity. Fig 8 shows the average accuracies for 10,000 random splits in half of each dataset and various preprocessing options. Again, the means of the average accuracies over the 10,000 tests were low, in the range 0.51-0.66, reinforcing the evidence for disease heterogeneity.
Finally, for a a more direct graphical depiction of the heterogeneity of the functional connectomes of the PD patients, we have applied consensus NMF clustering [43] of the PD-related FC changes for a progressively increasing number of clusters (k = 2,. . .,18, Fig C in S1 File). Note that besides the consistent grouping of the replicate scan pairs for each patient, it is difficult to single out an optimal number of clusters k.

Discussion
The accelerated increase in the number of functional connectivity studies of Parkinson's Disease requires a consolidation of the knowledge in this field for enabling the development of clinically relevant rs-fMRI markers for this disease. Unfortunately however, there are many inconsistencies between published works and virtually no high confidence reproducibility studies.
This is the first study investigating the reproducibility of functional connectivity changes in Parkinson's disease on more than two datasets. The fact that we use a uniform data processing workflow for all datasets excludes a large number of technical factors as potential culprits for the observed differences between datasets. Also, the fact that our comparison involves three datasets is essential, as it lowers the probability that the observed lack of reproducibility is due to a problematic dataset-in such a case, with 3 datasets we might still observe reproducibility with respect to the remaining dataset pair, something which we do not see in reality.
To better clarify the issue, we devised a method to directly check for disease heterogeneity using random splits of a single dataset with replicate scans. Technical differences can then be excluded since all the scans have been acquired under identical technical conditions. The fact  that we still observe non-reproducibility in a significant fraction of random subsamples of each individual dataset (these subsamples being technically homogeneous as they come from the same dataset), suggests that functional heterogeneity may be a dominating factor behind the lack of reproducibility of functional connectivity alterations in different resting state fMRI studies of Parkinson's disease.
This could be due to the heterogeneous multi-lesional topography and progression of the neurodegenerative process, possibly accompanied by variable compensatory functional circuit changes, as well as by changes due to dopaminergic medication [26].
The heterogeneity of the functional connectome changes in PD is also more directly apparent in the consensus clustering plots (Fig C in S1 File).
While global PD-related functional connectivity differences were non-reproducible across datasets, we identified a few individual ROI pairs with marginally consistent FC differences across all three datasets. However, finding out whether these differences are more widely reproducible or not will have to await the release of more public PD datasets.
Additionally, we applied more sophisticated multivariate machine learning techniques to learn classifiers that discriminate PD from controls using functional connectivities between ROI pairs as features. However, training classifiers on each one of the three datasets (NEURO-CON, Tao Wu, PPMI) produced only low accuracies on the remaining two (test) datasets, in line with the preceding results. Furthermore, since the three datasets are not technically homogeneous, we also trained and tested classifiers on random splits of the same dataset, to more directly check to what extent the low accuracies are due to technical differences, or to disease heterogeneity. Again, we obtained low average accuracies (with means in the range 0.51-0.66), reinforcing the evidence for disease heterogeneity. Interestingly, these results are consistent with a recent study [44] on multisite generalizability of schizophrenia diagnosis based on functional brain connectivity, which reported multisite classification accuracies below 70%, in contrast to over 30 previously published, largely single-site schizophrenia studies, whose average reported classification accuracy exceeds 80%. Therefore, given the paucity of publicly available rs-fMRI PD datasets, we advocate the critical importance of data sharing for enabling the discovery of reproducible and clinically useful functional imaging biomarkers of PD. In this regard, we view our study as an important first step towards more refined reproducibility studies that would be possible only with more publicly available datasets. In view of the many inconsistencies found in the published literature on PD-related functional connectivity changes, we strongly argue for a direct computational comparison of PD rs-fMRI datasets using a uniform data processing workflow, to avoid publication bias as well as processing workflow differences in the separate studies.

Limitations
The present study has concentrated on PD-related changes in functional connectivity (loosely viewed as correlations between different regions of interest), rather than changes in fluctuations of the amplitude of the rs-fMRI signal. In a complementary study, Wu et al. [45] observed PD-related changes in ALFF, but with rather limited reproducibility. An in-depth analysis of the reproducibility of PD-related differences in the amplitude of fluctuations is out of the scope of the present paper.