The Effects of FreeSurfer Version, Workstation Type, and Macintosh Operating System Version on Anatomical Volume and Cortical Thickness Measurements

FreeSurfer is a popular software package to measure cortical thickness and volume of neuroanatomical structures. However, little if anything is known about measurement reliability across various data processing conditions. Using a set of 30 anatomical T1-weighted 3T MRI scans, we investigated the effects of data processing variables such as FreeSurfer version (v4.3.1, v4.5.0, and v5.0.0), workstation (Macintosh and Hewlett-Packard), and Macintosh operating system version (OSX 10.5 and OSX 10.6). Significant differences were revealed between FreeSurfer version v5.0.0 and the two earlier versions. These differences were on average 8.8±6.6% (range 1.3–64.0%) for volume and 2.8±1.3% (range 1.1–7.7%) for cortical thickness. Differences about a factor of two smaller were detected between Macintosh and Hewlett-Packard workstations and between OSX 10.5 and OSX 10.6. The observed differences are similar in magnitude to effect sizes reported in accuracy evaluations and neurodegenerative studies. The main conclusion is that, in the context of an ongoing study, users are discouraged from updating to a new major release of either FreeSurfer or the operating system, or from switching to a different type of workstation, without repeating the analysis; the results thus give quantitative support to successive recommendations stated by FreeSurfer developers over the years. Moreover, in view of the large and significant cross-version differences, it is concluded that formal assessment of the accuracy of FreeSurfer is desirable.


Introduction
FreeSurfer (Athinoula A. Martinos Center for Biomedical Imaging, Harvard-MIT, Boston) comprises a popular and freely available set of tools for deriving neuroanatomical volume and cortical thickness measurements from automated brain segmentation (http://surfer.nmr.mgh.harvard.edu), recently summarised by Fischl [1]. A number of reported studies discussed the accuracy of the technique by comparing the volume of specific brain structures, such as the hippocampus or amygdala, with manually derived volumes [2–5]. The measurement of cortical thickness was validated against histological analysis [6] and manual measurements [7,8]. The reliability of the measurements has also been the subject of a number of investigations. Some of these studies addressed the effect of scanner-specific parameters, including field strength, pulse sequence, scanner upgrade, and vendor (cortical thickness: [9,10]; volume: [11]). In addition, the scan-rescan variability of a number of subcortical brain volumes was assessed [12–14]. Finally, it has been shown that FreeSurfer is capable of reliably capturing (subtle) morphological and pathological changes in the brain (e.g., [5,13]).
Since FreeSurfer is CPU-intensive (20-30 hours per brain for a full segmentation is not exceptional), it is common practice to distribute the computational load among the available central processing units (CPUs) on a single workstation and/or among several workstations. Given this context, a number of questions suggest themselves: (1) does every CPU produce the same results; (2) is there any interaction between the processes running simultaneously on the same workstation; (3) does every workstation produce the same results?
Just like similar neuroimaging packages, new releases of FreeSurfer are issued regularly, fixing known bugs and improving existing tools and/or adding new ones. Each release is accompanied by documentation describing the changes relative to the previous release (http://surfer.nmr.mgh.harvard.edu/fswiki/ReleaseNotes). However, transition to a new release during the course of a study may affect the results and is therefore discouraged by the developers of FreeSurfer. This potential source of variation in outcome may invalidate comparisons between different studies. As yet, the sources and effect sizes of these variations have never been investigated in detail.
A related question is whether differences in the results may arise due to different releases of the operating system (OS).
The goal of the present study was to address the above-mentioned questions by repeating the automated segmentation on the same workstation and on different workstations using a set of 30 anatomical T1-weighted MRI scans. Three different versions of FreeSurfer were used on a single Hewlett-Packard workstation and several Macintosh workstations running under two different OSX versions. In particular, we aimed to get insight into the variabilities resulting from these different data processing conditions and to compare these with reported accuracy and reliability results and morphological and pathological cerebral changes. Although the developers of FreeSurfer have been explicitly recommending users not to mix FreeSurfer versions, platforms, and OS versions within a study (see public archives at http://www.mail-archive.com/freesurfer@nmr.mgh.harvard.edu), to our knowledge this is the first time that the effects of these different processing conditions have been quantified systematically.

Ethics statement
The study was approved by the ethics committee of the Maastricht University Medical Center and all participants gave written informed consent in accordance with the committee's guidelines and with the Declaration of Helsinki [15]. All patients were mentally competent to consent as evaluated by trained psychology graduates during the screening and informed consent procedures, i.e., participating patients fully understood the information disclosures and study procedures.

MRI acquisition
MRI scans were acquired with a 3.0 T Siemens Allegra MRI scanner (Siemens Medical Systems, Erlangen, Germany). Coronal T1-weighted images were obtained using an ADNI MP-RAGE sequence with TR = 2250 msec, TE = 2.6 msec, and a flip angle of 9°. The number of slices was 192 and slice thickness 1.0 mm with no interslice gap. The image matrix was 256×256 and the field of view 256×256 mm. The resulting voxel size was 1.0×1.0×1.0 mm³.

Participants
For the current study, data from an ongoing longitudinal MRI study were used [16]. From a large sample consisting of 89 patients with psychotic disorder, 98 siblings of patients with psychotic disorder, and 87 controls, a total of 30 participants were randomly drawn, 10 out of each group. The age (years) of the individuals was 28.1±5.

Workstations
Two workstations and corresponding operating systems were at our disposal for this study (Table 1). On the Macintosh (Mac) platforms, FreeSurfer used the UNIX shell while on the Hewlett-Packard (HP) platform, LINUX was used (CentOS 5.3). One Mac workstation was configured to run under two different OS versions by means of an external disk. Although OSX 10.6 is able to run in 64 bits mode, we used 32 bits mode only (see next section). By contrast, on the HP platform, CentOS was used in 64 bits mode.

FreeSurfer
The FreeSurfer analysis pipeline comprises two main processing streams, a volume-based stream and a surface-based stream. The volume-based stream is designed to assign a neuroanatomical label to each (sub)cortical voxel, whereas the surface-based stream is developed to derive the white and pial surfaces from which, among others, cortical volumes and cortical thickness (CT) are derived. More details can be found in Document S1 and references [2,17–28].
The volumes are presented by FreeSurfer in the form of tables and labeled voxels. The tabulated volumes are more accurate than the voxel volumes because they are corrected for partial volume effects. Both types of volumes were used in our analysis (see Document S1 for more details).
In order to compare the results with accuracy results previously reported by Lehmann and colleagues [5], a few white matter and grey matter regions were merged to produce a whole gyrus or lobe (left and right), such as medial-inferior temporal gyrus (MITG), superior temporal gyrus (STG), and temporal lobe (TempL). Similarly, total ventricle volume (Ventr) was constructed (left and right added together). In this manner, a total of 7 composite volumes were assembled. (The respective segmentation labels were kindly provided to us by Dr. Manja Lehmann, University College London, UK; see Document S1 for more details.)
In total, we computed 190 (sub)cortical volumes (185 for v5.0.0) and 68 CT values. It should be noted that no manual corrections were made to any of the FreeSurfer results in order to ensure a valid analysis. However, a visual inspection was performed to check the segmentations.
Three versions of FreeSurfer were used: v4.3.1, released on 19 May 2009, v4.5.0, released on 11 August 2009, and v5.0.0, released on 16 August 2010. For the Mac workstations these are 32 bits versions (due to problems building some third-party libraries in 64 bits mode on the Mac), whereas for the HP workstation these are 64 bits versions.

The experiments
A number of experiments were carried out to examine the variability of the results due to different data processing conditions by repeating the data processing ("run") on the same dataset:
1. Difference between repeated single runs on Mac and HP workstations
2. Effect of parallel runs using eight processors on Mac and HP workstations
3. Difference between runs with v4.3.1, v4.5.0, and v5.0.0 on Mac and HP workstations
4. Difference between runs with OSX 10.6.4/5 and OSX 10.5.8 on Mac workstations
Each run was started by opening a terminal window which initialised the environmental variables, followed by issuing the FreeSurfer "recon-all" command to start the data processing stream. A single run meant that only one stream was active at any time. Parallel runs on eight processors involved opening eight terminal windows and issuing the recon-all command in each window. This resulted in eight instances of the FreeSurfer pipeline being active at the same time, during which all available CPU power was mobilised. Obviously, on the iMac with two processors only two streams could be active at the same time.
The experiments were carried out on multiple Mac workstations and a single HP workstation. Experiments 1 and 2 were designed (1) to disclose any difference between runs on the same workstation and between runs on different workstations and (2) to reveal any interference between parallel running streams. Experiment 3 provided insight into the effects of different versions of FreeSurfer. In experiments 1 to 3, OSX 10.5.8 was used on the Mac workstations. Finally, the effects of different Mac OSX versions were investigated in experiment 4 for all FreeSurfer versions used.
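The parallel runs described above can be approximated with a small launcher script instead of separate terminal windows. The following is a minimal sketch, not the procedure used in the study: the file layout (`<input_dir>/<subject>.nii.gz`), the function names, and the batch-wise scheduling are illustrative assumptions, and the FreeSurfer environment (e.g., `SUBJECTS_DIR`) is assumed to be initialised already. Only the `recon-all` options `-s`, `-i`, and `-all` are used. Note that a batch waits for all of its streams to finish before the next batch starts, which differs slightly from eight independently managed terminals.

```python
import os
import subprocess
from typing import List, Optional, Sequence


def build_recon_all_commands(subjects: Sequence[str], input_dir: str) -> List[List[str]]:
    """Build one recon-all command line per subject.

    Assumption (hypothetical layout): each T1 volume is stored as
    <input_dir>/<subject>.nii.gz.
    """
    commands = []
    for subject in subjects:
        t1_path = os.path.join(input_dir, subject + ".nii.gz")
        commands.append(["recon-all", "-s", subject, "-i", t1_path, "-all"])
    return commands


def make_batches(commands: Sequence[List[str]], max_parallel: int) -> List[List[List[str]]]:
    """Split the command list into consecutive batches of at most max_parallel."""
    if max_parallel < 1:
        raise ValueError("max_parallel must be at least 1")
    return [list(commands[i:i + max_parallel])
            for i in range(0, len(commands), max_parallel)]


def run_in_batches(commands: Sequence[List[str]], max_parallel: int = 8,
                   dry_run: bool = True) -> List[Optional[int]]:
    """Run the commands batch by batch and return their exit codes.

    With dry_run=True nothing is executed and every exit code is None.
    """
    exit_codes: List[Optional[int]] = []
    for batch in make_batches(commands, max_parallel):
        if dry_run:
            exit_codes.extend([None] * len(batch))
            continue
        # Start all streams of this batch, then wait for each to finish.
        processes = [subprocess.Popen(cmd) for cmd in batch]
        exit_codes.extend(proc.wait() for proc in processes)
    return exit_codes
```

With 30 subjects and eight streams, this scheduling yields batches of 8, 8, 8, and 6 runs; setting `max_parallel=1` corresponds to the single-run condition.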

Statistical measures
Several statistical measures were employed to quantify the effects studied, such as mean difference and standard deviation, and the coefficient of variation (COV), defined as the percentage standard deviation relative to the mean. In addition, we computed the measure of spatial overlap of a structure, also known as similarity index (SI) or Dice coefficient [29]. Its range is between 0 (no overlap) and 1 (complete overlap). Finally, the intra-class correlation coefficient, ICC, based on the one-way random effects model [30] was calculated. More details can be found in Document S1.
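To make these measures concrete, the sketch below computes them for paired measurements from two processing conditions. It is an illustration under stated assumptions rather than the code used in the study: the COV follows the definition given with Table 3 (standard deviation of the signed differences relative to the average of the two measurements), the Dice coefficient operates on sets of voxel coordinates, and the ICC uses the standard one-way random effects formula, ICC = (MSB − MSW) / (MSB + (k − 1)·MSW). The function names are hypothetical, and the exact variants described in Document S1 may differ.

```python
from statistics import mean, stdev
from typing import Sequence, Set, Tuple

Voxel = Tuple[int, int, int]


def coefficient_of_variation(a: Sequence[float], b: Sequence[float]) -> float:
    """COV in percent: SD of the signed differences relative to the
    average of both sets of measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    grand_mean = mean(list(a) + list(b))
    return 100.0 * stdev(diffs) / grand_mean


def dice_coefficient(voxels_a: Set[Voxel], voxels_b: Set[Voxel]) -> float:
    """Spatial overlap (SI): 0 means no overlap, 1 means complete overlap."""
    if not voxels_a and not voxels_b:
        raise ValueError("both structures are empty")
    return 2.0 * len(voxels_a & voxels_b) / (len(voxels_a) + len(voxels_b))


def icc_one_way(measurements: Sequence[Sequence[float]]) -> float:
    """ICC for the one-way random effects model.

    measurements[i] holds the k repeated values for subject i.
    Raises ZeroDivisionError if all values are identical.
    """
    n = len(measurements)
    k = len(measurements[0])
    if n < 2 or k < 2 or any(len(row) != k for row in measurements):
        raise ValueError("need at least 2 subjects with k >= 2 values each")
    grand_mean = mean(v for row in measurements for v in row)
    row_means = [mean(row) for row in measurements]
    # Between-subject and within-subject mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum((v - m) ** 2
                    for row, m in zip(measurements, row_means)
                    for v in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

For a comparison between two conditions, k equals 2 and each row holds the two values obtained for one participant.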
For the statistical analysis, the paired Student t test was applied, since each time the outcome of two conditions was compared. We considered two levels of statistical significance, both corrected for multiple comparisons. The first level was set to p<0.05/N, where N is the number of tested volume or CT measurements (classical Bonferroni correction). For the second level, the False Discovery Rate (FDR) method [31] was applied, which is less stringent than the classical Bonferroni correction. With regard to cortical thickness, N equaled 68. With respect to volumes, N depended on the type of volume (tabulated or voxel) and the considered comparison, where we excluded the 7 composite volumes because of their dependence on the other volumes. For versions v4.3.1 and v4.5.0, N was 183/182 (tabulated/voxel) and for version v5.0.0, N was 178/178. In case of a comparison of version v4.3.1 or v4.5.0 with v5.0.0 a total of 176/178 common volumes existed.
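The two significance levels can be illustrated with a short sketch that takes a list of p values (one per structure) and flags those that survive each correction. The Bonferroni rule is as stated above (p<0.05/N). For the FDR level, the sketch assumes the Benjamini-Hochberg step-up procedure; whether this matches the method of reference [31] in every detail is an assumption, and the function names are hypothetical.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag p values below alpha / N, with N the number of tests."""
    n = len(p_values)
    return [p < alpha / n for p in p_values]


def fdr_bh_significant(p_values: Sequence[float], q: float = 0.05) -> List[bool]:
    """Flag p values that pass the Benjamini-Hochberg step-up procedure.

    Sort the p values, find the largest rank k with p_(k) <= (k / N) * q,
    and declare the k smallest p values significant.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            max_rank = rank
    significant = [False] * n
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            significant[idx] = True
    return significant
```

Because the FDR threshold grows with the rank of the p value, it flags at least as many structures as the Bonferroni rule, in line with the statement that it is the less stringent of the two.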

Results
No differences were detected between repeated single runs nor between single runs and parallel runs on the same workstation and for the same FreeSurfer and OS version. For the same OS version, all Mac workstations produced identical results. However, differences were revealed between:
- Mac and HP workstations
- FreeSurfer versions v4.3.1, v4.5.0, and v5.0.0
- OSX 10.5.8 and OSX 10.6.4/5
Since we did not find any differences between OSX 10.6.4 and OSX 10.6.5, we will use the terms OSX 10.5 and OSX 10.6 henceforth for OSX 10.5.8 and OSX 10.6.4/5, respectively.
The differences are presented in more detail below, starting with an overview and subsequently zooming in on specific structures and data processing comparisons. For the volume measurements, only voxel volumes are considered since the results for tabulated volumes were very similar.

Significance of result differences
A complete overview in the form of colored cells for all comparisons is illustrated in Figure 1 (voxel volume) and Figure 2 (CT). With such a representation, reminiscent of a DNA microarray, one can get a good impression of the results at a glance. By far the most colored cells were found for the cross-version comparisons (v4.3.1 vs. v5.0.0 and v4.5.0 vs. v5.0.0).
As described above, the level of statistical significance after correction for multiple comparisons depends on the data processing contrast being considered, see Table 2. It turned out that for the volumes, significant differences were derived only for the cross-version contrasts v5.0.0 vs. the two earlier versions (both Mac and HP). However, for CT, significant differences were found also for some other data processing comparisons. If FDR correction was applied to the volume results, then almost all the colored cells for the cross-version contrasts v5.0.0 vs. v4.3.1/v4.5.0 (both Mac and HP) in Figure 1 represent significant differences. In fact, about half of the 178 structures were significant. With a conservative Bonferroni correction, about a quarter of the structures were significant. For CT rather similar results were obtained: about half/a quarter of the 68 cortical structures were significant after FDR/Bonferroni correction in case of the cross-version contrasts v5.0.0 vs. v4.3.1/v4.5.0 (both Mac and HP). Furthermore, in case of the HP vs. Mac and Mac OSX 10.6 vs. OSX 10.5 contrasts, significant CT differences were found for versions v4.3.1 and v4.5.0, whereas no significant CT differences were present at all for version v5.0.0.

Strength of result differences
A summary of a subset of descriptive statistics for all structures is given in Table 3 (voxel volume) and Table 4 (CT). More details can be found in the Supplementary material, e.g., Table S1 for voxel volumes, Table S2 for tabulated volumes, and Table S3 for cortical thickness. The largest differences (mean as well as range) were found for the cross-version contrasts v4.3.1/v4.5.0 vs. v5.0.0 (both Mac and HP). Generally, the mean (signed or absolute) differences and the COVs for these contrasts were about a factor of two larger than for the other contrasts. The differences manifested a large variation across the structures. The largest absolute volume difference was found in the 5th ventricle and Mac v4.5.0 vs. v5.0.0 contrast: about 64%. However, for the other contrasts some absolute volume differences were large too (about 40%).
To zoom in on the largest differences observed and to show also the corresponding statistical significances, overlays were produced on the inflated pial surfaces of an average brain (so-called "fsaverage", supplied by FreeSurfer) for the comparison between Mac version v4.3.1 and v5.0.0. Figure 3 displays the results for the grey matter (GM) cortical structures. The results for the cortical white matter (WM) structures (i.e., cortically associated gyral WM structures) are depicted in Figure 4. Although it may not be anatomically correct, these structures were overlaid also on the pial surfaces for visualisation purposes. The pattern of WM structures showing highly significant differences was rather similar to the pattern found for GM. Note that almost all of the largest differences are associated with the smallest p values for both GM and WM.
For the subcortical structures we generated overlays on coronal, sagittal and transversal slices of T1 data of a single participant, transformed to the MNI305 standard space (Figure 5). Structures showing highly significant differences were the brain stem, the right amygdala, the right accumbens area, and the anterior, midposterior, and posterior partitions of the corpus callosum, left cerebellar white matter, and finally, the left lateral ventricle. Note that the fragmentation of some structures (e.g., cerebellum) is due to the application of a nearest neighbor interpolation in the transformation to the MNI305 template.

Figure 1. Overview of the statistical significance of voxel volume comparisons for all considered structures. Each cell is color-coded according to the value of −log10(p), ranging from black (p>0.05) to white (p≤0.00001), see Figure 2 for the color coding scale. The first three columns show the results obtained by comparing HP with Mac workstation for FreeSurfer versions v4.3.1, v4.5.0, and v5.0.0, respectively. The p values for the differences between the versions v4.3.1, v4.5.0 and v5.0.0 are shown in columns 4 to 6 for the Mac and in columns 7 to 9 for the HP, respectively. Finally, the last three columns refer to the contrast between OSX 10.6 and OSX 10.5 for the three considered FreeSurfer versions. Cells with a small black rectangle inside denote differences which are not significant anymore after FDR correction for multiple comparisons. White cells with an "X" represent structures for which no comparison could be made, such as left and right cerebral cortex and left and right cerebral white matter, because these are no longer available in FreeSurfer v5.0.0. In the heading row, the labels 431, 450, and 500 denote FreeSurfer v4.3.1, v4.5.0, and v5.0.0, respectively. doi:10.1371/journal.pone.0038234.g001

Reliability of FreeSurfer Measurements
PLoS ONE | www.plosone.org
The pial surface overlays for the CT values are displayed in Figure 6. In this case highly significant differences were found for the structures left and right rostral anterior cingulate cortex, left and right isthmus cingulate cortex, left postcentral gyrus, left superior parietal cortex, left superior frontal gyrus, right insula, right pars triangularis, and right posterior cingulate cortex.

Additional results
Results regarding the determinant of the Talairach transformation matrix and variability under Mac OSX 10.6 can be found in Document S1 and Table S4.

Discussion
In this study, an in-depth analysis was made of the performance of FreeSurfer under various data processing conditions, such as Mac and HP workstations and three versions, v4.3.1, v4.5.0 and v5.0.0. For this analysis, T1 scan data were used pertaining to a sample of 30 individuals participating in an ongoing longitudinal study. A number of experiments were conceived in order to gain insight into the variability of the results by repeating the data processing ("run") for the same individual(s). To our knowledge no previous research of this type has been conducted, at least not for FreeSurfer. Significant differences in volume and cortical thickness were revealed across FreeSurfer versions. In addition, less pronounced differences were found between the Mac and HP workstations and between Mac OSX 10.5 and OSX 10.6.

General findings
We first investigated if any differences would occur if runs were repeated on the same workstation using a single run or parallel runs. Since this did not reveal any differences, and thus no interference between parallel running streams occurred, we could safely run as many as 8 pipelines simultaneously on both Mac and HP platforms. In this respect it may be stated that the FreeSurfer pipeline was properly designed. This considerably sped up our workflow and made it possible to perform the other experiments in a reasonable amount of time (nevertheless about 300 days of computer time were consumed in the present study). Although all but one of the workstations had 16 GB RAM onboard, some memory competition was still observed during parallel runs, adversely affecting the computation times by about 10-20%. It was also noted that version v5.0.0 was about 20-30% faster than the previous versions.
The other experiments conducted uncovered differences across workstations, FreeSurfer versions, and Mac OSX versions. Particularly large and significant differences in volume and cortical thickness were apparent between version v5.0.0 and earlier versions. In that case, about half of the 178 volume and 68 cortical thickness measurements were significant after FDR correction for multiple comparisons (without any correction almost all structures showed significant effects). Furthermore, the absolute differences were on average about 8.8% (for volumes) and 2.8% (for CT). It is beyond the scope of the present study to explore the origin of the cross-version differences in more detail. However, as the release notes describe changes to the correction for intensity non-uniformities and skull-strip stages in version v5.0.0, we anticipate that these changes present the main contributions to the observed cross-version effects.

Comparison with results reported in the literature
The differences observed in the present study can be evaluated and put into the perspective of reliability and accuracy studies or studies (cross-sectional or longitudinal) on normal or pathological changes in cerebral morphology. Regarding volume reliability, Morey et al. [14] reported an average percentage absolute volume difference across scan sessions on the same scanner of 3.2±0.03% for nine subcortical structures (including brain stem and ventricles). For the same structures, we found differences between 2% (HP vs. Mac and Mac OSX 10.6 vs. 10.5 contrasts) and 5.5% (v4.3.1/v4.5.0 vs. v5.0.0 contrasts). Similar reproducibility errors were derived by Jovicich et al. [11]: 1.5-10.2% (absolute differences). Salat et al. [13] reported within-scanner reliabilities of white matter structures on the order of 5% (signed differences) with some exceptions as large as 29.7%. Note that these values are approximately comparable to our findings (Table 3). They even found the same large variabilities for the same structures (entorhinal cortex and frontal and temporal poles) as we did in comparing version v4.3.1/v4.5.0 with v5.0.0. The combined findings illustrate that these structures are especially susceptible to changing conditions, such as scan session and FreeSurfer version. Benedict et al. [12] reported within-scanner COVs between 0.7% and 7.7%. For the structures they considered, we derived COVs in the same range in case of the HP vs. Mac and OSX 10.6 vs. 10.5 comparisons. However, our cross-version COVs were about a factor of two larger and for the left and right amygdala even a factor of three to four larger. Regarding cortical thickness reliability, Han et al. [9] reported a within-scanner variability (absolute difference) of global thickness <0.03 mm, corresponding to about 1.5%. This result is of the same order of magnitude as our findings (1.1-2.8%, see Table 4).
As for the accuracy of volume segmentation, Lehmann et al. [5] found a good correlation with manual segmentations for most of the structures they considered, with some exceptions (hippocampus, entorhinal cortex and fusiform gyrus) which they attributed to differences in delineation protocols. Since they did not present volume differences, only measures of overlap (Jaccard index) could be compared with our results. The range of Jaccard indices they reported was 0.05-0.89, corresponding to Dice indices in the range of 0.10-0.92, considerably smaller than our values which were all in excess of 0.70 for these structures. The accuracy of the hippocampus and amygdala was also studied by Morey et al. [4]: the percentage absolute volume differences with respect to manual were about 4.5% and 8.0%, respectively. These values are comparable to the cross-version differences derived in the present study (Figure 7).
Finally, the accuracy of cortical thickness was in one study better than 0.5 mm [7] and in another study better than 0.20 mm with a mean difference of 0.077 mm [6]. The latter value, corresponding to about 3.8%, is larger than the maximal mean difference of 2.8% we found.
Contrasting our results with structural changes in brain morphology due to pathology (Alzheimer's or Huntington's disease) or neuropsychiatric disorders may be even more important. For instance, Lehmann et al. [5] reported on GM volume changes as a result of Alzheimer's disease (AD) and semantic dementia (SD). They found absolute changes in volume with respect to controls in the range of 6-129% and 11-91%, respectively. Although these changes are about a factor of 10 larger than the largest differences we observed for the majority of structures, some structures are comparable in volume difference, such as the parahippocampal gyrus in AD (14%/28% (left/right) vs. 13%/11% reported here), superior temporal gyrus in AD (6%/6% vs. 4.1%/4.6%) and whole brain (4-11% vs. 4.8%). Notice that atrophy of the parahippocampal gyrus has recently been suggested as an early biomarker of AD [32]. Regional white matter volume differences between normal aging and AD were estimated at between 0.2% and 25.9% (Salat et al. 2009), again comparable to the differences we derived (Table 3). With respect to cortical thickness, Dickerson et al. [33] reported cortical thinning between 2.3% and 13.6% in AD patients compared to non-demented older controls. Rosas et al. [6], [34] studied the impact of Huntington's disease on cortical thinning. They derived differences between 5% in early stages and up to 30% in late stages. Finally, comparing patients suffering from schizophrenia with healthy controls, cortical thickness differences were in the range of −6.7% to 5.3% [7]. In our comparisons we found differences between 0.4% and 7% (Table 4), approximately of the same order of magnitude as all these reported effect sizes.
Summarising all the above comparisons, we can draw the conclusion that effect sizes resulting from differences in data processing conditions are rather similar to reliability and accuracy measurements previously reported in the literature. Therefore, our results suggest that these reliability and accuracy measurements depend on specific processing conditions, especially the version of FreeSurfer that was used. Moreover, the effect sizes we derived are more or less of the same order of magnitude as those reported in case-control comparisons in neuropsychiatric illness. The consequence is that the power of such studies may be compromised by changing the data processing conditions. In addition, the observed effects may have profound implications for longitudinal studies: if processing conditions have changed it is recommended to re-run the analysis and to absorb the computational cost. Our findings not only support the recommendations issued explicitly by the FreeSurfer developers over the last years not to mix FreeSurfer versions, platforms, and OS versions, but also validate these in terms of quantification of associated effect sizes. It should be noted in this context that not every upgrade in OS or FreeSurfer will affect the outcome of a study. Minor upgrades in the OS (e.g., from OSX 10.6.4 to OSX 10.6.5) usually bring security patches in order to maintain a safe computing environment and they will not lead to different results. Also some minor upgrades in FreeSurfer (e.g., from v4.3.0 to v4.3.1) are necessary to fix bugs specific to some area not related to the research and they can be done without affecting the results. However, a meticulous inspection of the accompanying release notes is mandatory to be sure of a safe upgrade.

Other effects
A discussion on the determinant of the Talairach transformation matrix, Mac vs. HP inconsistencies, the effects of Mac OS version, and variability under Mac OSX 10.6 can be found in Document S1.

Limitations
One limitation of the present study is that no direct analysis is made of how the processing conditions evaluated may affect the accuracy of the volume and cortical thickness measurements. It should be noted that a difference between two FreeSurfer versions does not intrinsically imply a difference in accuracy. An accuracy assessment requires manual segmentations and/or histological measurements. Although this is beyond the scope of our study, it may be a suggestion for future research.
Another limitation of the study may concern the participant sample. It comprised 10 healthy controls, 10 patients with psychotic disorder and 10 siblings of patients with psychotic disorder. A total of 30 individuals may be rather low; however, it represented a compromise between enough power and excessively long processing times. The sample is not representative in comparison with other studies including, for example, patients with Alzheimer's or Huntington's disease. However, it has been shown that FreeSurfer reliably captures subtle morphological and pathological changes in the brain, demonstrating that its performance is independent of the cerebral morphology. Therefore, the results presented here may be considered indicative of expected variabilities in FreeSurfer.

Figure 6. The same as Figure 3, but now for cortical thickness. The percentage absolute thickness difference is color coded between 0% and 7%; the full range was 1.2% (right supramarginal gyrus) to 7.7% (right isthmus cingulate cortex). A p value below 0.025 (−log10(p) above 1.602) was statistically significant after applying an FDR correction for multiple comparisons. doi:10.1371/journal.pone.0038234.g006
Finally, there may be other variables affecting the volume and cortical thickness measurements, for example CPU type and bits mode (32 or 64 bits) of the OS. FreeSurfer users install the package on various computer workstations, not only Mac and HP workstations but also Dell workstations, cluster supercomputers, and personally assembled computers. As a result, an almost infinite mixture of CPU types and operating systems exists. It is impracticable to test FreeSurfer on all possible combinations of hardware and software. The type of workstation considered in the present study is just one variable among various hardware environments affecting the results of FreeSurfer.

Conclusions
The general conclusion from the present study and the practical advice it occasions is that users of FreeSurfer should exercise caution and restraint before applying a major upgrade of either FreeSurfer (in particular) or the OS, or before switching to a different type of workstation, in the context of an ongoing study. This may be a truism and consistent with sound methodology for scientific experimentation and therefore a matter of common sense. However, the numerous questions about this issue posted by the user community seem to demonstrate the opposite and the results presented here reliably quantify the possible consequences. The message of caution applies not only to FreeSurfer but likely may be generalised to other intricate processing packages in the field of neuroimaging. The packages become more and more complex and therefore it is difficult to keep a check on propagation effects resulting from (small) modifications regarding one of the underlying algorithms.
An additional message inferred from the present study is that authors reporting on results obtained with FreeSurfer are highly recommended to provide not only the version of FreeSurfer that was used but also details on the OS version and workstation.
Finally, given the large and significant differences between the latest version v5.0.0 and earlier versions, it is concluded that an assessment of the accuracy of FreeSurfer is desirable.
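The recommendation above to report the FreeSurfer version, OS version, and workstation can be supported by recording these details alongside each analysis. The following is a minimal sketch under stated assumptions: it assumes that the installation directory (taken from the `FREESURFER_HOME` environment variable) contains a `build-stamp.txt` file identifying the build, and the function names and dictionary keys are hypothetical. The remaining fields come from Python's standard `platform` module.

```python
import os
import platform
from typing import Dict, Optional


def freesurfer_build_stamp(freesurfer_home: Optional[str]) -> Optional[str]:
    """Return the contents of <freesurfer_home>/build-stamp.txt, or None.

    Assumption: the installation ships a build-stamp.txt file.
    """
    if not freesurfer_home:
        return None
    stamp_path = os.path.join(freesurfer_home, "build-stamp.txt")
    try:
        with open(stamp_path, "r", encoding="utf-8") as handle:
            return handle.read().strip()
    except OSError:
        return None


def processing_provenance(freesurfer_home: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Collect the software and hardware details worth reporting."""
    if freesurfer_home is None:
        freesurfer_home = os.environ.get("FREESURFER_HOME")
    return {
        "freesurfer_build": freesurfer_build_stamp(freesurfer_home),
        "os_system": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        # Word size of the running Python interpreter, not necessarily the OS.
        "interpreter_word_size": platform.architecture()[0],
    }
```

Storing this dictionary (for instance as a JSON file next to each subject directory) makes it straightforward to verify later that all participants in a study were processed under the same conditions.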
displays the results for the grey matter (GM) cortical structures. Again, these overlays demonstrate a non-uniformity in difference across the cortex. Note the highly significant (−log10(p) ≥ 4; p ≤ 0.0001) differences for the left and right frontal pole, left and right insula, left and right isthmus cingulate cortex, left and right medial orbital frontal cortex, left and right rostral anterior cingulate cortex, left inferior temporal gyrus, left rostral middle frontal cortex, left temporal pole, right fusiform gyrus, right lateral orbital frontal cortex, and right parahippocampal gyrus.

Figure 3. The differences in cortical grey matter volumes between FreeSurfer version v4.3.1 and v5.0.0 on a Mac (OSX 10.5). The upper row shows the left and right percentage absolute volume differences overlaid on the inflated respective hemispheres in lateral and medial views of an average brain ("fsaverage"). The differences are color coded between 0% and 15%; the full range was 2.1% (left precentral gyrus) to 24.9% (right rostral anterior cingulate cortex). The bottom row depicts the corresponding p values (expressed as −log10(p)) of the applied Student t test. The p values are color coded between the FDR level of 1.607 (p = 0.025) and 5.000 (p = 0.00001). The dark grey regions represent sulcal folds and the light grey regions represent gyral folds. doi:10.1371/journal.pone.0038234.g003

Table 1. Workstations used in this study.
a N is the number of processors. b By means of an external disk this workstation could run under two different OS versions. Note: All Macintosh workstations used the UNIX shell and the Hewlett-Packard (HP) workstation the LINUX shell. OSX 10.6.4/10.6.5 was used in 32 bits mode, whereas CentOS was used in 64 bits mode. doi:10.1371/journal.pone.0038234.t001

40%). These findings translated to correspondingly low ICC values and overlap measures (SI). For CT, the largest absolute thickness difference was found in the right isthmus cingulate cortex and Mac v4.3.1 vs. v5.0.0 contrast: about 7.7%. Note that generally the ICC values for CT were larger than for the volume: they were all above 0.5547 compared to 0.0000 for volume measures. Of particular note is that the results of the HP vs. Mac contrast are almost identical to those of the OSX 10.6 vs. OSX 10.5 contrast, both for volume and CT values.

Overlays of differences for version v4.3.1 vs. v5.0.0 (Mac)

Table 2. Correction for multiple comparisons on volume and cortical thickness differences.
a Number of tested structures. b Number of significant structures after application of correction. doi:10.1371/journal.pone.0038234.t002

Table 3. Summary of voxel volume differences.
a Coefficient of variation is the standard deviation of the signed differences relative to the average of the two volume measurements. b ICC is the intra-class correlation coefficient. c SI is the overlap value (Dice coefficient). Note: the results for the tabulated volumes are almost identical.

Table 4. Summary of cortical thickness differences.
a Coefficient of variation is the standard deviation of the signed differences relative to the average of the two cortical thickness measurements. b ICC is the intra-class correlation coefficient. doi:10.1371/journal.pone.0038234.t004