Grey Matter Atrophy in Multiple Sclerosis: Clinical Interpretation Depends on Choice of Analysis Method

Background Studies disagree on the location of grey matter (GM) atrophy in the multiple sclerosis (MS) brain. Aim To examine the consistency between FSL, FreeSurfer, SPM for GM atrophy measurement (for volumes, patient/control discrimination, and correlations with cognition). Materials and Methods 127 MS patients and 50 controls were included and cortical and deep grey matter (DGM) volumetrics were performed. Consistency of volumes was assessed with Intraclass Correlation Coefficient/ICC. Consistency of patients/controls discrimination was assessed with Cohen’s d, t-tests, MANOVA and a penalized double-loop logistic classifier. Consistency of association with cognition was assessed with Pearson correlation coefficient and ANOVA. Voxel-based morphometry (SPM-VBM and FSL-VBM) and vertex-wise FreeSurfer were used for group-level comparisons. Results The highest volumetry ICC were between SPM and FreeSurfer for cortical regions, and the lowest between SPM and FreeSurfer for DGM. The caudate nucleus and temporal lobes had high consistency between all software, while amygdala had lowest volumetric consistency. Consistency of patients/controls discrimination was largest in the DGM for all software, especially for thalamus and pallidum. The penalized double-loop logistic classifier most often selected the thalamus, pallidum and amygdala for all software. FSL yielded the largest number of significant correlations. DGM yielded stronger correlations with cognition than cortical volumes. Bilateral putamen and left insula volumes correlated with cognition using all methods. Conclusion GM volumes from FreeSurfer, FSL and SPM are different, especially for cortical regions. While group-level separation between MS and controls is comparable, correlations between regional GM volumes and clinical/cognitive variables in MS should be cautiously interpreted.


Introduction
Brain atrophy in multiple sclerosis (MS) can be quantified and monitored over time with reliable and reproducible techniques [1] Especially grey matter (GM) atrophy seems to be correlated to clinical and neuropsychological measures [2], and it also becomes more prominent in advanced disease stages [3,4]. GM atrophy seems to be unevenly distributed across the brain in MS, with possible preference for insular cortex and thalamus, but studies disagree regarding not only the anatomical distribution of GM atrophy [5][6][7][8], but also regarding the clinical correlations of GM atrophy [9][10][11][12][13]. Drawing a conclusion from the body of published studies is difficult due to the heterogeneity of patient groups, and of the software packages used for segmentation. A study comparing methods directly in MS is so far lacking.
The goal of this study was to examine the consistency of GM atrophy measurements using three software (FSL, FreeSurfer and SPM) used frequently in MS GM atrophy literature, in a large cohort of MS patients with similar disease duration. We analyzed for each structure separately: the consistency between methods for regional GM volumes, consistency of the patients/ controls discrimination, and consistency of associations with neuropsychological measures.

Descriptives and clinical measures
As part of their 6-years follow-up for an ongoing inception cohort study (PRESTO) in which MS patients were included six years prior, just before or at diagnosis, subjects underwent an MRI scan between October 2010 and December 2012 [14]. Age-, gender-and educationmatched healthy controls from the Amsterdam area were also recruited from existing databases. The research was approved by the authors' Institutional Review Board (Medisch Ethische Toetsingscommissie of the VU University Medical Center, Amsterdam, The Netherlands) and all clinical investigation was conducted according to the principles expressed in the Declaration of Helsinki. Written informed consent was obtained from the participants, as indicated in the Methods section of the manuscript. Physical disability was measured using the Expanded Disability Status Scale (EDSS) and was relatively mild (median EDSS 2.0, range 0.0-6.0). Patients were relapse-free and without steroid treatment for at least 2 months.
Subjects underwent a comprehensive neuropsychological assessment of seven cognitive domains (executive functioning, verbal memory, information processing speed, visuospatial memory, working memory, attention, and psychomotor speed) [14] and an "average cognition" Z score was calculated by averaging Z scores of all separate domains, as previously described. This average cognition Z-score will be referred to as "cognition".

Image preprocessing
The analyses were performed using the cluster computer of the Neuroscience Campus Amsterdam running under Linux CentOS (http://www.neurosciencecampus-amsterdam.nl/en/newsagenda/news-archive/2013/NCAGRID-cluster-computer.asp). Brain volumes were analyzed on the 3D-T1 sequence, after removal of excessive neck slices and lesion-filling. The lesions were first automatically segmented on the 3D-FLAIR images using an in-house developed kNN-TTP technique [15], the resulting lesion masks being registered to the 3DT1 images. These lesion masks were used for lesion-filling with LEAP [16], in order to diminish the effect of the white matter (WM) lesions on the segmentation of the GM. To avoid filling the DGM with WM intensities, the DGM lesions were removed from the masks before lesionfilling. The workflow is depicted in Fig 1. The T1-hypointense and T2-hyperintense lesions were marked on the T1-images and the dual-echo images respectively by an experienced rater, and lesion volumes were measured using a local thresholding technique using Alice software (Hayden Image Processing Group/ Parexel International Corp, Waltham, MA).

Image processing: GM volumetrics
The following regional GM segmentation algorithms were used for lobar GM volumes and DGM volumes (details of the implementation are given in the sections below). FSL: SIENAX [17] registers the individual scan to the standard space brain (derived from the MNI-152 standard image), using the skull as a scaling constraint, and uses FAST for automated tissue-type segmentation. The Harvard-Oxford atlas delivered with FSL was non-linearly registered (using FNIRT) and used on the GM partial volume estimates for calculating the lobar cortical GM volumes. Volumetry of DGM structures was performed using FIRST [18], which models the outer surface of each DGM structure as a mesh, and finally, it assigns each voxel in the image the appropriate label to indicate of which structure it is a part, taking into account local variations in structure surface shape, as well as the presence of neighboring structures.
FreeSurfer [19,20] performs cortical (surface-based) analysis for the cortical thicknesses and volumes, and volume analysis for the DGM structures in the native space. FreeSurfer also outputs the results within the cerebral lobes (left and right): frontal, temporal, insula, cingulate, parietal, occipital. FreeSurfer was used both with and without the FLAIR images for pial-surface refinement (https://surfer.nmr.mgh.harvard.edu/fswiki/ReleaseNotes).
SPM8 (Statistical Parametric Mapping Functional Imaging Laboratory, University College London, London, UK) accessed through the VBM8 toolbox (http://dbm.neuro.uni-jena.de/ vbm/), automatically identifies and quantifies per voxel GM, WM, and CSF in all scans. The AAL atlas delivered as part of SPM toolboxes was registered with FSL-FLIRT and FSL-FNIRT to the native scans and then used for calculating the lobar GM volumes. This atlas does not map the nucleus accumbens.

Inspection of segmentation quality
All outputs were inspected for quality, and excluded if gross errors were visible.

Influence of atlas
In order to separate the influence of the segmentation algorithm from that of the atlases used in this study for each algorithm, the volumes of the cortical lobes were also calculated from the FSL and SPM segmentation by overlaying for each subject the FreeSurfer brain lobes parcellation on the GM partial volume estimates. Here, to avoid exclusion of GM voxels due to partial volume effects in the FSL and SPM segmentations, the lobar parcellation for each subject was dilated using a 3x3x3 kernel four times, using modal dilation to avoid intersections between structures. As an extra analysis, the AAL atlas registered to the SPM segmentation was also dilated using a 3x3x3 kernel four times, using modal dilation to avoid intersections between structures.

Image processing: Voxelwise and vertexwise statistical analyses
Voxel-based morphometry (SPM-VBM8 toolbox and FSL-VBM) and vertex-wise FreeSurfer were used to map, and compare between these three methods, the local GM differences between MS patients and controls. In FSL-VBM [21], the GM images were registered to the MNI-152 standard space and averaged to create a left-right symmetric, study-specific grey matter template. In SPM-VBM8 toolbox, a DARTEL template of the GM [22] of all scans was created by nonlinearly aligning the GM images to a common space of 1.5x1.5x1.5mm 3 . For both methods an 8-mm Gaussian kernel was used for smoothing, as frequently applied in literature. Voxelwise GLM was applied using parametric testing for SPM and permutation-based nonparametric testing for FSL, correcting for multiple comparisons across space.
Vertex-wise FreeSurfer analysis was used to create a study-specific template, and vertexwise comparisons were performed after smoothing using a 10-mm FWHM Gaussian kernel, as frequently applied in literature.
As gender was previously shown to greatly influence atrophy in this cohort of MS patients [14], the voxel-wise and vertex-wise statistical comparisons between patients and controls aimed at localizing GM differences were run both with and without gender and age as covariates.

Statistical analysis
The consistency of GM volumes was assessed in SPSS20 with the Intraclass Correlation Coefficient (ICC) (consistency) for each measured lobar GM and DGM volume.
For the consistency of patient/control discrimination we assessed both each anatomical structure separately and also the consistency with which all structures contributed to patientcontrol discrimination in a combined model. The consistency of patient/control discrimination for each anatomical structure separately was assessed with Cohen's d, t-tests, and MANOVA with age and gender included in SPSS20. The consistency with which all structures contributed to patient-control discrimination in a combined model was assessed with a penalized doubleloop logistic classifier [23,24] in R for each segmentation software separately. The double-loop classifier used L1 penalization based logistic regression for classifying the binomial response. The model parameters were estimated using the inner loop cross-validation and the prediction accuracy for the estimated model parameters was estimated using the outer cross-validation loop. This division of the classifier into two cross-validation loops was done to avoid over-fitting and to reduce the number of false positives.
The consistency of the association with cognition was assessed in SPSS20 with Pearson correlation coefficient between normalized lobar GM and DGM structure volumes and cognition. Additionally, multiple regression was also performed with age and gender included, using cognition as outcome measure.
Significance was considered at p < .05, after appropriate correction for multiple comparison.
Statistical analyses for FSL-VBM were performed in FSL using Randomise, for SPM-VBM in SPM, and FreeSurfer vertex-wise statistics were performed in FreeSurfer both with and without age and gender as covariates.

Results
A total of 127 RRMS patients (91 women) and 50 healthy controls (30 women) were included. Clinical data are detailed in Table 1. In order to use the same images for all analyses, 33 scans of patients (but no controls) had to be excluded after visual inspection for segmentation errors. The errors were as follows: for 17 patients the cortex was severely underestimated in FreeSurfer (because the pial surface ran through the cortex), for 15 patients gyri were missed at the segmentation in FreeSurfer, and for one patient with severe pathology and asymmetric ventricles the DGM segmentation failed for both FreeSurfer and FSL-FIRST (only the structures on the right side were correctly segmented). After excluding these subjects, 94 patients and 50 controls remained in the analyses.
The differences between the excluded patients and the patients still in the study are presented in Table 1. The excluded patients group contained more men; the mean age was higher, with lower total grey matter volume and higher WM lesion volumes.
A segmentation example for all techniques is presented in Fig 2. The normalized brain volumes from FSL-SIENAX were lower in patients compared to controls (1.49±0.06L vs. 1.53±0.07L, p < .001), as well as the normalized grey matter volumes from FSL (0.81±0.05L vs. 0.83±0.05L, p = .02). Tables 2 and 3 present the regional GM volumes obtained with each method. Only DGM volumes tended to be smaller in MS compared to controls for all methods. As expected, adding FLAIR pial surface refinement did not alter FreeSurfer results for DGM volumes, while it did change FreeSurfer cortical regional GM volumes. Table 4 provides the ICC values between methods for each GM regional volume. For the cortex, the agreement was higher, with most lobes having high agreement (ICC>.7) for all pairs of

Consistency of patient/control discrimination
The cortical regions were not significantly different between patients and controls with t-tests, but when using MANOVA (with age and gender included) the temporal lobes bilaterally were significantly different between MS and controls for FSL and FreeSurfer, but not for SPM.
The following DGM structures were significantly different between patients and controls with all methods when t-tests were used: bilateral thalamus, caudate nucleus, putamen, hippocampus, amygdala.
The Cohen's d results for all methods are depicted in Fig 4, with larger effects for the DGM structures than cortical regions for all methods, except for the pallidum in SPM. The MAN-OVA results were similar to the t-tests results for the DGM structures.
The penalized double-loop logistic classifier revealed bilateral thalamus, pallidum and amygdala to consistently contribute to patient-control separation for all three methods.
For the lobar volumes, general cognition correlated weakly (r = .206-.239) with the right insular volume with all methods, and also with the left insular volume for FSL. Significant correlations with cognition are listed in Table 5. Table 3. DGM volumes (mL) with all software (mean(SD)). Among DGM, the bilateral putamen volumes correlated with cognition for all methods (r = .226 to .383). FSL yielded the largest number of significant correlations, including thalamus and hippocampus volumes bilaterally. The thalamus volumes bilaterally also significantly correlated for SPM, but not for FreeSurfer. Surprisingly, neither FreeSurfer nor SPM yielded significant correlations with hippocampus volumes. ANOVA (with gender and age included in the model) revealed the bilateral insula and the left temporal volume as significantly correlated with cognition using all methods. Among DGM structures, the bilateral thalamus, putamen, hippocampus and the right amygdala volumes were also correlated with cognition using all methods. The left amydgala volume was also correlated with cognition when using FSL and FreeSurfer but not SPM. The other regional volumes (including right temporal lobe and bilateral pallidum) correlated with cognition for some but not all methods (Table 5).

Voxel-wise and vertex-wise comparisons between MS and controls
FSL-VBM, with and without covariates, revealed significant clusters in the thalamus, caudate nucleus, amygdala, insula, part of the left precentral gyrus. When using covariates, also both precentral gyri and the precuneus were significantly different. SPM-VBM8 revealed thalamus and caudate nucleus bilaterally to be significantly different with and without covariates. When using covariates, also the right hippocampus, bilateral putamen and the left insula were significant. FreeSurfer vertex-wise analysis revealed no significant cortical thickness differences without covariates, but in the analysis with covariates did reveal significant clusters in the left paracentral lobule, and bilateral temporal lobes (Fig 5).

Influence of the atlases on the agreement measures
When using the FreeSurfer parcellation on the FSL and SPM segmentation, the ICC values with FreeSurfer became very high (over 0.9) but also the ICC between FSL and SPM dramatically improved, with all ICC>0.7. However, for SPM, even when using the FreeSurfer parcellation, the agreement with FSL remained low for the right insula and the occipital lobes bilaterally. After dilating the AAL atlas the ICC impoved for the cortical structures but not for the DGM structures.

Discussion
Our study investigated the agreement between three software packages used frequently in MS GM atrophy literature, using a large cohort of MS patients with a single disease type, of homogeneous disease duration, and matched controls. We not only analyzed the consistency of volume measurements but also the consistency of patient-controls discrimination and the consistency of correlations with cognition. Marked differences between software packages were observed, as the consistency of volume measurements was suboptimal with ICC values ranging  Table 4) and b. L-Amygdala (low ICC values-see Table 4). doi:10.1371/journal.pone.0143942.g003

Consistency of Grey Matter Measurement in MS
widely. The structures with the lowest agreement were the insula, amygdala, nucleus accumbens and hippocampus, which have received a lot of interest in MS research [9,[25][26][27][28].
Although the agreement was low, it may not affect patient-control discrimination at the group level, as shown by the fact that the Cohen's d values and the combined models from the logistic classifier were comparable between methods in our study. However, the low ICC values indicate potential difficulties when addressing relations with clinical or cognitive variables, and this was confirmed by the variability across methods of correlations with overall cognition found in this study.
Previous studies report different regions as the site of GM atrophy in MS; these differences may be due to the study methodology including the choice of image analysis method used to measure the GM atrophy. We were able to confirm this in a large MS patients cohort. Differences between automated methods have already been reported in bipolar disorder [29] and ALS [30]. A previous study in MS has shown differences in whole brain atrophy measurements between software packages [31]. The reason for such strong differences between software packages is difficult to pinpoint. Another previous study evaluated the accuracy of GM segmentation in MS against a manually drawn gold-standard in a small patient group [32]. That study already indicated that the methods do not agree on the precise location of the structures. In our study due to the large sample size we were unable to produce a manual segmentation reference, so we opted for the comparison of actual automated methods used in the literature. Also in our analysis comparing lobar volumes, and not at a voxel-wise level, there is considerable variation between methods which validates the results in the aforementioned study [32]. The fact that DGM structures display larger differences between methods than cortical GM may be due, first, to the larger size of the lobes compared to the DGM structures, leading to smaller relative effects of small segmentation differences, and second, to the less conspicuous nature of the DGM borders compared to the cortical borders. Lesion misclassification may be a major problem for techniques such as SIENAX and VBM, but not FIRST [33]. We tried to minimize the effect of the presence of white matter lesions or excessive neck slices in the image by using only images after neck-clipping and lesion-filling. Differences between the software packages could therefore stem from the segmentation itself, the atlas used, or the smoothing kernel in case of voxelwise analyses). In fact, in our study the agreement dramatically improved when using the FreeSurfer lobar parcellation on the FSL and SPM segmentation, which would indicate that the difference between software is in large part related to the atlas. This could be due to either its definition of the anatomical structures or the registration method used to apply the atlas; the latter may affect primarily smaller structures such as the pallidum, which showed poor agreement in our study. Similarly, structures such as the nucleus accumbens may have smaller differences in intensity compared to the surrounding tissue on the 3DT1 images, and will be more prone to errors in segmentation for intensitybased segmentation methods. The lack of a generally accepted gold-standard for regional atrophy measurements including all brain structures, impedes on the assessment of false negative or false positive voxels.
Still the separation between MS patients and controls was good, with some structures showing consistently large effect sizes (see Cohen's d in Fig 4), and consistent selection of thalamus, putamen and amygdala as the most important regions for MS-control discrimination in the double-loop classifier. This points to a real disease effect, in that for early RRMS patients the DGM structures are atrophic; cortical structures may be affected at a later disease stage [25]. The thalamus has already been pointed to as different between patients and controls in MS [5,10,[34][35][36], as well as the amygdala [37] and the putamen [8,[38][39][40].
Although we found few differences between patients and controls in the cortex, we used the lobar cortical GM for comparison. Therefore there may be differences at the more detailed level of the gyri, as pointed out by the voxel-based analyses. Further studies with larger sample sizes in a longitudinal setting should assess the consistency at gyrus level.
For MS research the VBM analyses are run with covariates in the model, which allows for plausible results. In order to more specifically examine the influence of image processing and analysis, we also ran the analysis without covariates: for all methods more regions appeared different between patients and controls when covariates were included in the model. FSL in general detected more regions as significantly different. We were able to confirm the cortical regions already reported as different between patients and controls in a meta-analysis of VBM studies in MS [41]. Differences between results of cortical studies are apparent however, which could be due to the possible different GM atrophy evolution of different MS types, which we precluded by selecting only RRMS patients for our study. The processing errors may artificially appear to increase the differences between patients and controls by underestimating the grey matter volumes in patients with more severe pathology, and the analysis software may also be more likely to fail for these images: on half of the scans who had to be excluded from our sample the cortex had been severely underestimated with the pial surface running through the cortex; the patients were older and had higher WM lesion loads. Future studies should also look at longitudinal effects with different suitable software packages, as longitudinal measurements would provide important information for the monitoring of patients in clinical studies and may also have smaller errors [42].
Although the separation at the group level between patients and controls was relatively good for all methods, the differences between methods were not systematic, which led to differences in correlations with other measures. Given the low ICC between GM atrophy methods, correlations with clinical measures, such as cognition and EDSS scores, may be affected. We indeed observed varying correlations with overall cognition, and it is therefore clear that correlations with overall cognition reported in the literature have to be interpreted with some caution. In line with previous research, the right insula was most commonly related to cognition, whether age and gender were used as covariates or not [43] as well as the putamen bilaterally [44,45].
In conclusion, regional GM volumes obtained from these three popular image analysis methods can be very different, especially for cortical regions and to a lesser degree for DGM. While group-level separation between MS and controls is comparable between analysis methods, correlations with cognitive and clinical measures can vary, depending on the analysis method chosen. Correlations between regional GM volumes and clinical or cognitive variables in MS should be interpreted with caution.