Whole brain and deep gray matter atrophy detection over 5 years with 3T MRI in multiple sclerosis using a variety of automated segmentation pipelines

Background
Cerebral atrophy is common in multiple sclerosis (MS) and selectively involves gray matter (GM). Several fully automated methods are available to measure whole brain and regional deep GM (DGM) atrophy from MRI.
Objective
To assess the sensitivity of fully automated MRI segmentation pipelines in detecting brain atrophy in patients with relapsing-remitting (RR) MS and normal controls (NC) over five years.
Methods
Consistent 3D T1-weighted sequences were acquired on a 3T GE unit in 16 mildly disabled patients with RRMS and 16 age-matched NC at baseline and at five years. All patients received disease-modifying immunotherapy on-study. Images were processed with two pipelines to assess whole brain atrophy [brain parenchymal fraction (BPF) from SPM12; percentage brain volume change (PBVC) from SIENA] and two other pipelines (FSL-FIRST; FreeSurfer) to assess DGM atrophy (thalamus, caudate, globus pallidus, putamen). MRI change was compared between groups by two-sample t-tests. Changes in the Expanded Disability Status Scale (EDSS) and timed 25-foot walk (T25FW) were assessed by repeated measures proportional odds models.
Results
Using FreeSurfer, the MS group had ~10-fold greater on-study caudate volume loss than NC (mean decrease 0.51 vs. 0.05 ml, p = 0.022). In contrast, caudate atrophy was not detected by FSL-FIRST (mean decrease 0.21 vs. 0.12 ml, p = 0.53). None of the other pipelines showed any difference in volume loss between groups, for whole brain or regional DGM atrophy (all p>0.38). The MS group showed on-study stability on EDSS (p = 0.47) but slight worsening of T25FW (p = 0.054).
Conclusions
In this real-world cohort of mildly disabled treated patients with RRMS, we identified ongoing atrophy of the caudate nucleus over five years compared to healthy controls, despite the lack of any significant whole brain atrophy. The detectability of caudate atrophy was dependent on the MRI segmentation pipeline employed. These findings underscore the increased sensitivity gained when assessing DGM atrophy in monitoring MS.

The objective of our study was to assess the sensitivity of a range of fully automated MRI segmentation pipelines in measuring whole brain and regional DGM volumes, and in detecting their atrophy over five years, in patients with relapsing-remitting multiple sclerosis (RRMS) and normal controls (NC) using high-resolution 3T MRI scans.

Ethics statement
All participants provided written informed consent to participate in the study. This consent procedure was approved by our ethics committee. The Partners Human Research Committee approved this study.

Subjects and neurologic examination
Demographic and clinical characteristics are summarized in Table 1 and in the supporting information (S1 File). Sixteen patients with MS and 16 NC underwent baseline and 5-year follow-up scans. MS patients met the International Panel criteria for either relapsing MS or a clinically isolated syndrome (CIS) [30]. All patients were examined by an MS specialist neurologist, including assessment of the Expanded Disability Status Scale (EDSS) [31] score and the timed 25-foot walk (T25FW) [32]. All patients received disease-modifying immunotherapy during the observation period, as selected and prescribed by their treating physician in routine clinical care.

MRI acquisition
All subjects underwent brain MRI at 3T (Signa Excite; GE Healthcare). A consistent coronal 3D T1-weighted modified driven equilibrium Fourier transform (MDEFT) pulse sequence was acquired at baseline and at the 5-year follow-up.

Reproducibility experiment and scanner upgrade
During the study, by decision of the hospital, the scanner underwent a software and hardware upgrade (gradient coil amplifiers, RF receiver system and software). This was outside our control; the study was not designed around a scanner upgrade. To investigate scanner effects before and after the upgrade, we also performed a reliability study. Eleven subjects (4 MS and 7 NC) underwent scan-rescan pairs with an average of 7 days between scans (range 0 to 42 days), without an intervening upgrade. In addition, to assess the effect of the upgrade, 3 subjects (2 MS and 1 NC) underwent a scan-rescan pair spanning the scanner upgrade (one scan before and one after), with an average of 51 days between scans (range 34 to 78 days).

Statistical analysis
All statistical analyses were performed with Stata/IC (v. 14.2, StataCorp LLC, College Station, TX). The demographic characteristics (Table 1) of the MS patients and normal controls were compared using a chi-squared test for categorical variables and a t-test for continuous variables. A paired t-test was used to estimate the mean within-person change (baseline vs. follow-up) in each MRI measure, separately in the normal controls and the MS patients. The mean within-person change was compared between the two groups using a two-sample t-test. The change in the EDSS and T25FW was assessed by repeated measures proportional odds models with a patient-specific random effect to account for within-patient correlation.

Table 2 note: Data are shown as means with standard deviations. MS = multiple sclerosis; NC = normal controls; PBVC = percentage brain volume change between baseline and follow-up (a negative number indicates brain volume loss over time); SIENA = structural image evaluation, using normalization, of atrophy, v. 5.0.
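The two t-test comparisons described above can be sketched in Python with only the standard library. This is an illustrative sketch, not the authors' Stata code: it returns test statistics and degrees of freedom only, it assumes the pooled (equal-variance) form of the two-sample test because the text does not state which variant was used, and the example volumes are made up. A p-value would come from the t distribution with the returned degrees of freedom (for example via `scipy.stats.t.sf`).

```python
import math
import statistics


def within_person_change(baseline, follow_up):
    """Follow-up minus baseline for each subject (negative = volume loss)."""
    return [f - b for b, f in zip(baseline, follow_up)]


def paired_t(baseline, follow_up):
    """Paired t statistic: one-sample t of the within-person changes against 0."""
    changes = within_person_change(baseline, follow_up)
    n = len(changes)
    se = statistics.stdev(changes) / math.sqrt(n)
    return statistics.mean(changes) / se, n - 1


def two_sample_t(changes_a, changes_b):
    """Pooled-variance two-sample t statistic comparing mean change between groups."""
    na, nb = len(changes_a), len(changes_b)
    pooled_var = ((na - 1) * statistics.variance(changes_a)
                  + (nb - 1) * statistics.variance(changes_b)) / (na + nb - 2)
    se = math.sqrt(pooled_var * (1 / na + 1 / nb))
    diff = statistics.mean(changes_a) - statistics.mean(changes_b)
    return diff / se, na + nb - 2


if __name__ == "__main__":
    # Hypothetical caudate volumes (ml); NOT study data.
    ms_base, ms_follow = [7.1, 6.8, 7.4, 6.9], [6.6, 6.4, 6.8, 6.5]
    nc_base, nc_follow = [7.3, 7.0, 7.2, 7.5], [7.3, 6.9, 7.2, 7.4]
    print("MS within-group (t, df):", paired_t(ms_base, ms_follow))
    ms_change = within_person_change(ms_base, ms_follow)
    nc_change = within_person_change(nc_base, nc_follow)
    print("MS vs NC change (t, df):", two_sample_t(ms_change, nc_change))
```

Comparing the groups on within-person change scores, rather than on follow-up volumes alone, removes each subject's baseline volume from the between-group comparison.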

Whole brain volume change: Patients vs. controls
As shown in Table 2, Fig 3 and Fig 4, the MS group had lower BPF than the NC group at both baseline and follow-up, as measured by the SPM12 segmentation pipeline (both p<0.01). However, the SPM12 pipeline did not show any significant BPF change during the observation period in either group (both p>0.3, Table 2). Furthermore, neither the SPM12 nor the SIENA pipeline showed a significant difference in on-study whole brain volume change between the MS and NC groups (both p>0.4, Table 2). Thus, neither analysis pipeline was able to demonstrate significant whole brain atrophy in the MS group as compared to the NC group.

Deep gray matter volume change: Patients vs. controls
As shown in Table 3, Fig 5 and Fig 6, with regard to within-group on-study change, the FSL-FIRST segmentation pipeline detected significant atrophy over the 5-year period in the putamen (MS group), globus pallidus (both groups), and total DGM (both groups) (all p<0.05). The FreeSurfer segmentation pipeline detected significant atrophy over the 5-year period in the caudate (MS group) and globus pallidus (MS group) (both p<0.05, Table 3). When the on-study change between baseline and follow-up was compared between groups, the MS group had ~10-fold greater caudate volume loss than the NC group as measured by the FreeSurfer pipeline (mean decrease 0.51 vs. 0.05 ml, p = 0.022). Thus, both software analysis pipelines were able to demonstrate significant within-group regional DGM atrophy. However, the topography of atrophy detection differed between pipelines. Furthermore, only one of the pipelines showed a higher on-study rate of atrophy in the MS group as compared to the NC group.
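To make two of the quantities above concrete, the following sketch computes a brain parenchymal fraction and the fold difference in mean caudate volume loss. The BPF definition used here, (GM + WM) / (GM + WM + CSF) from tissue-class volumes, is a common convention that we assume for illustration; the exact SPM12 procedure is not detailed here. The tissue volumes are hypothetical; only the 0.51 and 0.05 ml mean decreases come from the text.

```python
def brain_parenchymal_fraction(gm_ml, wm_ml, csf_ml):
    """Assumed BPF definition: parenchymal volume divided by intracranial volume,
    with intracranial volume approximated as GM + WM + CSF."""
    parenchyma = gm_ml + wm_ml
    return parenchyma / (parenchyma + csf_ml)


def fold_difference(mean_loss_a, mean_loss_b):
    """Ratio of two mean volume decreases (e.g. MS vs. NC)."""
    return mean_loss_a / mean_loss_b


# Hypothetical tissue volumes (ml), not study data.
print(brain_parenchymal_fraction(600.0, 450.0, 350.0))  # 0.75
# Mean FreeSurfer caudate decreases reported in the text (ml).
print(round(fold_difference(0.51, 0.05), 1))  # 10.2
```

A ratio of about 10.2 is what the text summarizes as a "~10-fold" difference; because BPF is a ratio, it is less sensitive to head size than absolute brain volume.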

Clinical change on-study
Within the MS group, there was no significant change in EDSS scores between baseline and follow-up (p = 0.47), and the T25FW showed only slight worsening (p = 0.054).

Table 4 shows scan-rescan variability, both without and with an intervening scanner upgrade. For each data column in the table, 8/11 (72.7%) of the coefficients of variation (COVs) were less than 1%, indicating high reliability. This held both for the scan-rescan pairs without an intervening upgrade and for the pre- vs. post-upgrade scan-rescan experiment.
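How the scan-rescan COV was computed is not spelled out here. A common approach, assumed in this sketch, is to take each subject's two measurements, divide their sample standard deviation by their mean (times 100), and average across subjects. The helper `fraction_below` mirrors the "8/11 below 1%" summary; all volumes in the example are hypothetical.

```python
import statistics


def pair_cov_percent(scan_ml, rescan_ml):
    """COV (%) of one scan-rescan pair: sample SD / mean * 100."""
    pair = [scan_ml, rescan_ml]
    return statistics.stdev(pair) / statistics.mean(pair) * 100


def mean_cov_percent(pairs):
    """Mean COV (%) across subjects, one (scan, rescan) tuple per subject."""
    return statistics.mean(pair_cov_percent(a, b) for a, b in pairs)


def fraction_below(covs, threshold=1.0):
    """Share of COV values below a threshold (e.g. the 1% cut-off in the text)."""
    return sum(c < threshold for c in covs) / len(covs)


# Hypothetical thalamus volumes (ml) for three subjects, not study data.
pairs = [(15.2, 15.3), (14.8, 14.8), (16.1, 15.9)]
print(round(mean_cov_percent(pairs), 3))
```

For example, 8 of 11 measures below the threshold gives a fraction of 8/11, or 72.7%.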

Discussion
In this 5-year "real world" 3T MRI study of mildly disabled treated patients with RRMS, although patients began the study with whole brain atrophy compared to normal controls, there was no significant ongoing whole brain atrophy on-study compared to healthy controls. However, one DGM structure, the caudate nucleus, showed significant atrophy in the MS group over 5 years compared to the rate of volume loss in normal controls. The detectability of caudate atrophy was dependent on the type of automated MRI segmentation pipeline employed. In general, the volumetric DGM measures were not interchangeable between the two automated regional segmentation pipelines. Our findings underscore the increased sensitivity gained when assessing DGM vs. global (whole brain) atrophy in monitoring MS. Our observation that DGM atrophy was more sensitive to change than whole brain atrophy is supported by a growing body of evidence. It is well known that the GM is affected early and selectively in the disease course of MS [7,11,33-38]. In addition, several studies indicate that the progression of GM pathology is not necessarily dependent on the extent of WM lesions, supporting the concept that GM degeneration proceeds in a manner largely independent of WM inflammation.

Table 3 note: Data are shown as means with standard deviations. DGM = cerebral subcortical deep gray matter; MS = multiple sclerosis; NC = normal controls; total DGM = thalamus + caudate + putamen + globus pallidus; * p<0.05.
Our findings are consistent with results indicating that subcortical DGM atrophy, but not cortical atrophy, develops early in people with RRMS [11,39]. The relevance of DGM has been shown in several studies, such as the observation that selective regional GM atrophy, but not global atrophy, is an early risk factor for disease progression [40-43]. These results have implications for the planning of clinical trials aiming to demonstrate neuroprotective effects of putative MS therapies, in which DGM atrophy may have a role in evaluating treatment outcome [6,17,44].
For the measurement of regional DGM volumes, our study showed different results depending on the segmentation pipeline employed. Both software analysis pipelines were able to demonstrate significant within-group regional DGM atrophy. However, the topography of atrophy detection differed between pipelines, and only one of the pipelines showed a higher rate of atrophy in the MS group than in the NC group. In a recent study, MSmetrix, FreeSurfer, FSL and SPM were compared and showed differences in brain volumetric segmentation [27]. Another recent study [26] showed that GM volumes obtained from FreeSurfer, FSL and SPM were divergent, especially for cortical regions, and that these differences affected the strength of correlations between regional GM volumes and clinical/cognitive variables. Recent research also showed a similar discordance in DGM volume measurements between the FSL-FIRST and FreeSurfer pipelines [45]. These studies are consistent with our findings.
Table 4. Volumetric measures from MRI: Scan-rescan reliability. Columns: pipeline; scan-rescan mean COV (%) without an intervening scanner upgrade [MS (n = 4), NC (n = 7)]; scan-rescan mean COV (%) with an intervening scanner upgrade [MS (n = 2), NC (n = 1)].

The reasons for such differences between software pipelines are difficult to pinpoint, since the pipelines employ fundamentally different methods. FSL-FIRST [46] registers the individual scan to a standard-space brain (derived from the MNI-152 atlas), models the outer surface of each DGM structure as a mesh, and finally assigns each voxel in the image the appropriate label to indicate the structure. This takes into account local variations in structure surface shape, as well as the presence of neighboring structures. FreeSurfer [47,48], on the other hand, performs volume analysis for the DGM structures in native space. Differences between these software packages could arise from the segmentation itself, the atlas used, or the smoothing kernel used in voxelwise analyses. The lack of a generally accepted gold standard for regional atrophy measurement across all brain structures limits the assessment of false-negative or false-positive voxels. The major implication of these results is the need to maintain consistency in the type of analysis pipeline employed, to avoid potentially severe biases that may occur when pooling data from different methods [49]. In conclusion, DGM volumes obtained from different image analysis methods can be very different.
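FSL-FIRST, as described above, ends by assigning each voxel a structure label. A simple way to turn such a label image into structure volumes is to count voxels per label and multiply by the voxel volume; this sketch assumes a flattened integer label array and known voxel dimensions. The label codes follow the FreeSurfer colour look-up-table convention for left-hemisphere structures (10 thalamus, 11 caudate, 12 putamen, 13 pallidum), treated here as an assumption. Real pipelines may apply further corrections (for example partial-volume handling), so this is a simplification, and the toy label map is hypothetical.

```python
from collections import Counter

# Assumed label codes (FreeSurfer look-up-table convention, left hemisphere only).
LEFT_DGM_LABELS = {10: "thalamus", 11: "caudate", 12: "putamen", 13: "globus_pallidus"}


def structure_volumes_ml(labels, voxel_dims_mm, label_names=LEFT_DGM_LABELS):
    """Volume (ml) of each labelled structure: voxel count * voxel volume.

    labels: flat iterable of integer labels, one per voxel (0 = background).
    voxel_dims_mm: (dx, dy, dz) voxel size in millimetres.
    """
    dx, dy, dz = voxel_dims_mm
    voxel_ml = dx * dy * dz / 1000.0  # 1 ml = 1000 mm^3
    counts = Counter(labels)  # missing labels count as 0
    return {name: counts[code] * voxel_ml for code, name in label_names.items()}


def total_dgm_ml(volumes):
    """Total DGM = thalamus + caudate + putamen + globus pallidus."""
    return sum(volumes.values())


# Toy label map with 1 mm isotropic voxels (hypothetical).
toy_labels = [0, 0, 10, 10, 10, 11, 11, 12, 13, 0]
vols = structure_volumes_ml(toy_labels, (1.0, 1.0, 1.0))
print(vols)
print(total_dgm_ml(vols))
```

Because volume is a voxel count, any difference in where two pipelines place a structure boundary translates directly into a volume difference, which is one way the discordance discussed above can arise.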
There are several limitations of our study to be considered. First, care should be exercised in interpreting these results because of the relatively small sample size. Second, our study was limited to mildly affected individuals with RRMS. The results may not apply to other stages of MS, such as more active or progressive patients, including those with primary or secondary progressive forms of the disease. Finally, due to limited power, we could not properly evaluate the clinical relevance of our results, such as how these volumetric biases would affect clinical-MRI correlations or the assessment of therapeutic response.
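To illustrate why 16 subjects per group limits power, the sketch below applies the standard normal-approximation sample-size formula for a two-sided, two-sample comparison of means, n per group = 2(z_{1-α/2} + z_{1-β})² σ² / δ². The effect sizes are hypothetical, not estimates from this study, and the normal approximation slightly underestimates the n that an exact t-test calculation would require.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided two-sample test of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})**2 * sd**2 / delta**2,
    rounded up to the next whole subject.
    """
    std_normal = NormalDist()  # mean 0, SD 1
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / delta ** 2)


# Hypothetical standardized effects (delta / sd), not study estimates.
for effect in (1.0, 0.5):
    print(effect, n_per_group(delta=effect, sd=1.0))
```

Under these assumptions, about 16 subjects per group give 80% power only for a between-group difference as large as one standard deviation; halving the effect size roughly quadruples the required sample (63 per group).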
Supporting information
S1 File. Spreadsheet for segmentation data. The spreadsheet includes all study subjects' data, including demographic and clinical characteristics, as well as segmentation data calculated from all four pipelines (SPM12, SIENA, FSL-FIRST and FreeSurfer).