Reproducibility of Brain Morphometry from Short-Term Repeat Clinical MRI Examinations: A Retrospective Study

Purpose To assess the intersession reproducibility of automatically segmented MRI-derived measures obtained with FreeSurfer in a group of subjects with normal-appearing MR images.
Materials and Methods After retrospectively reviewing a brain MRI database from our institute consisting of 14,758 adults, those subjects who had repeat scans and had no history of neurodegenerative disorders were selected for morphometry analysis using FreeSurfer. A total of 34 subjects were grouped by MRI scanner model. After automatic segmentation using FreeSurfer, label-wise comparison (involving area, thickness, and volume) was performed on all segmented results. An intraclass correlation coefficient was used to estimate the agreement between sessions. The Wilcoxon signed rank test was used to assess the population mean rank differences across sessions. Mean-difference analysis was used to evaluate the difference intervals across scanners. Absolute percent difference was used to estimate the reproducibility errors across the MRI models. The Kruskal-Wallis test was used to determine the across-scanner effect.
Results The agreement in segmentation results for area, volume, and thickness measurements of all segmented anatomical labels was generally higher in Signa Excite and Verio models when compared with Sonata and TrioTim models. Significant rank differences across sessions were found in some labels of different measures. Smaller difference intervals in global volume measurements were noted on images acquired by Signa Excite and Verio models. For some brain regions, significant MRI model effects were observed on certain segmentation results.
Conclusions Short-term scan-rescan reliability of automatic brain MRI morphometry is feasible in the clinical setting. However, since repeatability of software performance is contingent on the reproducibility of the scanner performance, the scanner performance must be calibrated before conducting such studies or before using such software for retrospective reviewing.


Introduction
Brain morphometry is an important assessment used in neuroscience research. Chronological changes in brain morphology may be related to aging, neurodegenerative disease, brain insults, or treatment effects [1][2][3][4][5]. Magnetic resonance imaging (MRI), a widely used non-invasive clinical diagnostic tool, can provide excellent brain tissue contrast. The high resolution of brain MRI also makes MRI-derived morphometric data ideal for clinical or research purposes. However, due to the complexity of brain morphology, accurate neuroanatomical measurement is difficult and time-consuming even for trained experts. For studies that require quantitative assessment of a large number of subjects or brain regions, automated methods play an important role in providing rapid and robust segmentation, while also minimizing human inter- and intra-rater variability.
Quantitative imaging biomarkers can serve as indicators of normal biological processes, pathogenic processes, or responses to therapeutic intervention [6,7]. Among them, brain MRI morphometry is a promising imaging biomarker for early detection or diagnosis of neurodegenerative and psychiatric disorders [8][9][10][11]. Several software packages have been developed that provide automated segmentation and measurement of brain morphometry [12][13][14][15].
Prospective longitudinal studies of morphometric changes within the brain require both accuracy and reproducibility of automated morphometric measures across sessions. While the accuracy validation of automated neuroanatomical measures against regional manual measurements has been performed [13,16], some studies have reported variable accuracy and reliability across sessions involving MRI-derived measures on the same subjects, and the sources of variance and bias have been investigated [17][18][19][20][21][22][23][24]. The reliability of brain morphometric measures across sessions can be influenced by subject-related, instrument-related, and image processing-related factors. Clinically, we usually need to retrospectively review previously acquired serial brain MRIs of patients to determine if any chronological change in brain morphometry has occurred. However, images are usually reported in a qualitative manner, so that there is little need for well-controlled image acquisition parameters other than image quality. It is also commonplace for hospitals to have multiple scanners and there is no clinical stipulation for follow-up scans to be performed on the same scanner, although such a practice is preferred. In consideration of the need for quantitative comparison of objective measures of disease progression or response to therapeutic treatment, it is important to evaluate the feasibility of using automated segmented measures from previously acquired clinical brain MRI scans for comparison purposes.
The goal of our study was to validate the cross-session agreement of automatically segmented MRI-derived measures obtained using FreeSurfer software in a group of subjects with normal-appearing MR images. While this study was performed using FreeSurfer, the methodology could also be applied to other software packages and other images.

Ethics Statement
This study was approved by the National Taiwan University Hospital Institutional Review Board (IRB: 201207033RIC) and the requirement to obtain informed consent was waived due to its retrospective nature. The patient records were anonymized and de-identified prior to analysis.

Subjects
From December 2011 to December 2013, 14,758 subjects (all older than 20 years of age) received brain MRI examinations in the Department of Medical Imaging at our hospital. These subjects included those referred from clinical departments (clinics, wards, and the emergency department) and volunteers for health screening. There were 2,299 subjects who had more than one brain MRI exam in this period. To limit the variance caused by instrument-related factors, only subjects who received repeat MRI studies on the same scanner were selected for this study.
A neuroradiologist (HML) with 30 years of experience carefully reviewed the medical history and images of all subjects. After excluding cases of dementia, psychiatric disorders, neoplastic disease, ischemic or hemorrhagic brain insults, infectious disease, prior surgical intervention, or poor image quality, 36 subjects remained who were reported as neuroradiologically normal on repeat brain MRI using the same scanner.

MRI Scanners
Five MRI scanners were used at our institute from 2011 to 2013. A routine brain MRI examination included a 3D T1 pulse sequence for every subject. The sequences used for morphometric analysis in each MRI scanner are listed below. Please note that, although these are the default parameters in the scanning protocols, the technicians sometimes made minor adjustments at the time of scanning for practical reasons.

FreeSurfer (http://freesurfer.net, Athinoula A. Martinos Center for Biomedical Imaging, Harvard-MIT, Boston) is a freely available open source software suite for processing and analyzing human brain MR images [12,13]. With the array of algorithms and tools provided by FreeSurfer, automated or semi-automated analysis and visualization of structural, connectional, and functional brain imaging data can be performed on a wide variety of hardware and operating systems. Such processing includes segmentation, registration, cortical surface reconstruction, quantification of segmented structures, longitudinal processing, fMRI analysis, and tractography. FreeSurfer allows visualization of different computed measures (such as thickness, area, or volume) over the brain surface after automatic parcellation.
The 64-bit Linux version of FreeSurfer 5.1.0 was used in this study. In brief, FreeSurfer's preprocessing and segmentation were executed with the command "recon-all -all". "recon-all" is a batch script that involves more than 30 reconstruction routines and can be summarized as follows (the details can be found at https://surfer.nmr.mgh.harvard.edu/fswiki/recon-all and http://surfer.nmr.mgh.harvard.edu/fswiki/ReconAllTableStableV5.1):
1. Motion correction and averaging.
All stages of cortical reconstruction were performed in a fully automated manner. To ensure the reproducibility of the computed data, no manual intervention was performed. After successful automated processing by FreeSurfer, snapshots of the final results were inspected to ensure that no obvious non-brain parts had been incorrectly segmented as brain tissue. Images of each subject were automatically processed with the longitudinal stream [25] in FreeSurfer to create a reliable unbiased within-subject template space and image, which were used in further processing steps, such as skull stripping, Talairach transforms, atlas registration, spherical surface mapping, and parcellations. Using the FreeSurfer longitudinal stream can significantly increase reliability and statistical power [25]. After that, tabulated data of the segmented results were collected using two FreeSurfer summarizing scripts (i.e., aparcstats2table and asegstats2table).
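The processing described above can be sketched as a short script that assembles the corresponding command lines. This is a minimal sketch under stated assumptions, not the authors' actual script: the subject identifiers, input file names, and output table names are hypothetical, the cross-sectional/base/long layout follows FreeSurfer's documented three-step longitudinal stream, and the commands are only built here, not executed.

```python
from typing import List


def longitudinal_commands(subject: str, scans: List[str]) -> List[List[str]]:
    """Build recon-all calls for the three-step longitudinal stream.

    `subject` and `scans` are hypothetical identifiers and input image paths.
    """
    tps = [f"{subject}_tp{i + 1}" for i in range(len(scans))]
    base = f"{subject}_base"
    cmds: List[List[str]] = []
    # Step 1: independent (cross-sectional) processing of each session.
    for tp, scan in zip(tps, scans):
        cmds.append(["recon-all", "-s", tp, "-i", scan, "-all"])
    # Step 2: unbiased within-subject template built from all time points.
    base_cmd = ["recon-all", "-base", base]
    for tp in tps:
        base_cmd += ["-tp", tp]
    cmds.append(base_cmd + ["-all"])
    # Step 3: longitudinal run of each time point against the template.
    for tp in tps:
        cmds.append(["recon-all", "-long", tp, base, "-all"])
    return cmds


def table_commands(long_ids: List[str]) -> List[List[str]]:
    """Build aparcstats2table / asegstats2table calls (output names are hypothetical)."""
    cmds: List[List[str]] = []
    for hemi in ("lh", "rh"):
        for parc in ("aparc", "aparc.a2009s"):  # Desikan-Killiany, Destrieux
            for meas in ("area", "thickness", "volume"):
                cmds.append(["aparcstats2table", "--subjects", *long_ids,
                             "--hemi", hemi, "--parc", parc, "--meas", meas,
                             "--tablefile", f"{hemi}.{parc}.{meas}.txt"])
    cmds.append(["asegstats2table", "--subjects", *long_ids,
                 "--meas", "volume", "--tablefile", "aseg.volume.txt"])
    return cmds


if __name__ == "__main__":
    steps = longitudinal_commands("sub01", ["scan1.nii", "scan2.nii"])
    long_ids = [f"sub01_tp{i}.long.sub01_base" for i in (1, 2)]
    for cmd in steps + table_commands(long_ids):
        print(" ".join(cmd))
```

Building the commands as argument lists keeps the sketch inspectable; in practice each list could be handed to a job scheduler or to `subprocess.run` on a machine where FreeSurfer is installed.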
After automatic segmentation and quantification, several types of measurement can be used for analysis, including area, volume, std (standard deviation of volume), thickness, thicknessstd (standard deviation of thickness), and meancurv (mean curvature). The neuroanatomical labels in FreeSurfer could be categorized according to the different measurements acquired (i.e., area, thickness, or volume), the sidedness (left, right), cortical parcellation atlas (Desikan-Killiany Atlas [aparc] or Destrieux Atlas [a2009s]), and anatomical locations.
In this study, area, volume, and thickness measurements of segmented anatomical labels from the two MRI scans for each subject were selected for label-wise longitudinal analysis. Among them, the volumes of cortical gray matter (label: CortexVol), deep gray matter (label: SubCortGrayVol), and white matter (label: CorticalWhiteMatterVol) were selected for analysis of the global measure of volume. For cortical area measurements, there were 150 labels for the Destrieux Atlas and 70 labels for the Desikan-Killiany Atlas. For cortical thickness measurements, there were 298 labels for the Destrieux Atlas and 138 labels for the Desikan-Killiany Atlas. For volume measurements, there were 148, 68, 76, and 55 labels for the Destrieux Atlas, Desikan-Killiany Atlas, subcortical white matter, and miscellaneous structures, respectively. In total, there were 1003 labels for all parcellation/segmentation types of area, thickness, and volume measurements generated by FreeSurfer.

Statistical Analysis
The intraclass correlation coefficient (ICC) [26][27][28] is a descriptive statistic commonly used to assess reliability when quantitative measurements are organized into groups. It can be interpreted as the proportion of the total variance that is due to variation between groups, and it measures the degree to which the units in the same group resemble each other. The ICC approaches 1.0 when within-target variation is small, suggesting consistent results from each measurement. The ICC is negative when the between-target variation is relatively small compared to the within-target variation. An examination or a test is considered reliable if consistent results can be obtained under similar methodology. There are two kinds of reliability: consistency and absolute agreement. Because our targets of interest are absolute measurements (e.g., area, thickness, and volume), the ICC was computed to assess the reliability of the results across sessions using a two-way random model with measures of absolute agreement. To approximate the actual distribution of ICC values, we drew 10,000 bootstrap resamples with an equal sample size (N = 20) for each model to compute means and 95% confidence intervals.
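The ICC computation described above can be illustrated with a short sketch. It assumes the single-measure form of the two-way random, absolute-agreement ICC (the text does not state whether single or average measures were used) and a percentile bootstrap over subjects; the function names and the example data are hypothetical, and the original analysis was carried out in R rather than Python.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def icc_a1(rows: Sequence[Sequence[float]]) -> float:
    """Two-way random, absolute-agreement, single-measure ICC.

    `rows` holds one row per subject and one column per session.
    """
    n, k = len(rows), len(rows[0])
    grand = sum(sum(r) for r in rows) / (n * k)
    row_means = [sum(r) / k for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(k)]
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum((rows[i][j] - row_means[i] - col_means[j] + grand) ** 2
                 for i in range(n) for j in range(k))
    ms_err = ss_err / ((n - 1) * (k - 1))
    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    # A zero denominator only occurs for degenerate data (no variation at all).
    return (ms_rows - ms_err) / denom if denom > 0 else math.nan


def bootstrap_icc(rows: List[Tuple[float, ...]], n_boot: int = 10_000,
                  sample_size: int = 20, seed: int = 0) -> Tuple[float, float, float]:
    """Mean and percentile 95% interval of the ICC over bootstrap resamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(n_boot):
        value = icc_a1(rng.choices(rows, k=sample_size))  # resample subjects
        if not math.isnan(value):
            values.append(value)
    cuts = statistics.quantiles(values, n=40)  # cut points every 2.5%
    return statistics.fmean(values), cuts[0], cuts[-1]


if __name__ == "__main__":
    # Hypothetical (session 1, session 2) values of one label for five subjects.
    pairs = [(10.0, 10.5), (12.0, 11.8), (15.0, 15.4), (9.0, 9.1), (20.0, 19.5)]
    print(icc_a1(pairs), bootstrap_icc(pairs, n_boot=1000))
```

Because absolute agreement penalizes a systematic offset between sessions, a constant shift of every subject's second measurement lowers this ICC even when the subjects keep exactly the same ordering.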
The Wilcoxon signed rank test is a non-parametric paired difference test. It is used to test the null hypothesis (i.e., the median of the differences between the paired observations is zero) versus the alternative hypothesis (i.e., the median is not zero). When the sample size is small, it can be used as an alternative to the paired t-test. After parcellation of the two sessions of brain MRI data, comparison of the paired segmented results was performed using the Wilcoxon signed rank test at 5% and 1% levels of significance (α = 0.05 and 0.01, respectively).
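For readers who want to see the mechanics of the test, the following is a minimal sketch of a two-sided exact Wilcoxon signed rank test. It is an illustration under assumptions, not the routine used in the study: zero differences are dropped, tied absolute differences receive average ranks, and the p value is obtained by enumerating the sign distribution conditional on the observed ranks. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def signed_rank_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, two-sided exact p value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks2 = [0] * n  # ranks multiplied by 2 so that tied averages stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks2[order[t]] = i + j + 2  # 2 * mean of ranks (i + 1)..(j + 1)
        i = j + 1
    w2 = sum(r for r, d in zip(ranks2, diffs) if d > 0)
    total2 = sum(ranks2)
    # counts[s] = number of sign assignments whose doubled positive-rank sum is s.
    counts = [0] * (total2 + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total2, r - 1, -1):
            counts[s] += counts[s - r]
    n_assign = 2 ** n
    p_low = sum(counts[: w2 + 1]) / n_assign
    p_high = sum(counts[w2:]) / n_assign
    return w2 / 2, min(1.0, 2 * min(p_low, p_high))


if __name__ == "__main__":
    # Hypothetical paired measurements of one label from two sessions.
    session1 = [2.51, 2.48, 2.60, 2.55, 2.47, 2.62]
    session2 = [2.49, 2.50, 2.57, 2.51, 2.46, 2.58]
    print(signed_rank_exact(session1, session2))
```

With roughly 1000 labels tested per scanner, some labels are expected to fall below α = 0.05 by chance alone, which is one reason to report the stricter 1% level alongside the 5% level.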
Bland-Altman (BA) analysis is a graphical method used to analyze the agreement between two measurement methods [29,30]. This method plots the difference against the average of each data pair from the two methods of measurement. The 95% limits of agreement (the mean difference ± 1.96 standard deviations of the differences) were calculated and plotted for visual judgment of agreement. About 95% of the differences between the two measurements are expected to fall within these limits. A smaller range between the two limits indicates better agreement, but the acceptable range depends on the clinical context. In this study, Bland-Altman analysis was used for a global measure of selected brain volume.
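The quantities behind a Bland-Altman plot can be computed in a few lines. The sketch below uses the conventional limits of agreement (mean difference ± 1.96 sample standard deviations of the differences); the function name and the example volumes are hypothetical.

```python
import statistics
from typing import Dict, List, Sequence


def bland_altman(session1: Sequence[float], session2: Sequence[float]) -> Dict[str, object]:
    """Mean difference, 95% limits of agreement, and the points to plot."""
    diffs: List[float] = [a - b for a, b in zip(session1, session2)]
    means: List[float] = [(a + b) / 2 for a, b in zip(session1, session2)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return {
        "bias": bias,
        "lower": bias - 1.96 * sd,
        "upper": bias + 1.96 * sd,
        "points": list(zip(means, diffs)),  # x = pair average, y = pair difference
    }


if __name__ == "__main__":
    # Hypothetical cortical gray matter volumes (cm^3) from two sessions.
    first = [452.0, 471.5, 430.2, 498.7, 465.1]
    second = [449.8, 473.0, 428.9, 501.2, 462.7]
    result = bland_altman(first, second)
    print(result["bias"], result["lower"], result["upper"])
```

A scanner whose limits of agreement are wide relative to the expected disease-related volume change would be a poor choice for follow-up measurements, whatever its mean difference.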
A dimensionless measure of absolute percent change in the segmentation result for a structure with respect to its average was used to estimate the variability error; that is, the across-session variability error was the absolute difference between the two sessions' measurements divided by their mean, expressed as a percentage.
The Kruskal-Wallis test, a non-parametric method for comparing two or more independent samples, was used to test the MRI model effect on the ICC and the variability error. All statistical analyses were performed using R version 3.1.2 [31].
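A minimal sketch of these two steps follows. The percent-difference function implements the verbal definition above (absolute difference divided by the mean of the two sessions, times 100); the Kruskal-Wallis function computes only the tie-corrected H statistic and its degrees of freedom, leaving the chi-square p value to a statistics package. The function names and example numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def percent_difference(v1: float, v2: float) -> float:
    """Absolute difference relative to the mean of the two sessions, in percent."""
    return 100.0 * abs(v1 - v2) / ((v1 + v2) / 2.0)


def average_ranks(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Ranks with ties averaged, plus the size of each group of tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j + 2) / 2  # mean of ranks (i + 1)..(j + 1)
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Tie-corrected H statistic and degrees of freedom (number of groups - 1)."""
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks, tie_sizes = average_ranks(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 * weighted / (n_total * (n_total + 1)) - 3.0 * (n_total + 1)
    correction = 1.0 - sum(t ** 3 - t for t in tie_sizes) / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all observations are identical")
    return h / correction, len(groups) - 1


if __name__ == "__main__":
    # Hypothetical per-subject variability errors (%) grouped by scanner model.
    errors_by_model = [[0.8, 1.1, 0.9], [2.4, 3.0, 1.9], [1.0, 0.7, 1.3]]
    print(percent_difference(452.0, 449.8), kruskal_wallis_h(errors_by_model))
```

H is compared against a chi-square distribution with the returned degrees of freedom; with four scanner models, as in this study, that is 3 degrees of freedom.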

Subjects and MRI
After careful exclusion of any known brain disease or morphologic abnormality, 36 subjects remained who had repeat brain MRI scans using the same machine. The MRI results of Signa HDx were not analyzed in this study because only two subjects had repeat scans during this period. A total of 34 subjects (15 males and 19 females) representing 68 sessions of MRI scanning were included in this study. Their median age was 54.5 years (range: 34-85 years) (Table 1).

Intraclass Correlation Coefficient
The label-wise ICCs from the repeat MRI scans are shown in Fig 1. For

Wilcoxon Signed Rank Test
The p values of the Wilcoxon signed rank test for each label of each measurement produced by FreeSurfer are summarized in Fig 2. Among all labels, significant differences were found across repeat MRI scans in 121 (12.1%) and 23 (2.3%) labels for the 5% and 1% significance level tests, respectively. The labels with significant differences (α = 0.01) are summarized in Table 2 and Fig 3, especially some measurements around the bilateral central gyri in the Verio model and scattered labels in the Signa Excite model.

Bland-Altman Analysis (Global Measure of Volume)
In addition to the ICC and Wilcoxon signed rank test, selected global measures of brain volume (cortical gray matter, subcortical gray matter, and white matter) were also analyzed using the Bland-Altman (mean-difference) plot of each pair of measurements (Fig 4 and Table 3). Greater mean differences and wider agreement intervals were observed in the TrioTim model.

Across-Session Variability
The across-session variability errors of all segmentation results were computed. Among them, the results of three global measures of brain volume (cortical gray matter, subcortical gray matter, and white matter) are summarized in Table 4. For each MRI scanner, the mean reproducibility error was estimated as a mean across all subjects. The last column shows the reproducibility error of each structure averaged across MRI models. Although no significant MRI model effect was found among these selected global measures, a significant MRI model effect was shown on 125 (12.4%) and 14 (1.4%) labels among all segmentation results for the 5% and 1% significance level tests, respectively.

Discussion
This study evaluated the automatically segmented results obtained using FreeSurfer on a group of normal subjects who received repeat MRI scans over a short time interval (up to 15.5 months). Unlike most prospective studies, in which careful tuning of experimental conditions is expected, the subjects in this study were retrospectively selected from a clinical database. Variable degrees of reproducibility were noted using different MRI machines, shown as a wide range of ICCs (Fig 1). Although this is a study of short-term repeat MRI scans in normal subjects, significant differences were still found in some labels of different measures. Different variances were found in the results of the BA analysis from different machines.
Comparison of longitudinal follow-up results from different MRI systems is possible; however, there are some limitations regarding image quality, which may be influenced by acquisition sequences, magnetic field strength, scanner software, or the type of scanner used [18,22]. Although it has been suggested that using registration-based algorithms can provide better reproducibility, for longitudinal follow-up, MRI acquisitions should be performed at the same imaging site [18]. The automatic segmentation software used in this study, FreeSurfer, is implemented with segmentation-based algorithms. In this study, repeat longitudinal follow-up examinations in a group of neuroradiologically normal subjects were performed on the same MRI scanner; however, less agreement and large mean differences in the morphometric data across sessions were found with certain MRI scanners, as shown in Figs 1 and 4. While the vast majority of ICCs for each label from repeat scans were high in scans from the Signa Excite and Verio MRI scanners, a certain portion of low ICCs were noted in the Sonata and TrioTim models.
Varying degrees of agreement between scanner types were also found in this study using Bland-Altman analysis and, similarly, an MRI model effect on the absolute percent difference was revealed in segmentation results using the Kruskal-Wallis test. Individual comparison with normal databases is important for the purposes of reporting, as the confidence interval of normal data can serve as a reference to improve the accuracy of interpretation. While many different scanners could be used to create an MRI database, considerable across-scanner variability might exist and, thus, it is necessary to quantify the differences and to determine the significance of the variability using the reference data.
Although both Signa Excite (GE) and Sonata (Siemens) are 1.5T MRI models, the agreement and variance in generated FreeSurfer results were quite different between models. Better agreement and smaller variance could be seen in the scans from the Signa Excite system. A difference in scanning parameters or vendors, and magnetic field inhomogeneity due to aging of the machine, may have contributed to these differences. On the other hand, although TrioTim and Verio are both 3T MRI models from the same manufacturer (Siemens), the ability to reproduce quantifiable results from the two machines differed significantly. While there are various accelerating methods used for high-quality high-resolution 3D T1 images, MP-RAGE and IR-SPGR are the sequences of choice for brain morphometry because of their superior signal-to-noise ratio and gray-white matter contrast. The selection of acquisition sequences for regular MRI scanning at a clinical site relies on a balance between scanning time, tissue contrast, resolution, sensitivity to motion, and availability of the sequence. For a clinical examination, the sequences are optimized to detect abnormalities rather than to obtain the best segmentation results by automated processing software. It is common not to use MP-RAGE or IR-SPGR for regular brain MRI scanning. Although the sequences used in this study (SPGR, FLASH, TurboFLASH) are alternatives to MP-RAGE/IR-SPGR [32], the gray-white matter contrast may affect the ability of the segmentation software to perform optimally. Thus, it is crucial to choose an optimized sequence when using brain morphometry metrics as a quantitative imaging biomarker in clinical trials; otherwise, suboptimal segmentation may influence the accuracy of the results.
Results of neuroanatomical labeling using two cortical parcellation systems, the Desikan-Killiany Atlas [33] and the Destrieux Atlas [34], were provided by FreeSurfer. Each cortical vertex of the brain was classified using gyral-based mapping for the Desikan-Killiany Atlas, which contains 68 gyral-based labels, 34 for each hemisphere. For the Destrieux Atlas, each vertex can be categorized as sulcal or gyral, and then subparcellated into 148 labels, 74 for each hemisphere. Since there were more labels and a more complicated classification system for use with the Destrieux Atlas, it was more difficult to maintain accuracy and consistency in segmentation results. Compared with the Desikan-Killiany Atlas, the increased number of labels used by the Destrieux Atlas had a significant effect, based on the Wilcoxon signed rank test.
The MRI radiographers are trained to acquire high quality scans for lesion detection in their daily practice. However, it is possible that a more diligent MRI technician ensures higher quality scans on one particular scanner, while others accept minor quality degradation as long as the quality is good enough for a radiologist to make a diagnosis. After reviewing the original MR images of the outliers shown in Fig 4, a small amount of blurring caused by motion artifact was found in the image of one outlier; this can be avoided by an experienced radiographer in a prospective quantitative brain morphometric study. Given the retrospective nature of this clinical study, between-scanner differences could also be partly attributable to inter-radiographer variability.
The ultimate goal of medical researchers is to apply the observed findings to the clinical situation, and such an application relies on accuracy and reproducibility of the segmented results. Appreciable differences in subcortical brain volumes in within-a-day repeat MRI scans on the same subject could still be found, even when using the same scanner and acquisition parameters, because of changes in image orientation, pre-scan parameters, and magnetic field instability [19]. Similar scan-rescan measurement differences and variability in scanning parameters/environments have been reported [22,35]. In the clinical setting, more variability will be encountered and, therefore, a preliminary validation of data accuracy and reproducibility is necessary. Otherwise, false positive labels of significant differences will be reported if an unreliable MRI system is used. With the implementation of automated segmentation methods, we can quantify and determine the significance of the effect caused by scanner-related factors. Furthermore, we can use the same processing pipeline to investigate whether these fundamental differences can be removed or minimized. An estimated scanner-related effect should be used for the correction of segmentation results.
Our study had several limitations, including its retrospective nature. In addition, this study was conducted at a single institution. Future prospective studies performed at multiple institutions involving larger cohorts are necessary to confirm our results. The inclusion of "normal" subjects relied on our neuroradiologist's judgment regarding the absence of pathological findings on the MR images; however, it is possible that subjects with neurodegenerative disease did not exhibit pathological findings but, rather, morphometric changes within the brain. Therefore, a prospective validation study on normal subjects should be conducted. Theoretically, segmentation of isometric voxels should be less sensitive to changes in image orientation compared with non-isometric voxels. Non-isometric or near-isometric voxel sizes were used in this study, and the impact of voxel isometry on the accuracy and reproducibility of the segmentation should be further explored. In this study, the effect of MRI scanners was determined by detecting differences in each individual subject, but certain regions of statistical significance in a particular individual may not be clinically relevant. Different pathologic conditions, such as neurodegenerative diseases, may have variable effects on brain morphology, and the size of the effect should be determined by group analysis [36][37][38]. Furthermore, since discrepancies in the results derived from different segmentation methods have been shown to reach the same order of magnitude as volume changes in disease [18,39], the effect of different methods on the reliability of the clinical MRI images should also be investigated in a future study.

Conclusion
Short-term scan-rescan reproducibility of automatic brain MRI morphometry is feasible in the clinical setting, but validation of scan-rescan reproducibility for each MRI scanner is suggested before conducting such a study, or prior to using such software for retrospective reviewing.

Supporting Information
S1 File. The compressed archive containing the demographic data and segmentation results of thickness, area, and volume measurements by FreeSurfer. (ZIP)