Accuracy and Reliability of Automated Gray Matter Segmentation Pathways on Real and Simulated Structural Magnetic Resonance Images of the Human Brain

Automated gray matter segmentation of magnetic resonance imaging data is essential for morphometric analyses of the brain, particularly when large sample sizes are investigated. However, although detection of small structural brain differences may fundamentally depend on the method used, both accuracy and reliability of different automated segmentation algorithms have rarely been compared. Here, performance of the segmentation algorithms provided by SPM8, VBM8, FSL and FreeSurfer was quantified on simulated and real magnetic resonance imaging data. First, accuracy was assessed by comparing segmentations of twenty simulated and 18 real T1 images with corresponding ground truth images. Second, reliability was determined in ten T1 images from the same subject and in ten T1 images of different subjects scanned twice. Third, the impact of preprocessing steps on segmentation accuracy was investigated. VBM8 showed a very high accuracy and a very high reliability. FSL achieved the highest accuracy but demonstrated poor reliability, and FreeSurfer showed the lowest accuracy but high reliability. A universally valid recommendation on how to implement morphometric analyses is not warranted due to the vast number of scanning and analysis parameters. However, our analysis suggests that researchers can optimize their individual processing procedures with respect to final segmentation quality and exemplifies adequate performance criteria.


Introduction
Automated brain segmentation algorithms segment a structural magnetic resonance imaging (MRI) image into different tissue classes. In general, an MRI image is segmented into gray matter, white matter, and cerebrospinal fluid. Based on this segmentation, methods are available to calculate several neuroanatomical measures, for example gray matter volume, gray matter density, cortical thickness, or cortical curvature. Researchers use these measures to investigate differences in brain structure between groups or to investigate changes in brain structure over time. Phenomena that are investigated include learning processes [1], language lateralization [2], psychosis [3], mild cognitive impairment [4,5], aphasia [6], alexithymia [7], post-traumatic stress disorder [8], Huntington disease [9,10], depression [11], autism [12,13], and schizophrenia [14]. The use of automated segmentation algorithms is desirable, as these algorithms are (i) much faster than manual segmentations and (ii) user independent, that is, they do not depend on expert knowledge in neuroanatomy. However, significant challenges exist, as differences in brain structure between groups or changes within subjects are often very subtle (see, e.g., [15,16]). Therefore, it is crucially important that (i) automated segmentation algorithms are able to precisely determine the exact amount of, for example, gray matter tissue in an MRI image (cf. accuracy), and that (ii) they produce similar results when applied to different images of the same person (cf. reliability). At the moment, however, too little is known about the accuracy and reliability of current automated segmentation algorithms.
Clark et al. [17] addressed the problem of reliability of different automated segmentation algorithms. Combining different algorithms for intensity correction, skull-stripping and segmentation, Clark et al. [17] produced a large number of different processing pathways and tested these pathways on twenty MRI images taken from the same subject. They found that the most "optimal" processing pathway yielded volume estimates that were on average three times less variable than those estimates calculated by less "optimal" pathways. They also demonstrated that the choice of the segmentation algorithm had the greatest impact on the variability of the final segmentation, whereas intensity correction and skull-stripping algorithms had little effect on the overall tissue segmentation reliability. In contrast to those findings, Fein et al. [18] showed that skull-stripping may greatly improve the power of structural brain analysis. Acosta-Cabronero et al. [19] evaluated the impact of skull-stripping and intensity correction algorithms on the subsequent segmentation. In accordance with the findings of Fein et al. [18], they reported a large influence of those preprocessing steps.
In 2009, Klauschen et al. [20] conducted a systematic evaluation of different segmentation algorithms. They used simulated brain data that were generated based on varying brain anatomy and varying image quality, as well as real images from nine different individuals and test-retest images of 48 individuals. They tested the performance of three commonly used segmentation algorithms, provided by the software packages SPM5, FSL, and FreeSurfer. Within-segmenter analyses revealed volume differences greater than 15%. Between-segmenter comparisons showed an average discrepancy of 24% for real MRI images. The results of Klauschen et al. [20] suggested that automated brain segmentation algorithms might be seriously limited in the fine discrimination of tissue classes. Most importantly, their study cast serious doubt on the capability of automated segmentation algorithms to detect changes in brain structure in longitudinal studies.
To provide information to the community regarding which gray matter segmentation procedure they can build upon, we present a systematic evaluation of accuracy and reliability of standard gray matter segmentation algorithms. Whereas Clark et al. [17] emphasized the comparison of different processing pipelines with permuting preprocessing steps, and whereas Klauschen et al. [20] tested within and between-segmenter reliability and accuracy of three software packages, our investigation expands the work by Clark et al. and Klauschen et al. by providing a comprehensive investigation of both segmentation pipelines and within and between-segmenter accuracy and reliability using the latest versions of commonly used segmentation algorithms. Importantly, we provide measures of accuracy obtained from real T1 MRI images. To our knowledge, this has not been done before in a systematic manner. The fact that we tested the latest versions of available segmentation procedures is also of particular importance because, up to now, all studies concerned with the evaluation of automated segmentation [17,20] used segmentation algorithms that have since undergone substantial development.
In the current study, we evaluated the segmentation algorithms provided by (i) SPM8, (ii) VBM8, (iii) FSL, and (iv) FreeSurfer separately and in combination with algorithms for intensity correction and skull-stripping. We determined accuracy in terms of the Dice coefficient computed for the comparison of ground truth images and corresponding gray matter segmentations in simulated and real T1 brain images. We evaluated reliability in terms of standard deviation, coefficient of variation, and reliability coefficient of gray matter segmentations on real T1 images. In comparison to previous studies, our focus was on the simultaneous investigation of accuracy and reliability in combination with a systematic evaluation of the influence of each processing step on segmentation quality. Thus, we were able to examine (a) which processing step has the largest influence on segmentation accuracy both in simulated and real T1 MRI images, (b) how accuracy and reliability are linked, (c) how results from simulated and real T1 images differ, and (d) how preprocessing steps and segmentation algorithms interact.

Data Sets
To investigate the accuracy of different segmentation pathways we used (i) twenty simulated T1-weighted MRI images and corresponding discrete anatomical models provided by the Simulated Brain Database (http://mouldy.bic.mni.mcgill.ca/brainweb/; "BrainWeb data set") and (ii) 18 real T1-weighted MRI images with expert segmentations of 43 individual structures from the Internet Brain Segmentation Repository ("IBSR data set"). The latter images and their manual segmentations were provided by the Center for Morphometric Analysis at Massachusetts General Hospital and are available at http://www.cma.mgh.harvard.edu/ibsr/ (IBSR version 2.0).
The BrainWeb data set is based on digital phantoms that were made from twenty healthy adults [21,22,23,24]. The images are T1-weighted simulated data with the following parameters: spoiled FLASH sequence, TR = 22 ms, TE = 9.2 ms, flip angle = 30°, 1 mm isotropic voxel size (3% noise, 0% intensity-inhomogeneity). The corresponding discrete anatomical models consist of an integer value at each voxel that represents the tissue which contributes most to that voxel. We created binary gray matter masks of each of the discrete models and used these masks as ground truth for the corresponding simulated images. The gray matter signal-to-noise ratio in this data set ranged from 47 to 59 (M = 53, SD = 3.1; see Table 1).
The IBSR data set consists of high-resolution, T1-weighted volumetric images (resolution at least 1 × 1 × 1.5 mm) from 14 male and four female subjects (age: M = 38, SD = 22.4, including four individuals characterized as juvenile). These images have been reoriented into the Talairach orientation and processed by the Center for Morphometric Analysis bias field correction routines. Expert segmentations of the principal brain structures include: 3rd ventricle, 4th ventricle, brain stem, and bilaterally: accumbens area, amygdala, anterior amygdala, caudate nucleus, cerebellum cortex, exterior cerebellum, cerebellum white matter, cerebral cortex, exterior cerebral, cerebral white matter, hippocampus, inferior lateral ventricle, lateral ventricle, pallidum, putamen, thalamus proper, ventral diencephalon, and vessels. The segmentations are the result of a manually guided, semi-automatic segmentation technique conducted by a trained expert. Segmentations are provided as structure outlines and as filled volumes. The latter were used in the current study. For the filled volumes, fill codes represent the various structures that were segmented. For the purpose of the current study, the "trinary" representations of the segmentations were used. In these images, voxel values have been mapped from the code-to-structure codes into the basic tissue types: background, cerebrospinal fluid, gray matter and white matter. Binary masks for gray matter were created to serve as ground truth for the 18 images. The signal-to-noise ratio for gray matter in this data set ranged from 16 to 95 (M = 47, SD = 22.4; see Table 2).
To determine the reliability of different segmentation pathways we acquired ten MRI images of one individual (male, 40 years old; "Single Subject data set"). The first five images were acquired on five different days between October 29th and November 15th, 2010. The sixth image was acquired on March 28th, 2011. The remaining images were acquired in different sessions on May 19th, 2011. The images were acquired on a Siemens Trio (A Tim System, 3 Tesla) with software version Syngo MR B17. Acquisition parameters of the images were as follows: 3D MPRAGE sagittal acquisition of 176 slices (1 mm thickness) with a field of view of 256 × 256 mm and a matrix of 256 × 252 resulting in isotropic voxels of 1 × 1 × 1 mm³; TR = 1900 ms, TE = 2.52 ms, TI = 900 ms, flip angle = 9°, pixel bandwidth = 170 Hz, 12-channel head RX-coil, parallel imaging factor 2 (GRAPPA). The signal-to-noise ratio for gray matter in this data set ranged from 126 to 144 (M = 137, SD = 7.1; see Table 3).
Additionally, we used the reliability data set provided by the Open Access Series of Imaging Studies (www.oasis-brains.org; "OASIS data set") [25]. This data set contains twenty subjects scanned on subsequent visits within ninety days. In contrast to Klauschen et al. [20], we only used a subset of the images provided, namely subject numbers 61, 92, 111, 145, 150, 156, 236, 249, 285, and 379 (mean age: 22.7 years, SD = 4.7). We chose these particular subjects because they were scanned twice within twelve days at maximum (M = 4, SD = 3.8). That way, we ensured that the two scans were maximally similar. The signal-to-noise ratio in gray matter for this data set ranged from 18 to 32 (M = 25, SD = 4.2; see Table 4).
Our study primarily used simulated data and data publicly available. The single subject images were scans that were obtained in the context of continuous quality management at the scanner facility of the Department of Psychiatry and Psychotherapy, Philipps-University Marburg. The images were obtained from J. S., who made the data available for our study.

Algorithms
We used two preprocessing steps in our analyses: intensity correction and skull-stripping. For intensity correction we used the nonparametric nonuniform intensity normalization (N3) algorithm [26] and for skull-stripping (i) the "watershed" (WS) algorithm of FreeSurfer [27] and (ii) the BET algorithm of FSL (version 2.1) [28].
For gray matter segmentation we used (1) Segment and (2) New Segment of SPM8, (3) VBM8, (4) FAST of FSL, and (5) FreeSurfer. Segment performs segmentation, bias correction and normalization in one step (cf. "Unified Segmentation") [29]. The underlying generative model includes a correction for intensity non-uniformity and is estimated for a maximum a posteriori solution. New Segment, currently work in progress, is an extension of the unified segmentation approach that uses an improved registration model and an extended set of tissue probability maps [30]. The VBM8 segmentation algorithm uses a maximum a posteriori technique, together with a partial volume estimation and two denoising methods [31]. Additionally, this algorithm integrates the DARTEL normalization [32]. FAST uses a hidden Markov random field model and an associated Expectation-Maximization algorithm. The algorithm also corrects for intensity non-uniformities [33]. FreeSurfer is a set of tools for the analysis of structural and functional brain imaging data. In its processing stream it allows for subcortical segmentation and cortical parcellation based on prior cortical modeling and a Gaussian classifier atlas [34].

Study Design and Implementation
Figure 1 depicts the study design. By combining the algorithms mentioned above for intensity correction (2 possibilities: no intensity correction, N3), skull-stripping (3 possibilities: no skull-stripping, BET, WS), and segmentation (5 possibilities: Segment, New Segment, VBM8, FAST, FreeSurfer) we created thirty gray matter segmentation pathways in total. Seven of the pathways were not considered further in our study because we regarded these pathways as not relevant in practice. As FreeSurfer's default segmentation procedure already includes N3 intensity correction and WS skull-stripping, combinations with any of the preprocessing steps would have been redundant. In the case of FAST, only segmentation pathways that included a skull-stripping step were feasible, because the algorithm assumes brain-extracted data. Thus, we retained a total of 23 testable processing pathways. We processed each data set with each pathway, resulting in 1564 total calculations of gray matter maps.
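The pathway counts above can be checked with a short enumeration. The following Python sketch is an illustration only, not the code used in the study; the function name `is_feasible` is hypothetical, and the image count assumes that both scans of each of the ten OASIS subjects were processed.

```python
from itertools import product

INTENSITY = ["none", "N3"]
SKULL_STRIP = ["none", "BET", "WS"]
SEGMENTERS = ["Segment", "New Segment", "VBM8", "FAST", "FreeSurfer"]


def is_feasible(intensity, skull, segmenter):
    """Apply the two exclusion rules described in the text."""
    # FreeSurfer already includes N3 and WS, so only the plain variant is kept.
    if segmenter == "FreeSurfer":
        return intensity == "none" and skull == "none"
    # FAST assumes brain-extracted data, so a skull-stripping step is required.
    if segmenter == "FAST":
        return skull != "none"
    return True


all_pathways = list(product(INTENSITY, SKULL_STRIP, SEGMENTERS))
feasible = [p for p in all_pathways if is_feasible(*p)]

# Images per data set: BrainWeb, IBSR, Single Subject, OASIS (10 subjects x 2 scans).
n_images = 20 + 18 + 10 + 10 * 2

print(len(all_pathways))            # 30 pathways in total
print(len(feasible))                # 23 feasible pathways
print(len(feasible) * n_images)     # 1564 gray matter maps
```

Five FreeSurfer combinations and two FAST combinations are dropped, which accounts for the seven excluded pathways.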
Intensity correction was implemented with FreeSurfer's mri_nu_correct.mni program. Skull-stripping was implemented using FSL's bet and FreeSurfer's mri_watershed program with no additional parameters selected except those specifying the input and output volume. Segmentation via Segment and New Segment was implemented in the batch tool of SPM8 with standard parameters. For VBM8, we also used the batch tool of SPM8; here, we used the standard parameters except that we explicitly specified that the output should be saved in native space. For the segmentation with FreeSurfer we used the recon-all -all command. As FreeSurfer does not provide a gray matter map right away, we created gray matter masks from the results of the subcortical segmentations and the cortical parcellations and combined these two masks to get the desired gray matter map.
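As a rough sketch of how the preprocessing calls might be assembled, the Python snippet below builds (but does not execute) command lines for the tools named above. All file names and the subject identifier are hypothetical, and the helper names `preprocessing_commands` and `freesurfer_command` are illustrative assumptions. The source states only that input and output volumes were specified, so the exact invocations used in the study are not known.

```python
def preprocessing_commands(t1, intensity="none", skull="none"):
    """Return (commands, final_volume) for one preprocessing combination.

    Intensity correction is placed before skull-stripping, following the
    order in which the steps are described in the text.
    """
    commands = []
    current = t1
    if intensity == "N3":
        out = "t1_n3.nii"  # hypothetical output name
        commands.append(["mri_nu_correct.mni", "--i", current, "--o", out])
        current = out
    if skull == "BET":
        out = "t1_brain.nii.gz"  # hypothetical output name
        commands.append(["bet", current, out])
        current = out
    elif skull == "WS":
        out = "t1_brain.nii"  # hypothetical output name
        commands.append(["mri_watershed", current, out])
        current = out
    return commands, current


def freesurfer_command(t1, subject_id):
    """Full FreeSurfer stream; the subject ID is a hypothetical label."""
    return ["recon-all", "-s", subject_id, "-i", t1, "-all"]


cmds, final = preprocessing_commands("t1.nii", intensity="N3", skull="BET")
for cmd in cmds:
    print(" ".join(cmd))
print("segmentation input:", final)
print(" ".join(freesurfer_command("t1.nii", "subj01")))
```

Building the commands as argument lists keeps the sketch side-effect free; they could be passed to `subprocess.run` once the tools are installed and the paths are adapted.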

Evaluation
To quantify the accuracy of gray matter segmentations, we used the Dice coefficient (DC) [35], a similarity measure related to the Jaccard index. The DC is commonly used to determine accuracy of segmentation methods in neuroimaging settings [36,37,38] and is defined as the size of the intersection of the segmentation result and the ground truth divided by their average size: DC = 2TP/((FP + TP) + (TP + FN)), that is, the number of True Positives (TP) is divided by the average size of the segmentation result (False Positives (FP) + True Positives (TP)) and the ground truth (True Positives (TP) + False Negatives (FN)). A DC of 0 indicates no overlap; a value of 1 indicates perfect agreement. Using the DC, we evaluated the accuracy of the standard implementations of Segment, New Segment, VBM8, FAST, and FreeSurfer. With regard to the BrainWeb data set, we resliced the gray matter maps produced by the segmentation pathways to the corresponding ground truth images with a trilinear interpolation. Next, we compared the resliced gray matter maps (binarization threshold: p > 0.5) and the corresponding ground truth images voxel-wise to calculate the DC. With respect to the IBSR data set, segmentation results could be directly compared to the corresponding ground truth images, because original T1 images and ground truth images had the same resolution. Only in the case of FreeSurfer were gray matter maps again resliced to fit the resolution of the corresponding ground truth images. In addition to the DC, for each of the five standard segmentation algorithms, we determined the average false positive rate (f_p; cf. specificity) and the average false negative rate (f_n; cf. sensitivity). Moreover, to examine the impact of the choice of the binarization threshold, we also evaluated the gray matter maps using p > 0.10 and p > 0.90.
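The accuracy measures can be illustrated on a toy example. The following Python sketch treats an image as a flat list of voxels; the probability values and ground truth are made up for illustration. The normalization of the error rates (f_n = FN/(TP + FN), f_p = FP/(FP + TN)) is an assumption, since the text does not spell it out.

```python
def binarize(prob_map, threshold=0.5):
    """Turn gray matter probabilities into a 0/1 mask (p > threshold)."""
    return [1 if p > threshold else 0 for p in prob_map]


def confusion_counts(segmentation, truth):
    """Count TP, FP, FN, TN voxel-wise for two binary masks."""
    pairs = list(zip(segmentation, truth))
    tp = sum(1 for s, t in pairs if s == 1 and t == 1)
    fp = sum(1 for s, t in pairs if s == 1 and t == 0)
    fn = sum(1 for s, t in pairs if s == 0 and t == 1)
    tn = sum(1 for s, t in pairs if s == 0 and t == 0)
    return tp, fp, fn, tn


def dice(segmentation, truth):
    """DC = 2TP / ((FP + TP) + (TP + FN))."""
    tp, fp, fn, _ = confusion_counts(segmentation, truth)
    denom = (fp + tp) + (tp + fn)
    # Two empty masks agree trivially; returning 1.0 is a convention chosen here.
    return 2 * tp / denom if denom else 1.0


def error_rates(segmentation, truth):
    """Return (f_p, f_n) under the assumed normalization."""
    tp, fp, fn, tn = confusion_counts(segmentation, truth)
    f_p = fp / (fp + tn) if (fp + tn) else 0.0
    f_n = fn / (tp + fn) if (tp + fn) else 0.0
    return f_p, f_n


# Hypothetical gray matter probabilities and ground truth for six voxels.
prob_map = [0.95, 0.80, 0.55, 0.30, 0.05, 0.60]
truth = [1, 1, 1, 1, 0, 0]

for threshold in (0.10, 0.50, 0.90):
    mask = binarize(prob_map, threshold)
    print(threshold, round(dice(mask, truth), 3), error_rates(mask, truth))
```

In this toy case the DC at p > 0.5 is 0.75 (TP = 3, FP = 1, FN = 1), and raising the threshold to p > 0.90 lowers it to 0.4 because most true gray matter voxels are then missed.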
To assess the reliability of the five standard segmentation algorithms, we initially used the Single Subject data set. We calculated the variability in segmented gray matter volumes in terms of the standard deviation in mm³ and in terms of the coefficient of variation c_v, which is defined as the ratio of the standard deviation to the mean. Next, we calculated the reliability coefficient r for the segmented gray matter volumes measured for the OASIS data set. For this data set, we also computed the average deviation in volume (in %) between the first and second scan.
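The reliability measures can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the reliability coefficient r is taken to be the Pearson correlation between first-scan and second-scan volumes, the percent deviation is expressed relative to the first scan, and the sample standard deviation (n - 1) is used. None of these choices is confirmed by the text, and the test-retest volumes below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def coefficient_of_variation(volumes):
    """c_v = standard deviation / mean (sample SD, an assumption)."""
    return stdev(volumes) / mean(volumes)


def pearson_r(x, y):
    """Pearson correlation, used here as a stand-in for the reliability coefficient r."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_percent_deviation(scan1, scan2):
    """Average absolute volume difference in % of the first scan (an assumption)."""
    return mean(100 * abs(b - a) / a for a, b in zip(scan1, scan2))


# Hypothetical test-retest gray matter volumes in mm^3 for three subjects.
scan1 = [700000, 750000, 800000]
scan2 = [707000, 742500, 808000]

print(round(pearson_r(scan1, scan2), 3))
print(mean_percent_deviation(scan1, scan2))

# Consistency check with values reported in the Reliability section:
# FreeSurfer, SD = 4504 mm^3 at a mean volume of 731379 mm^3.
print(round(100 * 4504 / 731379, 1))  # c_v in %
```

The last line reproduces the reported c_v of 0.6% for FreeSurfer from its reported standard deviation and mean volume.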
Finally, to determine which processing factor had the largest impact on segmentation accuracy, we computed separate univariate, three-way analyses of variance (ANOVAs) with corresponding pairwise comparisons for the BrainWeb data set and the IBSR data set. In these analyses, DC was the dependent variable and Intensity Correction, Skull-Stripping, and Segmentation were the respective factors for repeated measures. To create a balanced design for statistical analysis, we excluded all pathways that used FreeSurfer for segmentation and all pathways that did not use any skull-stripping. Thus, we computed 2 (Intensity correction: none, N3) × 2 (Skull-stripping: BET, WS) × 4 (Segmentation: Segment, New Segment, VBM8, FAST) Greenhouse-Geisser corrected ANOVAs with repeated measures on all factors. To further determine which of all five segmentation algorithms tested achieved the highest accuracy, we additionally computed one-way ANOVAs for the factor Segmentation (Segmentation: Segment, New Segment, VBM8, FAST, FreeSurfer) separately for the BrainWeb and the IBSR data set. Likewise, we computed a one-way ANOVA for the factor Skull-Stripping (Skull-Stripping: none, BET, WS) to test whether this preprocessing step actually increased or decreased segmentation accuracy in comparison to no prior brain extraction.

Accuracy
Panel A of Figure 2 shows that on the BrainWeb data set FAST, VBM8, Segment, and New Segment reached an average DC greater than 0.93. Figure 2, panels C and D, illustrates that all segmentation algorithms were especially prone to reduced sensitivity, that is, they tended to underestimate gray matter volume (BrainWeb data set: all f_n > 5%; IBSR data set: all f_n > 20%). At the same time, all segmentation algorithms showed high specificity (BrainWeb data set: all f_p < 1.5%; IBSR data set: all f_p < 2%). On real data (cf. IBSR data set), New Segment demonstrated the highest sensitivity (f_n = 24.3%) and FreeSurfer the lowest (f_n = 49.6%).
For p > 0.10 instead of p > 0.50 as binarization threshold, all segmentation algorithms yielded similar accuracy on the BrainWeb data set (DCs ranged from 0.88 to 0.90). For the IBSR data set, however, Segment, New Segment, VBM8 and FAST yielded comparable results (DCs ranged from 0.84 to 0.87), whereas FreeSurfer showed significantly decreased accuracy (0.66). For p > 0.90, on the BrainWeb data set, again all segmentation algorithms demonstrated comparable accuracy (DCs ranged from 0.79 to 0.83). On the IBSR data set, however, only Segment and New Segment showed feasible accuracy (DCs 0.70), whereas VBM8, FAST, and FreeSurfer
Figure 1. Overview of the study design. In total we processed fifty data sets: (i) twenty simulated brains of the Simulated Brain Database with different anatomical models ("BrainWeb data set"), (ii) 18 different real subjects with corresponding expert segmentations ("IBSR data set"), (iii) ten T1-weighted scans of the same individual ("Single Subject data set"), and (iv) ten pairs of images of subjects who were scanned twice within a maximum of twelve days ("OASIS data set"). We created in total thirty segmentation pathways, each consisting of an intensity non-uniformity correction preprocessing step (no intensity correction or N3), a skull-stripping preprocessing step (no skull-stripping, BET, or WS), and the segmentation of gray matter (via Segment, New Segment, VBM8, FAST, or FreeSurfer). Once created, we determined that 23 of the total constructed segmentation pathways were feasible for evaluation and these were investigated in the analysis (infeasible pathways are represented with a dot as end marker). To determine the accuracy of the different segmentation pathways we calculated the Dice coefficient for the gray matter maps and corresponding ground truth images for the twenty simulated brains and the IBSR data set. We tested the reliability of the segmentation pathways by (i) determining the variability in terms of standard deviation and coefficient of variation with respect to gray matter volume on the Single Subject data images, and (ii) by calculating the test-retest reliability with respect to gray matter volume for the OASIS data set.

Reliability
As shown in Figure 2, panel E, FreeSurfer showed by far the least variability in segmented gray matter volumes calculated for the Single Subject data set (SD = 4504 mm³, c_v = 0.6%). VBM8 yielded the second most reliable results (SD = 9998 mm³, c_v = 1%), whereas FAST (SD = 26583 mm³, c_v = 3%) and Segment (SD = 26651 mm³, c_v = 3%) showed the largest variability in segmented gray matter volumes. The mean segmented gray matter volumes measured by the five standard segmentation algorithms ranged from 731379 mm³ (FreeSurfer) up to 820202 mm³ (New Segment). Thus, the maximum discrepancy between the different segmentation algorithms was 11%. With the exception of FAST, all segmentation algorithms showed very high test-retest reliability on the OASIS data set (all r > 0.97; please see

Accuracy of Current Segmentation Algorithms
In the current study, the gray matter segmentation algorithms Segment, New Segment, VBM8, and FAST achieved very high accuracy on simulated T1-weighted MRI images (all DCs > 0.93) and good accuracy on real T1-weighted MRI images (all DCs > 0.79). FreeSurfer, however, only achieved a mean DC of 0.88 on simulated T1 data and a mean DC of 0.58 on real T1 data. In comparable MRI settings, DCs commonly range between 0.75 and 0.97 [36,37,38,39,40]. From a practical point of view, the average DCs of Segment, New Segment, VBM8, and FAST are closely comparable. Only FreeSurfer's accuracy must be considered substantially lower in comparison to the other segmentation algorithms (please see below). Our findings are in agreement with the results of Klauschen et al. [20], who demonstrated that FAST and Segment have a similar level of sensitivity for gray matter on simulated T1 images (Klauschen et al.: FAST: 91%, Segment: 90%; current study: FAST: 94%, Segment: 92%). Importantly, however, our results also demonstrate that sensitivity on real T1 images is substantially lower than on simulated data (e.g., FAST: 70%, Segment: 75%). Our results are also in accordance with Klauschen et al.'s finding that FreeSurfer performs substantially worse than other segmentation algorithms (sensitivity Klauschen et al.: 83%; current study: 82% (simulated T1 data), 50% (real T1 data)). Notably, we also reproduced the findings of Klauschen et al. [20] in that all segmentation algorithms underestimate the actual gray matter volume. This suggests that the latest algorithmic advancements have not improved segmentation accuracy significantly.

Reliability of Current Segmentation Algorithms
VBM8 and FreeSurfer demonstrated the most reliable results, whereas FAST showed highly variable results. In terms of test-retest reliability, all segmentation algorithms showed almost perfect agreement in segmented gray matter volume. However, the test-retest reliability coefficient for FAST was r = 0.90, equal to an average deviation of 4% in segmented gray matter volume between the first and second scan. All other segmentation algorithms demonstrated a reliability coefficient of at least 0.99. This finding suggests that, of all tested segmentation algorithms, FAST is most sensitive to varying image quality. FreeSurfer and VBM8, on the other hand, were the least sensitive to noise factors introduced by different scan sessions. Nevertheless, VBM8 and FreeSurfer still showed an average volume difference between the first and second scan of 2% and 1%, respectively. Taken together, these findings have two important implications: (1) Despite high test-retest reliability, segmentation pathways might still show considerable variations in segmented gray matter volume when several scans of the same subject are segmented. Thus, in accordance with the conclusions of Klauschen et al. [20], our findings further suggest that even segmentation algorithms that are considered both very accurate and very reliable still introduce a "segmenter-factor" of up to 3%. This factor has to be considered when morphometric studies of the brain are planned and particularly when results of such studies are interpreted. (2) The fact that we observed pronounced differences in mean segmented gray matter volumes between the different segmentation algorithms strongly emphasizes that findings of segmentation studies that used different segmentation algorithms or different segmentation procedures are not easily comparable.

Tests on Real Versus Tests on Simulated MRI Images
To our knowledge, our study is the first to investigate segmentation accuracy on real T1-weighted MRI images. Results obtained from simulated images are always limited in their generalized application because simulated images cannot capture the full complexity of real MRI images. Nevertheless, the use of simulated images provides a feasible way of evaluating accuracy, because perfect ground truth images exist in this case. However, as can be seen for example in the case of FAST, it is not sufficient to use simulated MRI images to get an idea of how a segmentation algorithm will perform on real data sets. It is of crucial importance to perform tests on both simulated and real MRI data sets, as one may not know all factors that influence the performance of automated segmentation algorithms beforehand. In the current study, for example in the case of FAST, only tests on real data sets revealed that the algorithm is highly sensitive to changes in image quality, and only tests on real T1 images could demonstrate that in practice gray matter sensitivity of segmentation algorithms may be up to five times smaller than suggested by evaluations on simulated T1 images. Likewise, Klauschen et al. [20] used simulated data sets of the same subject with variable image quality. In their analysis FreeSurfer showed the largest variability in segmented gray matter volume while FAST demonstrated a variability that was significantly lower by comparison. Notably, in our study, FreeSurfer showed practically no variability whereas FAST showed the highest variability on the Single Subject data set. This suggests that even using simulated MRI images with varying image quality cannot replace systematic evaluation of automated segmentation algorithms on real data sets.

The Impact of Processing Steps
In our study, intensity correction and skull-stripping algorithms applied prior to gray matter segmentation had no impact on later segmentation accuracy that would be of practical relevance. Similarly, Clark et al. [17] found no pronounced differences in segmentation reliability due to intensity correction on their single subject data set. Klauschen et al. [20] also reported no significant influence of skull-stripping algorithms on segmented gray matter volume. Fein et al. [18] reported increased sensitivity in gray matter segmentation for skull-stripped images. However, their results were obtained from SPM2, whose segmentation algorithm suffered from an inaccurate normalization to a T1 brain template and associated problems with the accurate extraction of the brain. Since the implementation of the Unified Segmentation approach [29] these issues apparently no longer exist, as can be seen from the fact that Segment's accuracy did not profit from skull-stripping. Acosta-Cabronero et al. [19] demonstrated that skull-stripping may improve SPM5's segmentation accuracy. However, they used skull-stripping prior to intensity correction and furthermore applied algorithms with customized parameters. Because of this, it is difficult to directly compare their results with the results of the current study.
Our results indicate that Segment in particular is sensitive to the type of skull-stripping applied. More importantly, however, our results suggest that skull-stripping may actually decrease segmentation accuracy. This effect might be due to an inaccurate skull-stripping process that cuts out parts of the brain or leaves parts of the skull in the image. These shortcomings could of course be corrected by manual editing. However, in our study, we explicitly wanted to concentrate on fully automated procedures that may be chosen by the average user. Therefore we must conclude that skull-stripping (BET, WS), in general, should not be used prior to segmentation. The exception is, of course, the processing stream of FSL, where segmentation without prior skull-stripping does not produce feasible results. In this case, BET should be used.

Performance of FreeSurfer
The results of our analysis of FreeSurfer's accuracy and reliability have to be interpreted cautiously. FreeSurfer, in contrast to all other segmentation algorithms reported here, segments and reports gray matter volumes of structures as a whole. All other segmentation algorithms segment a 3D T1 MRI image voxel-wise into tissue classes. In FreeSurfer, the definition of the thalamus, for example, extends into the lateral thalamic nuclei, which have a rather heterogeneous tissue composition. The structure labeled as "thalamus" by FreeSurfer's segmentation algorithm may therefore contain both white matter and gray matter. This mechanism may be the reason for FreeSurfer's decreased segmentation accuracy and its high reliability. As FreeSurfer labels structures as a whole, the segmentation algorithm is not very sensitive to changes in image quality or noise, which in other algorithms may lead to misclassifications of single voxels within structures. However, the high reliability caused by this mechanism may become problematic as it is accompanied by low accuracy. Thus, researchers first need to decide whether they want to focus on identifying cerebral and subcortical structures or gray matter tissue (please see also [41]).

Limitations and Outlook
The results of the current study have their own limitations. We systematically tested a considerable number of different factors, which may influence segmentation quality, and examined two major measures of segmentation quality, namely accuracy and reliability. However, it is inappropriate to generalize towards every possible scan setting from only the results of our study. Most importantly, this study focused on data processing and was not designed to test technical factors during data acquisition, such as type of coil, impact of parallel imaging, acquisition protocol, or field strength. Future studies may also address regional differences in segmentation accuracy between different segmentation algorithms. Clark et al. [17] implicitly made the first attempt in this direction by calculating gray matter volumes of the major lobes of the brain separately, instead of comparing total gray matter volumes. The focus of further investigations should be to determine which brain segmentation algorithm is most accurate for which region of the brain, most importantly, which segmentation algorithm is best suited for the segmentation of cortical areas, and which algorithm provides the most accurate results for subcortical areas.

Conclusions
Our findings address crucial factors that influence the quality of gray matter segmentation. Additionally, our results provide guidance in designing state-of-the-art segmentation pathways optimized for individual software settings. Our study emphasizes that comparisons of the results of morphological studies using different segmentation algorithms should be made with great caution. In conclusion, our results suggest that researchers must be aware of the fact that the choice of the segmentation pathway used in a morphometric investigation can easily introduce a "segmenter-effect" on the order of 2-3% variability in segmented gray matter volume. Researchers therefore need to optimize their scanning and processing procedure with respect to their individual settings. Before performing a study, the accuracy and reliability of a specific segmentation pathway have to be adequately determined to enable correct interpretation of the results.