Evaluation of machine learning algorithms and structural features for optimal MRI-based diagnostic prediction in psychosis

A relatively large number of studies have investigated the power of structural magnetic resonance imaging (sMRI) data to discriminate patients with schizophrenia from healthy controls. However, very few of them have also included patients with bipolar disorder, allowing the clinically relevant discrimination between the two psychotic diagnoses. To assess the efficacy of sMRI data for diagnostic prediction in psychosis, we objectively evaluated the discriminative power of a wide range of commonly used machine learning algorithms (ridge, lasso, elastic net and L0 norm regularized logistic regressions, a support vector classifier, regularized discriminant analysis, random forests and a Gaussian process classifier) on the main sMRI features, including grey and white matter voxel-based morphometry (VBM), vertex-based cortical thickness and volume, region of interest volumetric measures and wavelet-based morphometry (WBM) maps. All possible combinations of algorithms and data features were considered in pairwise classifications of matched samples of healthy controls (N = 127), patients with schizophrenia (N = 128) and patients with bipolar disorder (N = 128). Results show that the selection of feature type is important, with grey matter VBM (without data reduction) delivering the best diagnostic prediction rates (averaging over classifiers: schizophrenia vs. healthy 75%, bipolar disorder vs. healthy 63% and schizophrenia vs. bipolar disorder 62%), whereas algorithms usually yielded very similar results. Indeed, those grey matter VBM accuracy rates were not even improved by combining all feature types in a single prediction model. Further multi-class classifications considering the three groups simultaneously made evident a lack of predictive power for the bipolar group, probably due to its intermediate anatomical features, located between those observed in healthy controls and those found in patients with schizophrenia.
Finally, we provide MRIPredict (https://www.nitrc.org/projects/mripredict/), a free tool for SPM, FSL and R, to easily carry out voxelwise predictions based on VBM images.


Introduction
Although the role of statistical methods in medical research has historically been dominated by inference, their use for prediction has become more relevant in recent years. In part, this shift in objectives has been made possible by the availability of large amounts of data together with the development of new computational tools that can deal with these large datasets [1]. Among other sources, structural magnetic resonance imaging (sMRI) data has been proposed as an input for clinical diagnosis and outcome prediction in different clinical areas [2].
Initially, due to the large extent of MRI datasets, intermediate steps aimed at reducing the number of predictor variables were required for computational feasibility. Such reduction could either involve a supervised step, where the researcher selected specific voxels or brain regions based on a priori information (i.e. feature selection), or an unsupervised procedure like a principal or independent component analysis [3]. In both cases, though, the risk of discarding relevant information was present. In recent years, however, optimized versions of commonly used classifiers which can be readily applied to MRI datasets without needing dimensionality reduction have been developed [4].
Studies evaluating the predictive power of sMRI images are particularly numerous in Alzheimer's disease prediction [5], psychiatric diagnosis [6,7] and in the assessment of brain tumor characteristics [8]. Still, it is difficult to extract reliable conclusions on optimal prediction procedures from individual studies as they usually evaluate the performance of specific algorithms on image sets that have been acquired and processed in particular ways, with only a small subset of studies systematically comparing the prediction capacity of available algorithms. While this comparison has been recently made for several pathologies including multiple sclerosis [9], fibromyalgia [10] and Alzheimer's disease [11,12] some other relevant clinical areas such as psychosis still lack a systematic evaluation.
Specifically, in the area of psychosis, where studies have traditionally focused on reporting statistically significant differences involving patients with schizophrenia and patients with bipolar disorder, there is a current interest in predicting the final diagnosis for patients undergoing a psychotic episode by means of these classifying algorithms. Most of the sMRI studies carried out so far, though, have mainly assessed the classification accuracy between patients with schizophrenia and controls [7], with only a few evaluating the discriminative power of sMRI to separate patients with bipolar disorder from healthy subjects [13-16] and only one of them performing the most clinically relevant classification, between bipolar and schizophrenic subjects [14].
Here, in order to objectively assess the utility of sMRI images in diagnostic prediction in psychosis, we systematically evaluate the performance of a large set of available machine learning algorithms (ridge, lasso, elastic net and L0 norm regularized logistic regressions, a support vector classifier, regularized discriminant analysis, random forests and a Gaussian process classifier) on some of the most commonly used sMRI data formats (grey and white matter voxel-based morphometry, vertex-based cortical thickness and volume, region of interest volumetric measures and wavelet-based morphometry maps). All possible combinations of algorithms and data formats are used to estimate the discriminability between well-matched samples of healthy controls (N = 127), of patients with schizophrenia (N = 128) and of patients with bipolar disorder (N = 128). Furthermore, to maximize the predictive power of sMRI images, all different feature types are also combined in a single prediction model. Finally, several multi-class approaches are considered in order to evaluate the accuracy rates to be found in a simultaneous classification of the three groups. As detailed later, we also provide MRIPredict, a free tool for SPM, FSL and R that allows an easy specification, validation and fitting of voxelwise models that can later be applied to new MRI datasets, even if they have different voxel dimensions (software available at https://www.nitrc.org/projects/mripredict/).

Material and methods

Sample
A sample of N = 128 individuals with a diagnosis of schizophrenia according to DSM-IV criteria was recruited from Benito Menni CASM and Mare de Déu de la Mercè hospitals (Spain). All individuals were right handed, in the 18 to 65 age interval, with no history of brain trauma or neurological disease, and had not shown alcohol/substance abuse in the last 12 months. All patients but one were taking antipsychotic medication (atypical N = 82, typical N = 9, both N = 30, unknown N = 6; chlorpromazine equivalents: 824.0 mg (mean), 642.8 mg (sd)). Considering the same exclusion criteria, a second sample of N = 128 patients with a diagnosis of type I bipolar disorder, matched for age, gender and pre-morbid IQ as estimated with the Word Accentuation Test [17], was recruited from the Benito Menni CASM and the Hospital Clínic de Barcelona (Spain). When scanned, 77 were in euthymia while 28 were undergoing a manic phase and 23 were in a depressive phase. Of these patients, 75 were taking antipsychotic medication (atypical N = 64, typical N = 4, both N = 7; chlorpromazine equivalents: 399.4 mg (mean), 388.0 mg (sd)), 105 were taking mood stabilizers and 33 antidepressants. Finally, a third sample of N = 127 healthy control individuals, matched by the same criteria, was recruited from nonmedical hospital staff, their relatives and acquaintances, plus independent sources in the community. Apart from the previous exclusion criteria, controls reporting a history of mental illness and/or treatment with psychotropic medication were discarded. Table 1 gives further demographic and clinical information on the three samples. All participants gave written informed consent and the study was approved by the Clinical Research Ethics Committee of the Sisters Hospitallers (Comité de Ética de Investigación Clínica de las Hermanas Hospitalarias).

sMRI data features
For each subject, a structural brain image was acquired with a 1.5-T GE Signa scanner (General Electric Medical Systems, Milwaukee, WI, USA) using the following acquisition parameters: T1-weighted sequence, 180 axial slices, 1 mm slice thickness with no gap, 512 × 512 matrix size, 0.5 × 0.5 × 1 mm³ voxel resolution, 4 ms echo time, 2000 ms repetition time, 15° flip angle. Once acquired, information contained in the T1 images was summarized in the following data features (see also Fig 1):
1. Cortical thickness of left and right hemispheres: sMRI data were analyzed with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/). Briefly, the pre-processing included removal of non-brain tissue, automated Talairach transformation, tessellation of the grey and white matter boundaries and surface deformation [18]. A number of deformation procedures were performed in the data analysis pipeline, including surface inflation and registration to a spherical atlas. Intensity and continuity information from the entire three dimensional images in the segmentation and deformation procedures were used to produce vertex-wise representations of cortical thickness (CT) in each vertex across the cortical mantle.
2. Cortical volume of left and right hemispheres: In addition to CT, FreeSurfer also computes vertex-wise cortical surface area (SA). CT and SA are multiplied to obtain a vertex-wise representation of cortical volume (CV). All individual CT and CV maps were smoothed using a Gaussian kernel with full width at half maximum (FWHM) of 30 mm [19].
3. Grey and White matter voxel based morphometry (VBM) images: Structural images were segmented into grey and white matter partial volume images in the native space using the unified segmentation algorithm included in SPM12 [20]. Then, the original structural images were brain-extracted [21] and aligned to the Montreal Neurological Institute MNI152 2mm standard template using FSL registration tools [22]. The resulting deformation fields were applied to the initially segmented images to obtain grey and white matter normalized images. To reduce computational cost, those images were subsampled to a 4 x 4 x 4 mm resolution.
4. Grey and White matter wavelet based morphometry (WBM) images: Taking the grey and white matter normalized VBM images as inputs, we applied the methodology explained in [23] and implemented in the WBM toolbox (http://www.wbmorphometry.com/). Initially, input images were smoothed with a Gaussian kernel (FWHM ≈ 7 mm) and were transformed to the wavelet domain using a 3D discrete orthogonal wavelet transform based on symmetric spline wavelets with degree n = 3 and resolution level J = 2. By means of the minimum description length procedure, coefficients that best represented grey and white matter anatomy on all subjects were retained for the classifications.

5. Region of interest (ROI) based brain volumes and their interactions: FreeSurfer was used to parcellate the brain parenchyma in cortical and subcortical ROIs [24]. Mean volume values for these ROIs were extracted and used together with cerebellum, white matter and ventricle volumes as independent variables for classification. In addition, after standardizing their values, we calculated the pairwise products between all volumes to model pairwise interactions. This extended set of variables was also supplied to the classifiers together with the original regional volumes.
6. Joint dataset combining all previous data features: We evaluated the potential improvement in classification accuracy achieved by merging data from all feature types in a single matrix. The number of independent variables involved, however, made the direct implementation of algorithms computationally infeasible. To reduce the number of variables in a meaningful way we implemented two different strategies: (a) following Wang et al. [25], we applied a prior dimensionality reduction through a principal component analysis (PCA), and (b) following an approach similar to that of Dai et al. [26], we calculated the univariate t-values from pairwise group comparisons, selecting only the 1% of variables with the highest t-scores (in absolute value).
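Strategy (b) can be illustrated with a minimal sketch. It is not the code used in the study: the function names are hypothetical, the Welch form of the t statistic is an assumption (the text does not state which variant was used), and the selection is assumed to be computed on training subjects only so that test data stay independent.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic (Welch form, an assumption) for two lists."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_fraction_by_t(X, y, fraction=0.01):
    """Return the sorted indices of the variables with the largest |t|.

    X: list of rows, one per training subject; y: 0/1 group labels.
    """
    n_vars = len(X[0])
    scores = []
    for j in range(n_vars):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append(abs(welch_t(g0, g1)))
    # Keep at least one variable, otherwise the requested fraction.
    n_keep = max(1, round(fraction * n_vars))
    ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])
```

The returned indices would then be used to subset both the training and the test matrices before fitting a classifier.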

Learning algorithms
Eight classifiers, selected for their habitual usage and their computational efficiency, were applied to the different data features described in the previous section. Prediction capability of each data feature-classifier pair on the three possible classifications involving the two groups of patients and controls was quantified. Specifically, the algorithms evaluated were: (I) ridge and (II) lasso logistic regressions [27], (III) elastic net regularization [28], (IV) L0-norm regularization [29], (V) a support vector classifier (SVC) [4], (VI) regularized discriminant function analysis (RDA) [30], (VII) a Gaussian process classifier (GPC) [31], and (VIII) Random forests (RF). A theoretical overview of these algorithms together with technical details on their implementation can be found in the S1 Appendix (Description of learning algorithms).

General procedure and cross validation scheme
To obtain an unbiased assessment of performance, we applied the classifiers, which had previously been built on training samples, to a completely independent group of individuals (i.e. a test sample). A 10-fold cross-validation scheme was followed to divide the original sample (made of all individuals belonging to the two groups) into 10 non-overlapping partitions [4]. For each partition, the individuals included were considered as the test sample and the remaining individuals as the training sample. A graphical representation of the general procedure for evaluating classification accuracy is given in Fig 2. For all classifiers but RF and GPC, cross-validation is used at two levels: at an outer level the complete sample is divided into 10 parts for training and testing, but within each training sample a second, internal cross-validation is usually carried out to select the optimal values for the regularization parameters. From a range of parameter values, those minimizing the classification error in this internal cross-validation are used to build the classifier, which is later applied to the test data to obtain an objective assessment of classification accuracy. This procedure is repeated 10 times (once for each of the 10-fold partitions), generating 10 accuracy estimates.
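The two-level scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the function names are hypothetical, the folds are not stratified by group, the number of inner folds (5) is an arbitrary choice not given in the text, and `fit` stands for any user-supplied routine that takes training data and a regularization value and returns a prediction function.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Split range(n) into k non-overlapping, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(X, y, fit, param, k, seed):
    """Mean accuracy of fit(..., param) over a k-fold cross-validation."""
    accs = []
    for test in k_fold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train], param)
        hits = sum(predict(X[i]) == y[i] for i in test)
        accs.append(hits / len(test))
    return sum(accs) / len(accs)


def nested_cv(X, y, fit, params, outer_k=10, inner_k=5, seed=0):
    """Return one test accuracy per outer fold.

    The parameter value is chosen by an inner cross-validation that
    only sees the outer training sample, never the outer test sample.
    """
    accuracies = []
    for test in k_fold_indices(len(y), outer_k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        best = max(
            params,
            key=lambda p: cv_accuracy(X_tr, y_tr, fit, p, inner_k, seed + 1),
        )
        predict = fit(X_tr, y_tr, best)
        hits = sum(predict(X[i]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return accuracies
```

Because the outer test fold is never seen by the inner loop, the ten returned accuracies are not inflated by parameter tuning.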
To avoid over-optimistic results the effect of nuisance covariates on test data should be regressed out by using those coefficients fitted in the training data (i.e. test data should not be used in the fitting of nuisance covariates) (see Fig 2). Individual performances of each algorithm-feature combination are given as frequencies of test individuals successfully classified (assuming a p(X) > 0.5 threshold) and by other quantities such as the area under the (receiver operating) curve (AUC) [4]. The receiver operating curve (ROC), which is based on the relative performances considering the whole range of possible probability thresholds (from 0 to 1) has an area that ranges from 0.5 for classifiers without any prediction capability to 1 for perfectly classifying algorithms.
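As a minimal sketch of the two ideas in this paragraph, the following functions remove a single nuisance covariate using coefficients fitted on training data only, and compute accuracy and AUC from predicted probabilities. The function names are hypothetical, a single covariate with a simple linear fit is an assumption made for brevity, and the AUC is computed through its rank interpretation (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one).

```python
def fit_nuisance(train_values, train_covariate):
    """Least-squares intercept and slope of a feature on one covariate."""
    n = len(train_values)
    mx = sum(train_covariate) / n
    my = sum(train_values) / n
    sxx = sum((x - mx) ** 2 for x in train_covariate)
    sxy = sum((x - mx) * (v - my)
              for x, v in zip(train_covariate, train_values))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my - slope * mx, slope


def remove_nuisance(values, covariate, intercept, slope):
    """Residualise values using coefficients estimated on training data."""
    return [v - (intercept + slope * x) for v, x in zip(values, covariate)]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of cases correctly classified at the given threshold."""
    hits = sum((p > threshold) == bool(l) for p, l in zip(probs, labels))
    return hits / len(labels)


def auc(probs, labels):
    """Area under the ROC curve; ties between scores count one half."""
    pos = [p for p, l in zip(probs, labels) if l == 1]
    neg = [p for p, l in zip(probs, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

In use, `fit_nuisance` would be called once per feature on the training sample, and the resulting intercept and slope passed to `remove_nuisance` for both the training and the test sample.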

Multi-class classifiers
Although many of the learning algorithms used here were initially designed for two group classifications, extensions have been built to deal with more than two groups simultaneously. Here we have applied three different approaches for simultaneous classification of the three groups.
On the one hand, after performing the three possible pairwise classifications (among our three groups), we assigned each test individual to the class with the highest mean probability (i.e. a one-versus-one classification approach [4]). Alternatively, we carried out classifications between each class and a merged class containing subjects from the two remaining classes, assigning test individuals to the non-merged class with the highest probability (i.e. a one-versus-all approach [4]). Finally, for those classifiers with inbuilt multi-class functionality (all but the L0-norm and SVC) we used the methods available. These involved the regularized multinomial regression (for the ridge, lasso and elastic net), the multi-class regularized discriminant analysis, and the multi-class versions of the GPC and RF. It should be noted that in all three-class classifications we had a 0.333 probability of assigning, by chance, the individual to the correct class.

Results
When accuracy rates are averaged over classifiers, GM-WBM and, to a higher degree, GM-VBM significantly outperform most of the other feature types in the healthy vs. schizophrenia and in the healthy vs. bipolar classifications (Fig 6). However, for the bipolar vs. schizophrenia classification this trend is less clear. In contrast, when classification rates are averaged over features and algorithms are compared, no single classifier outperforms the others (Fig 7), and a poor performance of the L0-norm classifier is the only distinctive and [...]
Classification levels achieved by the different algorithms when applied to grey matter VBM (the best performing feature type) are shown in Table 2. Best rates were attained in the healthy vs. schizophrenia classifications, with accuracies as high as 0.77 for the SVC, although no classifier significantly outperformed any other (Wilcoxon paired test at p < 0.05; average accuracy over all classifiers equaled 0.75). Misclassification between healthy individuals and individuals with bipolar disorder was higher (averaged accuracy declined to 0.63), but again no single classifier outperformed any other. Finally, the classification between both psychiatric disorders reported similar classification levels, with a mean accuracy of 0.62, although here the ridge regression algorithm (mean accuracy of 0.66) significantly outperformed the L0-norm classifier (mean accuracy of 0.58; Wilcoxon paired test, p = 0.035). No other comparison was significant. [...] Table 2), which declined to 0.69 for the healthy vs. bipolar disorder classification and to 0.68 in the classification between both disorders. A bootstrap based statistical test comparing the AUC between classifiers reported very few significant differences between algorithm performances. These only included a higher AUC for the SVC versus the elastic net regression (p = 0.005) and versus the GPC (p = 0.009) in the healthy vs. schizophrenia classification.
To gain some insight into the inner functioning of the algorithms applied to grey matter VBM data, maps of fitted coefficients and weights were obtained for most of the classifiers (see Fig 9). In Fig 9, coefficient maps are drawn together with maps of effect sizes, which were derived from standard univariate t-tests applied to each voxel (i.e. the standard method for generating maps of differences in group comparisons). Although the appearance of the coefficient maps clearly differed among classifiers, there was broad agreement between their most prominent patterns and the features in the effect size maps. A more quantitative view of such agreement is provided by the plots of Fig 10 where, in most cases, a monotonically increasing relationship is shown between effect size and coefficient value. In those cases where this relationship was not clear (the lasso in controls vs. schizophrenia and all RF classifications), the largest coefficients were still linked to the voxels with the largest effect sizes. The RF values, though, are not model coefficients but variable importance measures derived from the Gini index [4]. This agreement between coefficients and effect sizes links classifiers with likely anatomical group divergences.
When all information from the different data features was combined, and after applying PCA for dimensionality reduction, classifiers reported accuracies clearly lower than those achieved with grey matter VBM alone (without dimensionality reduction) (Fig 11A). Although most mean accuracies were higher than 0.5, bootstrap intervals revealed that for the two classifications involving bipolar subjects many of these were not significantly different from 0.5 (see Table 3). In contrast, when the top 1% of variables with the largest t-values were considered, accuracies reached levels very similar to those provided by grey matter VBM (see Fig 11B), and in all cases they were significantly larger than 0.5 (see bootstrap intervals in Table 3). In no case, however, were performances higher than those provided by grey matter VBM.
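The kind of bootstrap interval used to compare mean accuracies with the 0.5 chance level can be illustrated with a percentile bootstrap over the fold-wise accuracies. The exact resampling scheme used in the study is not described in this section, so the function below (a hypothetical name, resampling the 10 fold accuracies with replacement) is only one plausible reading.

```python
import random


def bootstrap_ci(values, n_boot=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Mean of each resample (drawn with replacement), sorted ascending.
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi
```

Under this reading, a classifier would be regarded as performing above chance in a two-group classification when the lower limit of the interval exceeds 0.5.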
Accuracies from one-versus-one multi-class classifications on grey matter VBM were, in general, lower than those delivered by pairwise classifications (see Fig 12). Furthermore, although significant predictive power was still found for controls (with an average accuracy of 60%) and for schizophrenia (with an average accuracy of 57%), classification rates for bipolar patients (with an average accuracy of 37%) were quite close to the 33% expected by chance. Indeed, for most classifiers the bootstrap confidence intervals included 33% (see Table 4), suggesting that multi-class algorithms do not classify bipolar patients reliably. The other two multi-class schemes (one-versus-all and inbuilt multi-class) delivered levels of accuracy similar to those of the one-versus-one design (see Tables 4 and 5). While mean overall accuracy was 51% for the one-versus-one approach, a value of 52% was attained for both the one-versus-all and inbuilt approaches. Again, all classifiers showed significant predictive power for the control group (average of 64% for both schemes) and the schizophrenia group (63% and 60%), but no reliable predictive power was found for the bipolar group (average accuracy of 30% and 32%, respectively).
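The one-versus-one assignment rule behind these results (each test individual goes to the class with the highest mean probability across the pairwise classifiers) can be sketched as follows. The function name and the class labels HC, SZ and BD are hypothetical, and the pairwise probabilities are assumed to be available as plain numbers.

```python
def one_vs_one_assign(pairwise_probs, classes):
    """Assign a subject to the class with the highest mean probability.

    pairwise_probs maps a pair (a, b) to the probability that the
    subject belongs to class a rather than class b.
    """
    mean_prob = {}
    for c in classes:
        probs = []
        for (a, b), p in pairwise_probs.items():
            if c == a:
                probs.append(p)
            elif c == b:
                probs.append(1.0 - p)
        mean_prob[c] = sum(probs) / len(probs)
    winner = max(classes, key=lambda c: mean_prob[c])
    return winner, mean_prob
```

For a class whose pairwise probabilities all sit near 0.5, as would be expected for a group lying between the other two, the mean probability rarely becomes the largest of the three, which is consistent with the low bipolar rates reported above.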

Discussion and conclusions
After applying the eight classifiers to the different feature types we can outline some general conclusions. First, it seems that while the choice of feature type may be relevant to achieving an optimal classification, the choice of classifier is not important. Most classifiers provide similar levels of accuracy when an adequate feature type is selected. Specifically, for the three pairwise classifications carried out here with patients with psychosis, grey matter VBM and, to a lesser extent, grey matter WBM are the feature types leading to the highest accuracies. For them, no single classifier clearly outperforms the others. This is rather surprising since, although it has recently been proven that some of the applied classifiers have clear mathematical similarities [4], it is also clear that some of them are unmistakably different. The most obvious case is the random forest classifier, which binarizes continuous variables by partitioning the feature space and is not constrained by the additivity found in logistic regressions and support vector classifiers. This same result, though, has previously been reported by Khondoker et al. [32] who, in a classification involving patients with Alzheimer's disease and controls, showed that as effect size (i.e. the real discriminative power of the data) increased, different classifiers tended to achieve similar levels of classification accuracy, making the choice of algorithm less relevant. A distribution of observations in the multidimensional feature space largely following an unstructured pattern could be a plausible explanation for our results. Such a distribution with unstructured noise would not be better classified by any complex function than by a hyperplane, which is a geometrical feature that all classifiers are to a large extent able to generate, and this would eventually lead to similar classification accuracies. We have also seen in Fig 9 that, in spite of working differently, classifiers give the largest weights to voxels located in the same or similar areas, extracting and using similar information from the VBM images.
There are also reasonable explanations for the best classifying performance of grey matter VBM. First, while substantial white matter abnormalities have been described in both schizophrenic and bipolar patients through diffusion MRI [33,34], such patterns have not been as clear in the few VBM studies analyzing white matter, at least in schizophrenia [35]. On the other hand, the lower accuracies delivered by region of interest measures are attributable to the intrinsic loss of information caused by spatial averaging of high resolution data. Finally, the poorer performance of both vertex based cortical features may be related to their restricted spatial extent, which excludes all subcortical structures. In addition, major structural abnormalities in schizophrenia and bipolar disorder have been described in the medial frontal cortex and the insulas [35,36], which are regions with high topological complexity.
It should also be noted that the primary objective of this study was the comparison of commonly used classifiers and feature types for classification in psychosis, intentionally leaving many other existing classifiers, subtypes and variants untested. Nor was it of interest to attain particularly high accuracies. Indeed, when performances in our study are compared to those found in other schizophrenia vs. healthy classifications reported in the recent review by Wolfers et al. [7], they are average. The mean accuracy rate found for the bipolar vs. control classification based on the VBM data (63%) is also similar to values reported by the few studies analyzing the same classification on sMRI: 60% [14], 66% [15], 73% [16], with the exception of Bansal et al. [13], who achieved a surprisingly high classification accuracy of 98%. Finally, the only study directly classifying bipolar vs. schizophrenic patients [14] reported a classification accuracy of 88%, which is clearly higher than ours (62%). However, when they applied the fitted classifiers to an external sample, their classification accuracy dropped to 65%.
Yet, most previous studies used smaller sample sizes, sometimes substantially smaller than ours, making their results less reliable. In our study, we have used large and well balanced samples and we have paid special attention to keeping the independence between test and training sets throughout all the image processing steps in order to avoid unintended biases and overoptimistic classification estimates. Also, when running the different classifiers we have noticed the relevance of carefully choosing the range of possible parameter values in the training phase which, if ignored, would lead to clearly suboptimal classification rates. Since our training samples had a (nearly) equal number of individuals, classification rates assumed equal prior probabilities (of 0.5) for all classes. In real situations, though, this equality will sometimes not be met, and when using other priors, accuracy rates will be different from those reported here.
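The dependence on priors can be made concrete with Bayes' rule. The sketch below is an illustration that was not part of the study: the function name is hypothetical, and it assumes that the classifier outputs a calibrated posterior probability for a two-class problem trained under a known class prior.

```python
def adjust_for_prior(p, train_prior=0.5, new_prior=0.5):
    """Re-express a posterior probability under a different class prior.

    p: probability of the positive class given by a model trained with
       a positive-class prior of `train_prior`.
    Returns the probability implied by `new_prior` via Bayes' rule.
    """
    num = p * new_prior / train_prior
    den = num + (1.0 - p) * (1.0 - new_prior) / (1.0 - train_prior)
    return num / den
```

For instance, a probability of 0.6 obtained with balanced training groups corresponds to roughly 0.14 when the positive class makes up only 10% of the target population, so a subject classified as positive under equal priors would no longer exceed the 0.5 threshold.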
Similarities observed between effect sizes and classifier coefficients relate the latter to apparent anatomical divergences, bringing some insight into the way classifiers use information from the images. However, such a relation will hold true only if effect sizes contain patterns of real abnormality. Indeed, for both group pairs involving controls and patients we have found the highest effect sizes in areas, like the insulas and the medial frontal cortex, which have been consistently reported as having grey matter reductions in VBM meta-analyses of both psychotic disorders [35,36]. Still, a close agreement between effect sizes and fitted coefficients should not be expected, as the former simply report univariate between-group dissimilarities while the latter are weights from multivariate predictive models that, in some cases (e.g. random forests), have a very complex nature. Also, in settings with many more features than cases and with high levels of spatial autocorrelation (as occurs in sMRI images), sparse classifiers like the lasso or the L0-norm may lead to an extremely large number of competing models having optimized prediction capabilities [4].
The decrease in classification accuracies observed when using combined features and PCA reduction was unexpected. In contrast to results in [25], merging information from different feature types did not bring any improvement. But unlike Wang et al. [25], who combined different MRI modalities (sMRI and resting state functional MRI), we have derived all features from the same T1 images (expecting higher levels of redundancy between data features). Furthermore, dimensionality reduction through principal components does not seem to have retained the most relevant information, as grey matter VBM clearly provided better classifying accuracies. In contrast, feature selection based on the t statistic has clearly been more successful in retaining the relevant information from the combined features, although grey matter VBM classification rates have not been surpassed by this approach.
Reductions found in multi-class classifications are easily explained by the presence of a third competing class in each classification. Here, the feature space should be divided into three mutually exclusive areas by the algorithm, thus increasing the probability of misclassification. Such an effect is particularly noticeable in the bipolar disorder group, where classification levels do not depart significantly from what would be expected by chance (33%). This is likely due to the fact that, as made evident by the effect size maps of Fig 9, VBM intensities in bipolar patients tend to be located between those observed in controls and in patients with schizophrenia (i.e. patterns of abnormality in bipolar disorder are similar to those in schizophrenia but less intense). Such an intermediate position between two competing classes has probably led to the higher misclassification rates observed in this clinical group. This result seems to be quite consistent, as it has been replicated by the three multi-class schemes applied (the one-vs-one, the one-vs-all and the inbuilt multi-class approach), which have delivered similar correct classification rates.

Fig 12. Classification accuracies generated by multi-class classifiers on grey matter VBM using the one-vs-one approach. All algorithms were used (except the regularized discriminant function analysis, which did not report reliable class probabilities). Overall accuracies are plotted together with accuracies for the three groups separately. Green line: mean accuracy for the 10 test samples; blue lines: approximate 95% confidence intervals for the mean accuracy; red line: highest and lowest accuracy values. ridge: Ridge regression, lasso: Lasso regression, elastic: Elastic net regularization, L0-norm: L0-norm regularization, SVC: Support vector classifier, GPC: Gaussian process classifier, RF: Random forest. https://doi.org/10.1371/journal.pone.0175683.g012

In any case, the lower accuracies observed in bipolar patients have practical implications for sMRI based classification in psychosis. Further lines of research include the optimal combination of the different classifiers to increase the currently reported accuracies. The inclusion of data features derived from other MRI modalities, such as functional connectivity maps or diffusion based measures including fractional anisotropy and mean diffusivity, may also allow higher classification accuracies to be achieved.

Table 4. Mean accuracies obtained by all classifiers (apart from the regularized discriminant function analysis) using a one-vs-one and a one-vs-all multi-class approach on grey matter VBM images. Lower and upper limits for the 95% bootstrap confidence intervals are also reported. 0.333 is the expected accuracy when no real predictive power is present.

Finally, as an added feature of this study, we also provide MRIPredict, a free tool for SPM, FSL and R that allows an easy specification of the MRI datasets, of confounds and covariates, of cross-validation parameters and of voxelwise models to be fit (this software is available at https://www.nitrc.org/projects/mripredict/). MRIPredict applies regularized logistic regression from the Glmnet library [37] and saves the models in MNI space, thus allowing a later application to new scans from other sites, even if they have different voxel dimensions. It must be noted, then, that the accuracy of the new predictions may be limited if the new scans show important methodological differences with the scans used to fit the model.
In summary, from our exhaustive analysis of algorithms and data features we conclude that while grey matter VBM is the feature of choice for sMRI based classification in psychosis, the selection of classifier is not relevant (most have similar performance levels). We also conclude that the combination of different feature types (derived from the same T1 images) does not seem to increase classification accuracies over the classification rates achieved by grey matter VBM. Finally, multi-class classifications considering the three groups simultaneously have made evident a lack of predictive power for the bipolar group. This is probably due to its intermediate anatomical features, located between those observed in healthy controls and those found in patients with schizophrenia. We provide a new software tool that we hope will help many researchers conduct optimized voxelwise predictions.

Table 5. Mean accuracies obtained by classifiers that provide inbuilt multiclass functionality (all but the L0-norm and the support vector classifiers). Lower and upper limits for the 95% bootstrap confidence intervals are also reported. 0.333 is the expected accuracy when no real predictive power is present.