Prediction of clinical depression scores and detection of changes in whole-brain using resting-state functional MRI data with partial least squares regression

In diagnostic applications of statistical machine learning methods to brain imaging data, common problems include data high-dimensionality and co-linearity, which often cause over-fitting and instability. To overcome these problems, we applied partial least squares (PLS) regression to resting-state functional magnetic resonance imaging (rs-fMRI) data, creating a low-dimensional representation that relates symptoms to brain activity and that predicts clinical measures. Our experimental results, based upon data from clinically depressed patients and healthy controls, demonstrated that PLS and its kernel variants provided significantly better prediction of clinical measures than ordinary linear regression. Subsequent classification using predicted clinical scores distinguished depressed patients from healthy controls with 80% accuracy. Moreover, loading vectors for latent variables enabled us to identify brain regions relevant to depression, including the default mode network, the right superior frontal gyrus, and the superior motor area.


Introduction
Advances in analyzing large datasets with machine learning algorithms promote their application in medical diagnosis. In particular, their use in objective diagnosis of psychiatric disorders from brain imaging and other biological data is now being actively studied [1]. A major challenge in applying statistical machine learning algorithms to brain imaging or genetic data is the high dimensionality of the input variables, such as the number of voxels or the number of possible genetic polymorphisms. Even though algorithms such as the support vector machine (SVM) and L1-regularized classifiers (LASSO) manage the issue of high dimensionality, the problem of co-linearity in brain imaging data remains: neural activities in nearby voxels or in the same functional network are highly correlated, which makes the results of such commonly used methods unstable.

Subjects
58 patients (ages 26-73, mean 42.8 ± 11.9, 33 female) with major depressive disorder were recruited by the Psychiatry Department of Hiroshima University and collaborating medical institutions, based on the Mini-International Neuropsychiatric Interview (M.I.N.I. [21]), which enables doctors to identify psychiatric disorders according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV [22]). As a healthy control group, 65 subjects (ages 20-66, mean 34.8 ± 13.0, 28 female) with no history of mental or neurological disease were recruited via advertisements in local newspapers.

Clinical measures
The following interview- and questionnaire-based measures were used to determine the presence of disease and to quantify the severity of the two primary symptoms we wish to predict, namely anhedonia (loss of motivation, loss of pleasure, etc.) and negative mood (low mood, guilty feelings, suicidal thoughts, etc.).

Beck Depression Inventory II (BDI-II).
This measure evaluates the presence and severity of depression based on a self-report questionnaire [15]. Subjects answer 21 questions about feelings of punishment or guilt, suicidal thoughts, etc. Each answer is scored between 0 and 3, with 3 being the most serious; high scores indicate severe symptoms. A standardized score of 14 or higher indicates that a subject is suffering from depression.
Snaith-Hamilton Pleasure Scale (SHAPS).
This measure was developed to evaluate the level of anhedonia [17]. Subjects answer 14 questions about hedonic capacity, with scores between 1 and 4. High scores indicate more severe anhedonia.

Positive and Negative Affect Schedule (PANAS).
This widely used measure evaluates positive and negative moods of subjects [18,23]. In this study, we considered only the scores related to negative-mood items, a measure generally known as PANAS(n). Subjects respond to 10 questions about their moods, with answers between 0 and 5; the sum of all scores indicates the strength of their negative mood. Due to an evaluation issue, one subject's response could not be assessed, so that score was replaced with the mean of the remaining subjects.

Table 1 summarizes the scores for each measure in each group in our study [20]. Although most patients showed both anhedonia and negative mood, some exhibited only one trait. Correspondingly, the BDI-II, SHAPS, and PANAS(n) scores are highly, but not completely, correlated. Since mental function declines with aging, subject age is expected to correlate with BDI-II, SHAPS, and PANAS(n) as well.
We verified these correlations by calculating correlation coefficients (Table 2) [20]. Strong correlations between the clinical measures are reflected in coefficients above 0.7; weaker correlations between age and the individual clinical measures were around 0.3. In our regression model, the BDI-II, SHAPS, PANAS(n), and age of each subject are treated jointly as responses in order to account for their natural correlations. We will show that introducing subject age as an output, rather than as an input, is beneficial with respect to classification accuracy.

Functional connectivity of resting-state fMRI
Functional MRI measurements were acquired on a 3T GE Signa HDx scanner with a 2D EP/GR sequence (TR = 3 s, TE = 27 ms, FA = 90 deg, matrix size 64x64x32, voxel size 4x4x4 mm, no gap, interleaved). Subjects were instructed to lie with their eyes open, to think of nothing in particular, and to remain awake. They were also instructed to refrain from caffeine, nicotine, and alcohol on the day of the experiment. For each subject, acquired images were processed with SPM8 (Wellcome Trust Centre for Neuroimaging, UCL, London) following standard procedures: slice timing correction, motion correction, co-registration to the anatomical MRI, normalization to a standard brain, and smoothing with a Gaussian kernel of 8 mm full-width at half-maximum. We confirmed that there were no significant differences in the six motion parameters between the two diagnostic groups, ruling out spurious functional connectivity due to head motion [24,25]. Voxels were assigned to 116 brain regions according to the automatic anatomical labeling (AAL) atlas [26]. The mean activation time series of each brain region was obtained by averaging the MRI signal time series over all voxels assigned to that region. Finally, functional connectivity between each pair of regions was computed as the cross-correlation of the corresponding time series.
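As an illustrative sketch (not the authors' code), the region-averaging and correlation steps could look as follows; the function name and input layout are hypothetical:

```python
import numpy as np

def functional_connectivity(ts, labels, n_regions=116):
    """Compute region-wise functional connectivity from voxel time series.

    ts      : (n_voxels, n_timepoints) array of preprocessed BOLD signals
    labels  : (n_voxels,) array of AAL region indices in 1..n_regions
    Returns the vectorized upper triangle of the (n_regions x n_regions)
    cross-correlation matrix, i.e. 116*115/2 = 6670 connections per subject
    for the standard AAL atlas.
    """
    # mean activation time series for each region
    region_ts = np.stack([ts[labels == r].mean(axis=0)
                          for r in range(1, n_regions + 1)])
    # functional connectivity = cross-correlation between region pairs
    fc = np.corrcoef(region_ts)
    iu = np.triu_indices(n_regions, k=1)
    return fc[iu]
```

Vectorizing the upper triangle gives the 6670-dimensional predictor vector per subject that makes the co-linearity problem concrete.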

Methods
Partial least squares (PLS) regression is a method for modeling the relationship between two sets of multivariate data via a latent space and for performing least squares regression in that space. PLS can handle high-dimensional, co-linear datasets because of its underlying assumption that the two datasets are generated by a small number of latent components, which are formed by maximizing the covariance between the two datasets.

Partial Least Squares Regression (PLS)
PLS models a linear relationship between two blocks of variables, $\{x_i\}_{i=1}^n \subset \mathbb{R}^p$ and $\{y_i\}_{i=1}^n \subset \mathbb{R}^q$. In the following, $X = (x_1, \ldots, x_n)^T$ denotes the (n × p) predictor matrix and $Y = (y_1, \ldots, y_n)^T$ the (n × q) response matrix. The procedure obtains L latent components, $\{t_i\}_{i=1}^L$ and $\{u_i\}_{i=1}^L$, and assumes the decomposition

$X = T P^T + F_x, \qquad Y = U Q^T + F_y,$

where $T = (t_1, \ldots, t_L)$ and $U = (u_1, \ldots, u_L)$ are the (n × L) matrices of latent components corresponding to X and Y, respectively, the (p × L) matrix P and the (q × L) matrix Q are loadings, and the (n × p) matrix $F_x$ and the (n × q) matrix $F_y$ are matrices of residuals. Since our objective is to perform least squares regression in a low-dimensional latent space, the underlying assumption is that the latent component $u_i$ can be well predicted from $t_i$ through a relation such as

$U = T D,$

where D is an (L × L) diagonal matrix. To satisfy this assumption, we maximize the covariance between $t_i$ and $u_i$; our objective criterion is

$\max_{\|w\| = \|c\| = 1} \operatorname{cov}(Xw, Yc),$

where $w \in \mathbb{R}^p$ and $c \in \mathbb{R}^q$ are weight vectors for projection onto the latent components. After extracting each latent component, the observation matrices X and Y are deflated by subtracting their rank-one approximations. It is important to stress the asymmetry of the scheme, i.e., that Y is deflated based on t, in the case of regression. Repeating the above procedure L times yields the weight matrices $W = (w_1, \ldots, w_L)$ and $C = (c_1, \ldots, c_L)$.
Finally, the relation in the original data space is expressed as

$Y = X B + E,$

where B is the (p × q) matrix of regression coefficients and E is the (n × q) matrix of residuals. Plugging the relationship $B = W (P^T W)^{-1} C^T$ [27,28] into Eq (5), we obtain a different representation of the prediction, $\hat{Y} = X W (P^T W)^{-1} C^T$. The final transformation follows from equalities derived in [29]. Note that $t_i^T t_j = \delta_{ij}$ (the Kronecker delta, taking the value 1 for i = j and 0 for i ≠ j) as a consequence of the algorithm.
In general, B is obtained from a centered training dataset. The response $y_{new}$ for a new subject $x_{new}$, referred to as the test dataset, is then estimated as

$\hat{y}_{new} = \bar{y} + B^T (x_{new} - \bar{x}),$

where $\bar{x}$ and $\bar{y}$ represent the mean predictor and mean response in the training dataset, respectively. A schematic outline of PLS is given in Fig 1 and S1 Appendix.

Kernel Partial Least Squares Regression (KPLS)
Linear PLS is easily extended to nonlinear regression using the kernel trick [28,30].
Let $\phi : \mathbb{R}^p \to \mathcal{H}$ be a nonlinear transformation of the predictor $x \in \mathbb{R}^p$ into a feature vector $\phi(x) \in \mathcal{H}$, where $\mathcal{H}$ is a high-dimensional feature space. Define the Gram matrix K of inner products of points in feature space, i.e., $K = \Phi \Phi^T$, where $\Phi = (\phi(x_1), \ldots, \phi(x_n))^T$ represents the predictor matrix in feature space. In general, the number of columns of Φ is so large that we cannot carry out the same procedure as in the linear case using the explicit form of Φ. Thanks to the kernel trick, however, the explicit form of Φ becomes unnecessary.
The deflation procedure is performed as

$K \leftarrow (I_n - t t^T) K (I_n - t t^T),$

where $I_n$ represents the n-dimensional identity matrix. The prediction on the training data is obtained from $\hat{Y} = K U (T^T K U)^{-1} T^T Y$. To exclude the bias term, we assume that the responses and the predictors have zero mean in the feature space, which is achieved by applying the following centering to the test kernel $K_t$ and the training kernel K [31]:

$K \leftarrow \left(I_n - \tfrac{1}{n} 1_n 1_n^T\right) K \left(I_n - \tfrac{1}{n} 1_n 1_n^T\right),$

$K_t \leftarrow \left(K_t - \tfrac{1}{n} 1_{n_t} 1_n^T K\right) \left(I_n - \tfrac{1}{n} 1_n 1_n^T\right),$

where $1_n$ represents the n-length vector whose elements are all 1, and n and $n_t$ represent the numbers of training and test samples, respectively.
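The kernel centering can be written directly from the formulas above; this is a minimal sketch, with `center_kernels` a hypothetical helper name:

```python
import numpy as np

def center_kernels(K, K_t):
    """Center a training kernel K (n x n) and a test kernel K_t (n_t x n)
    in feature space, as required before kernel PLS regression."""
    n = K.shape[0]
    n_t = K_t.shape[0]
    one_n = np.ones((n, 1))
    C = np.eye(n) - one_n @ one_n.T / n          # centering matrix I - (1/n) 1 1^T
    K_c = C @ K @ C
    # test kernel: subtract the training-feature mean on both sides
    K_t_c = (K_t - np.ones((n_t, 1)) @ one_n.T @ K / n) @ C
    return K_c, K_t_c
```

For a linear kernel this reproduces exactly the Gram matrices of explicitly mean-subtracted features, which is a convenient way to verify the formulas.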
In the following sections of this paper, we investigate three kernel functions: 1) a second-order polynomial kernel, referred to as KPLS-Poly(2); 2) a third-order polynomial kernel, referred to as KPLS-Poly(3); and 3) a Gaussian kernel $k(x, x') = \exp(-\gamma \|x - x'\|^2)$, referred to as KPLS-Gauss, where γ is a hyperparameter set to the inverse of the median Euclidean distance between data points.

Classification
In addition to predicting clinical measures, our aim is to classify subjects into depressed patients and healthy controls using the predicted clinical measures, for objective diagnosis. We evaluate the generalization of binary classifiers based on linear discriminant analysis (LDA). We are given training data $D_{tr} = \{x_{tr}, y_{tr}, z_{tr}\}$ and test data $D_{te} = \{x_{te}, y_{te}, z_{te}\}$, where $x \in \mathbb{R}^p$, $y \in \mathbb{R}^q$, and $z \in \{0, 1\}$ represent functional connectivity (predictors), clinical measures (responses), and binary labels (0 for patients, 1 for healthy controls), respectively. In the prediction phase, our objective is to learn a function $f_B : \mathbb{R}^p \to \mathbb{R}^q$ that, given predictors $x_{tr}$ and responses $y_{tr}$, assigns predictors to the most probable values of y; the prediction on the training dataset is $\hat{y}_{tr} = f_B(x_{tr})$. In the subsequent classification phase, our objective is to learn a classifier $g_w : \mathbb{R}^q \to \{0, 1\}$ that, given predicted responses $\hat{y}_{tr}$ and binary labels $z_{tr}$, assigns predicted responses to the most probable labels. Labels on the test dataset are obtained as $\hat{z}_{te} = g_w(\hat{y}_{te}) = g_w(f_B(x_{te}))$. It is important to stress that the binary classifier is trained not on the actual clinical measures $y_{tr}$ but on the predicted values $\hat{y}_{tr}$.
In a previous study [13], the authors identified only a binary classifier $g'_w : \mathbb{R}^p \to \{0, 1\}$ that, given functional connectivity $x_{tr}$ and binary labels $z_{tr}$, assigns functional connectivity directly to binary labels. By exploiting the predicted clinical measures, it may be possible to improve classification performance. We compared two scenarios: i) classification of patients and healthy controls using LDA on clinical measures predicted with KPLS (KPLS-Gauss, KPLS-Poly(3), or KPLS-Poly(2)), PLS, or ordinary least squares regression (OLS); ii) classification of patients and healthy controls by means of LDA or SVM applied directly to functional connectivity. Note that before scenario ii) we perform feature selection by computing connection-wise t-tests to identify connections with different group means, represented by t-scores; we select the M functional connections with the highest absolute t-scores, with M optimized by cross-validation.

Pre-screening
Even though PLS can cope with high-dimensional, co-linear datasets, we pre-screened variables according to their relevance to the responses, as follows.
Based on the Pearson correlation coefficients $\rho_{rl}$ between the r-th functional connection and the l-th clinical measure, we define the empirical relevance $R_r$ of the r-th functional connection, where p is the total number of functional connections. Functional connections are ranked according to their empirical relevance $\{R_r\}_{r=1}^p$, and only the M most relevant connections are used in the subsequent procedure. The optimal value of M was determined through nested leave-one-out cross-validation.
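A sketch of the pre-screening step, assuming the relevance $R_r$ aggregates the absolute correlations $|\rho_{rl}|$ across the q clinical measures (the mean is used here as an illustrative choice; the paper's exact definition may differ):

```python
import numpy as np

def prescreen(X, Y, M):
    """Rank functional connections by empirical relevance to the responses.

    X : (n, p) connectivity matrix; Y : (n, q) clinical measures.
    Relevance of connection r aggregates |rho_rl| over the q measures;
    the mean is an illustrative choice of aggregation here.
    Returns the indices of the M most relevant connections.
    """
    n, p = X.shape
    Xc = (X - X.mean(0)) / X.std(0)
    Yc = (Y - Y.mean(0)) / Y.std(0)
    rho = Xc.T @ Yc / n                  # (p, q) Pearson correlations
    R = np.abs(rho).mean(axis=1)         # empirical relevance per connection
    return np.argsort(R)[::-1][:M]
```

Note that to avoid information leaking into the validation samples, this ranking must be recomputed inside each cross-validation fold, as described in the next section.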

Nested leave-one-out cross validation
Conventionally, cross-validation is employed to assess the generalization ability of a model or to select optimal parameters. Since we must account for both generalization ability and parameter optimization, we used nested leave-one-out cross-validation (LOOCV), consisting of an outer and an inner LOOCV. The outer LOOCV iteratively splits the full sample set into a single outer validation sample, used to evaluate generalization ability, and an outer training set, used for model estimation. The inner LOOCV is performed on the outer training set to optimize two parameters: M, the number of selected predictor variables, and L, the number of latent components. The parameter pair achieving the lowest root mean squared error on the inner validation samples is adopted and used to evaluate the model in the outer LOOCV. These steps are repeated until each sample has served as the validation sample.

Age
Age is significantly correlated with all three clinical measures (Table 2). In general, age-matching different diagnostic groups reduces the sample size, degrading performance. To avoid this problem, we investigated three models: (i) a model with age as a response (denoted output-age), (ii) a model with age as a predictor (denoted input-age), and (iii) a model without age (denoted no-age). By incorporating age into the model, we can cope with age differences among subjects and evaluate prediction performance fairly.

Interpretation
Interpretation of each latent component projected from input and output data gives novel insights into the relationship between functional connectivity and clinical measures. In the framework of PLS, loading matrices, P and C, indicate contributions from predictor variables and response variables to each latent component (see Eqs (10) and (11)). The (i, j)-element of the loading matrix, P, represents the contribution of the i-th functional connection to the j-th latent component. Similarly, the (i, j)-element of the loading matrix, C, represents the contribution of the i-th clinical measure to the j-th latent component. Note that due to subject variability, values of P ij and C ij vary depending on the training set used.

Regression performance
We compared the prediction performance of PLS, its kernel variants, and other methods by means of the root mean squared error (RMSE) of the predicted clinical measures under nested leave-one-out cross-validation (see Methods). Kernel PLS with a second-order polynomial kernel (KPLS-Poly(2)) achieved the lowest RMSE (9.56 for BDI-II, 6.11 for SHAPS, and 7.29 for PANAS(n)) (Fig 2). This performance was significantly better than that of ordinary least squares regression (OLS) (11.6 for BDI-II, 7.33 for SHAPS, and 8.91 for PANAS(n)) and comparable to that of the other PLS variants applied in our study, suggesting that projecting the data into a low-dimensional space benefited regression performance. All statistical comparisons were adjusted for multiplicity using the Bonferroni-Holm method at significance level α = 0.05.
Next, to evaluate the best way of incorporating age into our regression models, we compared the RMSE of the output-age, input-age, and no-age models. Incorporating age as a response (output-age) achieved significantly better performance than the input-age and no-age models (Fig 3). Details are listed in the Supporting Information (S1, S2 and S3 Tables). All statistical comparisons were adjusted for multiplicity using the Bonferroni-Holm method at significance level α = 0.05.
The correlation coefficients between actual and predicted values of BDI-II, SHAPS, and PANAS(n) for KPLS-Poly(2) were r = 0.541, 0.591, and 0.563, respectively. The relationship between predicted and actual BDI-II values for KPLS-Poly(2) is shown in Fig 4. This result is comparable to that of [14]; however, the number of subjects in our study was larger than in theirs, reinforcing the validity of the results.
The optimal number of retained features, M* (identified by pre-screening), and the optimal number of latent components, L* (identified by nested LOOCV), were 40 and 3, respectively, suggesting that reducing the feature size was relevant to improving PLS prediction accuracy.

Classification performance
Projecting the original data onto a low-dimensional space was expected to improve classification accuracy. To verify the benefit of projection, several classification methods were performed and evaluated in terms of accuracy, sensitivity, and specificity (Fig 5). Details are listed in the Supporting Information (S4 and S5 Tables). KPLS-Poly(2) followed by LDA achieved the best accuracy, 80.5% (sensitivity 81.0%, specificity 80.0%), significantly better than the 57.7% accuracy of direct LDA (sensitivity 53.4%, specificity 61.5%) and the 69.1% accuracy of direct SVM (sensitivity 69.0%, specificity 69.2%). This indicates that exploiting the prediction model for clinical measures was beneficial for building the classification model. In addition, KPLS-Poly(2) followed by LDA achieved significantly better accuracy than the 62.6% accuracy of OLS followed by LDA (sensitivity 62.1%, specificity 63.1%), indicating that modeling a latent space in the regression was beneficial to the final classification. Accuracy did not differ significantly between PLS and its kernel variants. All statistical tests were based on a normal approximation and adjusted for multiplicity using the Bonferroni-Holm method at significance level α = 0.05.

Interpretation
In our study, the three clinical scores had almost equally positive influences on the first component, and age also had a positive influence. In contrast, age had a strong negative influence on the second component, unlike the clinical scores (Fig 6).
Latent space representation of subjects showed that the first component explained most depression severity in comparison with the second component (Fig 7). This is consistent with the results of loading matrix C. Note that since the optimal number of latent components, in terms of minimizing regression error, was 3, the second and the third components are thought to contain some information about scores.
To validate the effect of age, especially on the second component, all subjects were grouped into young (ages 20-31, 41 subjects), middle (ages 31-43, 41 subjects), and old (ages 44-73, 41 subjects) groups. Note that we simply divided the subjects into three equal-sized groups for convenience; "young", "middle", and "old" are relative, not absolute, age classes. Latent variables of old subjects on the second component were significantly lower than those of young and middle subjects (p < 10^{-5}, Wilcoxon rank-sum test), suggesting that old patients have distinctive patterns in the second latent dimension [5] (Fig 8).
Evaluation of the loading matrix P reveals the functional connections relevant to each latent component. In particular, the first column of P, corresponding to the first component responsible for discriminating the diagnostic groups, is expected to yield useful insights into the effect of functional connections on depression symptoms. Although the performance of KPLS-Poly(2) in prediction and classification was comparable to or better than that of linear PLS, the patterns of significant loadings were consistent across our experiments. For ease of interpretation, we therefore focus on the loading matrix of the linear model in the following sections.
BrainNet Viewer [32] (http://www.nitrc.org/projects/bnv/) was used to visualize the top 10 connections with positive and negative loadings on the first component (Figs 9 and 10). In these figures, many regions of the default mode network (DMN), as well as the left supplementary motor area, the right superior frontal gyrus, and the insula, were relevant. In addition, some functional connections between the right cuneus and regions of the cerebellum were negatively correlated with the first component.

McIntosh et al. (1996) first introduced PLS analysis into the field of neuroimaging, to extract information common to brain activity and exogenous information such as experimental or behavioral data [3,4]. In particular, behavioral data are increasingly used to extract associated brain activity patterns in various disorders, such as Alzheimer's disease [33], obsessive-compulsive disorder [34], and schizophrenia [35]. In these studies, neuropsychological test scores are used as behavioral data, in addition to labels representing diagnostic groups and age. To the best of our knowledge, this is the first study to investigate associations between whole-brain functional connectivity and multiple clinical measures in depressed patients using PLS and its kernel variants.

Discussion
Diagnosis based on resting-state functional connectivity is a challenging task due to the high dimensionality and co-linearity of the data. Recent studies have demonstrated that depressed patients can be distinguished from healthy controls on the basis of their functional connectivity using conventional methods such as the support vector machine [9,13]. Since binary labels are highly abstracted information about depression that ignores the severity of symptoms, it is worth considering more detailed information, such as BDI-II, SHAPS, and PANAS(n), to build more sophisticated models. Our study demonstrated that projecting functional connectivity data into a low-dimensional latent space can predict clinical measures and can also improve diagnostic accuracy for depression.
Separately identifying the neural circuits associated with anhedonia and negative mood is a challenging task. A psychopathological study suggests that these primary symptoms arise from different neural circuits and from alterations of different neurotransmitters [16]. Our results show that SHAPS and PANAS(n) are highly correlated and contributed quite similarly to each latent component (Fig 6), suggesting that further investigation and different approaches may be required to support psychopathological studies from the standpoint of data-driven analysis.
We also evaluated an extended AAL atlas, generated by subdividing the standard AAL into 600 regions, to examine whether a finer atlas could improve prediction of the clinical scores [36]. The performance was significantly worse than with the standard AAL (S6 Table), and the selected functional connections were inconsistent. Since analysis of brain imaging data with a limited sample size depends heavily on the choice of ROIs, a finer atlas does not necessarily provide better prediction performance. It is therefore fair to note that further research is required to determine the best atlas.

Contributing brain regions
Identification of relevant brain regions in the functional connectivity analysis yielded three observations. First, connections between the default mode network and other regions, such as the right superior frontal gyrus and the left supplementary motor area, are relevant. The superior frontal gyrus, a critical region in cognitive tasks, was previously reported to be associated with depression [41]. While the supplementary motor area is known to be responsible for motor control, it is also reportedly related to some subtypes of depression [42]. Our results support these reports.
Second, our results suggest that the insula is associated with depression. Meta-analyses of PET and fMRI studies revealed that the insula plays an important role in the regulation of emotion [43,44]. Similarly, resting-state fMRI studies have indicated that the insula is directly associated with depression [45,46].
Finally, we showed that three connections between the right cuneus, located in the visual cortical area, and the cerebellum negatively influence depression. While visual processing is generally believed not to be affected in depression, some previous studies have suggested that it is associated with bipolar disorder [47]. Moreover, regional homogeneity (ReHo), interpreted as a measure of localized synchrony in resting-state fMRI, was reportedly decreased in this area [48]. The cerebellum is usually considered responsible for motor control, but our results indicate that it may also be involved in the regulation of mood and in cognitive processing associated with symptoms of depression. Some fMRI studies demonstrate that this area is responsible for various types of information processing [49,50], and resting-state functional connectivity studies imply that the cerebellum may be critical for distinguishing depressed patients from healthy controls [13,51].

Conclusion
In summary, we employed partial least squares regression and its kernel variants to predict clinical measures of subjects using resting-state functional connectivity. Diagnosis of depression based on predicted clinical scores performed better than classification algorithms attempting diagnoses directly from functional connectivity. Moreover, analysis of latent variables identified functional networks relevant to the diagnosis of depression. These results suggest that a low-dimensional representation derived using PLS is beneficial for objective diagnosis. Further investigations are required to separate the two neural circuits associated with two primary symptoms, anhedonia and negative mood.
Supporting information
S1 Appendix. Partial least squares regression. (PDF)
S4 Table. Classification performance. KPLS-Poly(2) followed by LDA significantly outperformed direct LDA, direct SVM, and OLS followed by LDA in accuracy (adjusted for multiplicity using the Bonferroni-Holm method with significance level α = 0.05). (PDF)
S5 Table. Classification performance without subjects over 60. KPLS-Poly(2) followed by LDA significantly outperformed direct LDA and OLS followed by LDA in accuracy (adjusted for multiplicity using the Bonferroni-Holm method with significance level α = 0.05). (PDF)
S6 Table. Root mean squared errors with extended AAL. Performance with the standard AAL significantly outperformed the extended AAL (adjusted for multiplicity using the Bonferroni-Holm method with significance level α = 0.05). (PDF)