Hierarchical Interactions Model for Predicting Mild Cognitive Impairment (MCI) to Alzheimer's Disease (AD) Conversion

Identifying patients with Mild Cognitive Impairment (MCI) who are likely to convert to dementia has recently attracted increasing attention in Alzheimer's disease (AD) research. An accurate prediction of conversion from MCI to AD can aid clinicians to initiate treatments at early stage and monitor their effectiveness. However, existing prediction systems based on the original biosignatures are not satisfactory. In this paper, we propose to fit the prediction models using pairwise biosignature interactions, thus capturing higher-order relationship among biosignatures. Specifically, we employ hierarchical constraints and sparsity regularization to prune the high-dimensional input features. Based on the significant biosignatures and underlying interactions identified, we build classifiers to predict the conversion probability based on the selected features. We further analyze the underlying interaction effects of different biosignatures based on the so-called stable expectation scores. We have used 293 MCI subjects from Alzheimer's Disease Neuroimaging Initiative (ADNI) database that have MRI measurements at the baseline to evaluate the effectiveness of the proposed method. Our proposed method achieves better classification performance than state-of-the-art methods. Moreover, we discover several significant interactions predictive of MCI-to-AD conversion. These results shed light on improving the prediction performance using interaction features.


Introduction
Alzheimer's disease (AD) currently affects about 5.3 million people in the US. It is the most common type of dementia, accounting for 60{80% of age-related dementia cases [1]. Since the therapeutic intervention is most likely to be beneficial in the early stage of the disease, an earlier and more accurate diagnosis of AD is highly preferred. Mild Cognitive Impairment (MCI), an intermediate cognitive state between normal elderly people and the AD patients [2], has attracted increasing attention, since it offers an opportunity to target the disease status early. Patients with MCI are at high risk of progression to AD, with an estimated annual conversion rate of 10%{15%. If the MCI to AD conversion probability can be accurately estimated, early stage therapies can potentially be introduced to treat or cure the disease. It helps lessen the time and cost of clinical trials. Thus, studies on predicting conversion from MCI to AD have recently attracted considerable attentions [3][4][5][6][7]. Two major research questions are: how to build a model to accurately predict MCI-to-AD conversion? how to identify biosignatures most predictive of the conversion?
However, predicting conversion from MCI to AD is a challenging task, and the prediction performance of existing methods is not satisfactory [5]. Most existing work focus on finding the most predictive biosignatures and they ignore the interactions between different biosignatures. Intuitively, fitting models with interactions can provide more information, and for complex prediction problems traditional additive models are insufficient [8][9][10]. Several recent work explore the underlying interactions between different biosignatures about AD. For example, Wang et al. [11][12][13] demonstrate the power of using the interlink between neuroimaging measures and cognitive scores by their proposed multimodal multitask learning structure. These motivate the study of an interaction model for modeling MCI-to-AD conversion in this paper, aiming to identify the most predictive biosignatures and the underlying relevant interactions simultaneously.
A significant challenge in the use of interactions models lies in the large number of features introduced by considering all pairwise interactions. Consider a dataset with 100 features; including all pairwise interactions leads to thousands of new features. In [14][15][16], only the interactions satisfying hierarchical constraints are allowed to be included in the prediction models, which can be used to prune a large number of interactions. Inspired by Bien et al. [8], we consider the effects of the co-occurrence of different biosignatures for the prediction task. To address the high dimension/small sample size problem, we consider a hierarchical interaction feature selection method, which prunes features by sparsity and hierarchical constraints [8]. Then the selected features and their relevant interactions are fed into classifiers such as support vector machines (SVM) [17] and random forest (RF) [18]. In addition, we employ the stability selection method [19] to identify the most predictive features and underlying interactions.
We have evaluated the hierarchical interactions model using the ADNI dataset. Specifically, we use a set of 293 MCI subjects from ADNI, including 161 MCI non-converters and 132 MCI converters (the conversion was considered over the course of a 4-year follow-up period). Experimental results show that the hierarchical interactions model achieves 74:76% prediction accuracy, much higher than competing methods. The most predictive biosignatures found by our method are consistent with those identified by state-of-the-art methods. Our experiments also identify several significant interactions predictive of the MCI-to-AD conversion.

Background of Interaction Models
In many classification and regression tasks, traditional additive (main effects) models are often used. They assume that the effects of features are additive with different weights. However, these models are insufficient for many prediction tasks, as they fail to consider the co-occurrence effects of different features, called feature interactions. Bien et al. [8] gave two examples to describe different effects of symptoms co-occurrence in medical diagnosis. The first situation is that the co-occurrence of two symptoms may make a doctor confident that a patient has a certain disease whereas the presence of either symptom without the other would provide only a moderate indication of that disease. This situation corresponds to a positive (i.e., synergistic) interaction between symptoms. The other situation is when either one of two symptoms conveys redundant information to the doctor about the patient, thus knowing both provides no more information about the disease status. This situation is called negative interaction, which is also not additive.
Intuitively, fitting models with feature interactions may be more effective for more complex prediction tasks. However, the computation burden also increases with the high-dimensional interactions. Consider a dataset with p measured variables. There are p k interactions of order k. To alleviate the computation burden and control the model complexity, most work only focus on the case of pair-wise interaction models [8,10,20]. Here we employ the pair-wise interaction model proposed by Bien et al. [8]. Consider a regression model for predicting an outcome variable y based on a set of predictors x 1 , . . . ,x p , with pair-wise interactions between these predictors. We assume that: where e*N(0,s 2 ) is an additive noise. The additive part P j b j x j is referred to as the ''main effect'' terms and the quadratic part P j=k H jk x j x k is referred to as the ''interaction'' terms. The goal is to estimate b[R p and H[R p|p , where matrix H is symmetric (H~H T ) and the diagonal entries are zeros.
Based on the above formulation, the interaction of two different measured variables is considered as a new additive term. Such a setting can be regarded as a nonlinear extension of the original linear regression models. Intuitively, however, not all interactions make sense. To overcome the high-dimensionality problem caused by adding interactions, certain constraints are often used to extract significant interactions. The first assumption is called ''hierarchical constraints'', which means ''an interaction can be allowed into the model only when the corresponding main effects are also in the model'' [8,[14][15][16]. There are two types of hierarchical constraints including strong and weak hierarchical constraints: The strong hierarchical constraint requires that an interaction x j x k can have an effect on y(Ĥ H jk =0) only when both of the corresponding variables x j ,x k have effects on y(b b j ,b b k =0). The weak hierarchical constraint requires at least one related variable has an effect on y(b b j =0 orb b k =0). Hierarchical constraints assume that interactions have effects on y when their related variables are predictive. On the other hand, violating hierarchical constraints means an interaction can have effects on y even though the related variables are not predictive at all. As Bien et al. [8] pointed out, this is not true in practical situations. Therefore the hierarchical constraints are often enforced when fitting models with interactions.
Furthermore, as Cox mentioned in [21], due to the incomplete nature of data, it needs to isolate some interactions in a way that depends on the data. One general principle assumes that more significant main effects are more likely to introduce predictive interactions. In addition, the interactions corresponding to larger main effects intuitively have more practical effects on the output. Inspired by this principle, sparse regularizations will be imposed on coefficients b and H to focus on the reliable interactions that have larger main effects.

Hierarchical Interactions Model
Since not all the main effects and interactions contribute to the prediction model, especially when the input data is highdimensional, it is natural to employ sparse regularization to select the most predictive features. Inspired by Bien et al. [8], we apply ' 1 norm regularization on b,H. Then to enforce the hierarchical constraints, some extra constraints are also added into the hierarchical interactions model as done in [8]. The hierarchical interactions model is formulated as follows: where q(b 0 ,b,H) is the loss function and H j denotes the j-th row of H. As our final goal is to predict the conversion from MCI to AD, which is a binary classification task (y[f1,{1g), the logistic regression loss is applied here. Here we represent q(b 0 ,b,H) as the negative log-likelihood logistic binomial form: Here n is the total number of training samples. If all the constraints of Eq.(3) are relaxed, we obtain the All-Pair Lasso model. This model does not consider the hierarchical constraints. Bien et al. [8] pointed out that the symmetry and inequality constraints in Eq.(3) could enforce the solutions to satisfy the Strong Hierarchical constraints. Notice that if H jk =0, then EH j Ew0 and EH k Ew0 (by symmetry of H) and thus Db j Dw0 and Db k Dw0. While the added constraints enforce strong hierarchy, the optimization problem (3) is no longer convex. By replacing b[R p with two nonnegative vectors b z ,b { [R p , the above optimization problem (3) is equivalent to the following optimization problem: for j~1, . . . ,p: Here we can regard b z and b { as the positive and negative parts of b, i.e. b +~m axf+b,0g. Given the above equivalent problem, the feasible solution set of (b z ,b { ) is not convex due to the product constraints in (5), which makes the problem hard to be solved efficiently [8]. Therefore we remove the product constraints in (5), leading to a convex relaxation of (5) and (3), which is given as follows: Here we call it Strong Hierarchical Lasso, which only contains linear constraints and can be solved with Alternating Direction Method of Multipliers [22]. In addition, by simply removing the symmetry constraints on H, the Weak Hierarchical Lasso is obtained: In our work, we mainly focus on the Weak Hierarchical Lasso as a feature selection method to extract potential predictive biosignatures (main effects) and interactions. Because without the symmetry constraints, the unknown parameters in (7) are blockwise separable, which can be efficiently solved by blockwise coordinate descent methods. However, the symmetry constraints in (6) couple all the parameters together, which gives rise to a high computational cost. We employ the accelerated gradient descent method [23,24] to solve (7). In each step, we solve p blockwise sub-problems involving , which can be efficiently solved by an algorithm named ONEROW in [8]. More detailed information about the optimization process can be found in [8].

Framework of the Proposed Method
In the proposed framework, we first employ Weak Hierarchical Lasso with the logistic regression loss (7) to extract significant biosignatures and interactions. Note that sparse dimension reduction methods [25][26][27] can also be used for feature selection and dimension deduction, however, these methods fail to consider the hierarchical constraints of interactions. Next, we use the significant biosignatures and relevant interactions as new features and apply standard classifiers, such as Support Vector Machine and Random Forest.
Our framework is summarized as follows: 1. Data pre-processing: Given input features and outcomes Form the data matrix of the interaction features Z[R n|p(p{1) , which are also normalized in the same way. Here each column of Z corresponds to the dot product of two different feature columns from data matrix X , i.e. Z k((j1{1)(p{1)zj2)~Xkj1 |X kj2 , Vk~1, . . . ,n.
2. Hierarchical interaction model learning: Using the algorithm in [8] to solve (7) with input data X ,Z,Y , and then obtain the optimal solutionsŵ w~( learned in the previous step, we select the most significant features with the largest components. We use d~0:0001 as the thresholding value to select the significant features. Let K m be the index set of main effect features, i.e., k[K m denotes a main effect feature. If Db b z k {b b { k D §d, the main effect feature x k is selected as a significant feature. Similarly, the interaction of main effect features x j x k is selected when DĤ H jk Dwd. The selected interactions obviously satisfy hierarchical constraints. We named the algorithm as the weak hierarchical lasso feature selection (wHLFS) method. 4. Classification: with the selected features as input, involving main effects and interactions, classification methods are employed to build a prediction model.

Stability Selection
To identify the most predictive biosignatures and interactions for our conversion predicting task, we propose to employ stability selection [19] to quantify the importance of features selected by the above hierarchical interaction feature selection method. The stability score (between 0 and 1) of each feature measures the importance of the feature. In this paper, we propose to use the stability selection with weak hierarchical lasso feature selection (wHLFS) to analyze the importance of biosignatures and their interactions. This can potentially reveal how biosignatures and their interactions influence the prediction performance.
The stability selection algorithm with wHLFS is given as follows. Let K m ,K int denote the index of main effect features and interaction features, respectively. Let k[K m be the index of a particular main effect feature, and let (j,k)[K int be the index pair of two main effect features for a particular interaction. D denotes the regularization parameter space and c is the stability interaction is the optimal solution of wHLFS on B (i) . Then the set of selected features, involving main effects and interaction, are respectively denoted as follows: We repeat this process for c times for a specific l, then repeat this procedure for all l[D. Finally we obtain the stability score for every main effect feature k[K m and interaction feature (j,k)[K int : Then the stable features of main effects and interactions are the features with largest S m (k) and S int (j,k), respectively. In our implementation, we present the top 12 main effect features and top 34 interactions features (c~10). Furthermore, we are interested in finding the positive and negative interactions between different biosignatures. We define the stability expectation score for each feature as follows: where sgn( : ) is the sign function (if x §0, sgn(x)~1 otherwise sgn(x)~{1). Thus we can find the top positive or negative interactions among allŜ S ie (j,k). In our experiments, we list top 10 positive and top 10 negative interactions. To better understand biological meanings of negative and positive interactions, we also list the stable expectation scores of the related main effect biosignatures.

Subject Characteristics
We use 293 MCI subjects from ANDI in our study, including 161 MCI non-converters (referred to as negative samples) and 132 MCI converters (referred to as positive samples). We only use a subset of the MCI subjects from ADNI which have MRI scans at the baseline. The conversion was considered over the course of a 4-year time period. The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering, the Food and Drug Administration (FDA), private pharmaceutical companies and nonprofit organizations, as a $60 million dollar, 5-year publicprivate partnership. The participants in ADNI receive serial MRI, PET, other biological markers, and clinical and neuropsychological assessments.
In this study, the MRI features, denoted as 'M', were based on the imaging data from ADNI database processed by the UCSF team. They performed cortical reconstruction and volumetric segmentations with the FreeSurfer image analysis suite (http://surfer.nmr.mgh.harvard.edu/). The processed MRI features can be grouped into 5 categories: average cortical thickness (CTA), standard deviation in cortical thickness (CTStd), the volumes of cortical parcellations (Vol.Cort), the volumes of specific white matter parcellations (Vol.WM), and the total surface area of the cortex (Surf.A). There are 305 MRI features in total. We also use four kinds of features, including demographic and genetic information, baseline cognitive scores, and lab tests. We use 'META' to denote all these four types of features. There are 52 META features in total (shown in Table 1). The baseline cognitive scores are obtained when the patient performs the screening in the hospital for the first time. Here we also consider the performance by using the remaining 22 META features without the baseline cognitive scores, denoted as 'META-22'.

Experimental Setup
We focus on the classification of MCI converters and MCI nonconverters. We use different combinations of features including MRI (M), META (E) (see Table 1) and META without baseline cognitive scores (META-22) to classify the MCI converters and MCI non-converters.
To obtain more reliable results, we randomly split the dataset 10 times. Each time we use 10% samples for testing and use the remaining data for training. Five-fold cross-validation is performed on the training dataset to select best parameters. As we use wHLFS to select features and compare it with Lasso and All-pair Lasso method, the classification methods employed in the following step are identical. The loss function used in these competing feature selection methods is the square loss function. Here we use two well-known classifiers including linear SVM (implemented in lib-linear [28]) and Random Forest (referred to as RF) with default settings. For comparison purpose, we also provide the performance by using SVM with RBF kernel (referred to as SVM-Kernel), which is another type of nonlinear prediction model. Furthermore, we present the performance based on sparse Logistic Regression (referred to LR), which is used in [3]. These two methods use main effect features.
For our wHLFS method, the regularization parameter l is searched across a range : ln[½10,30, where n is the number of training samples. For the competing methods, the parameter l is searched across a range: l[½0:05,0:8. The relative performances of different methods are evaluated using metrics of accuracy, sensitivity and specificity, defined as follows:

|100%
In the statistical tests, the sensitivity measures the proportion of true positives that are correctly identified; specificity measures the proportion of true negatives that correctly identified. The accuracy measures the overall correct classification rate.
We also conducted a series of paired t-tests to demonstrate the significance of the experimental results. Since we randomly split our dataset into 10 folds and use 9 folds for training and the remaining fold for testing, there are 10 different training and testing splits. We select 16 different values of ln from the parameter space ln[½10,30. Therefore, we obtain 160 groups of classification results from different combinations of methods and input feature datasets. We performed paired t-tests on these groups of classification results.

Results and Discussion
Classification Performance Table 2 summarizes the classification results using all META and MRI features. As we partition all the samples into 10 folds and perform the training/testing process 10 times, we also present the standard deviation. Among all the feature selection methods and classification methods, wHLFS combined with random forest achieves the best performance with an accuracy of 74:76%. We also report the results when no feature selection method is used. Based on the original main effect features, sparse Logistic Regression gives relatively good results. Although SVM-kernel is also a nonlinear method, its performance is not very competitive. As shown in Table 2, the performance of linear SVM and RF can be improved by feature selection using Lasso. In addition, we can observe from Table 2 that directly adding high-dimensional interactions without imposing any hierarchical constraints does not achieve satisfactory results. Although using All-pair Lasso to select features can slightly improve the classification performance when building models with interactions, the final classification results are even worse than the ones obtained without including interaction features. This may be due to the ''curse of dimensionality'' [29]. The wHLFS method can greatly improve the performance.
We also conduct a series of paired t-test for comparing the performance of wHLFS+RF and the other combinations of methods. Here we consider all combinations of methods in each stage according to Table 2, e.g. SVM, RF, LR and SVM-Kernel methods with original main effect features, Lasso, Lasso+RF and Lasso+SVM with the selected main effect features, SVM and RF with all the main effect and interaction features, All-Pair Lasso method and wHLFS method. Fixing the training and testing datasets and the setting of parameters, we obtain the classification performance from different combinations of methods in terms of accuracy. Recall that we have 160 groups of classification results from multiple combinations of methods with different training and testing datasets and different settings of parameters. Therefore, a series of paired t-test are performed on the results of wHLFS+RF and the results of a competing method. The null hypothesis is that the classification accuracy of wHLFS+RF is not higher than that of the competing method. The results of hypothesis testing are given in Table 3. The results demonstrate the significant improvement of our proposed method over competing methods. Note that wHLFS+RF is not significantly better than wHLFS+SVM (pvalue is around 0:05). This is consistent with the results given in Table 2, where wHLFS+SVM also provides better results than other methods except for wHLFS+RF. These results demonstrate the promise of wHLFS method for building accurate prediction models. Note that the relative improvements in Tables 2 and 3 are different, since the results in Table 2 are obtained via parameters tuning using cross-validation, while the results in Table 3 are not based on cross validation.

Effects of Different Input Datasets
In Table 4, we present the performances of different input feature datasets with wHLFS+classification methods. The input feature datasets used here include META (E), MRI (M), META without baseline cognitive scores combined with MRI (META-22+M), and META combined with MRI (E+M). As shown in Table 2, META features are more effective than MRI features for our specific tasks. Comparing the results with different input features, we can find that the baseline cognitive scores are the most effective features for predicting MCI-to-AD conversion, which is consistent with the results in [3]. Hypothesis testing is also conducted to demonstrate the effectiveness by using META combined with MRI (E+M) feature datasets. There are 160 groups of results over accuracy for hypothesis testing. The null hypothesis is that the classification performance using E+M is no better than that using a different feature combination in terms of accuracy. As shown in Table 5, wHLFS+RF and wHLFS+SVM achieve significant improvements by using the E+M feature combination.

Effects of l in wHLFS
We illustrate the effect of l in the wHLFS method in Figure 1. The META and MRI datasets are used in this experiment. The leave-one-out results are reported when we use different l values (ln[½10,30, n is the fixed number of training samples). From Figure 1 we can observe that the best choices of ln is 15, with RF as the classification method. In this case, the number of selected main effect features is around 14, and the number of selected interaction features is about 22. We can also observe that different classification models achieve the best performance at different values of ln. When ln increases, the number of selected features will monotonically decrease (as shown in Figures 2,3). We can observe that the performance of different methods decreases with a larger lnw24 (as shown in Figure 1), since there are very few features selected for the final classification models. We can also observe from Figure 3 that the number of selected interactions by wHLFS is small, which demonstrates the effectiveness of wHLFS method in pruning the high-dimensional input features. Moreover, the selected interactions lead to a good classification performance.

Stability Selection of Main Effects and Interaction Features
In this experiment we evaluate the feature selection results of our wHLFS method. Here we employ the stability selection method on the input feature dataset META (E)+MRI (M). The parameter searching space is D : fln[½10,30g. We present the        most stable main effect features with stable scores 1 in Table 6, which includes 12 stable main effect features. The baseline information of the 293 MCI subjects by the diagnostic group (e.g. MCI Converters and MCI Non-converters) on these stable biosignatures is also summarized in Table 6. There are significant between-group differences in these biosignatures. Both  [3]. Moreover, we can observe that almost all the significant stable features in the META dataset are the baseline cognitive scores. Figure 4 shows the stability results of interaction features. Here we list the top 34 stable interactions (the names of top 10 interactions are detailed in Table 7). From Figure 4, we can observe that many significant stable interactions are between different datasets, such as Next, we examine the stable expectation scores of every interaction feature, which illustrate the negative or positive effects of the interactions. In Figure 5, we list top 10 negative and top 10 positive interactions. A positive stable expectation score means the interaction has a positive effect on the outputs y~1, while a negative value means a negative effect. An interaction that has a   larger absolute value of stable expectation score is of more practical importance. Figure 6 gives the stable expectation scores of related biosignatures for the significant interactions. From Figure 5, we can observe that the stable interaction between ''ADAS-sub1'' and ''APOE'' is negative, while the stable expectation scores of these two biosignatures are positive (shown in Figure 6). This means that higher values of these two biosignatures lead to higher conversion probability from MCI to AD. Positive effects of biosignatures but negative effects of their interaction mean either of these two biosignatures conveys abundant information about the conversion probability, so knowing both may not provide additional information. The negative component of their interaction reduces the additive effects of these two predictive biosignatures. On the other hand, from   effect. The co-occurrence reveals more information, while either one only gives a moderate indication. The above two kinds of interactions overcome the drawbacks of traditional additive models, leading to better performances. Furthermore, finding the underlying useful interactions sheds light on improving the prediction performance with more predictive features. We expect this to be a promising approach for other difficult disease prediction tasks.

Conclusion
In this paper we study the effectiveness of hierarchical interaction models for predicting the conversion from MCI to probable AD and identifying a small subset of most predictive biosignatures and relevant interactions. We employ a weak hierarchical interaction feature selection method to select a small set of most predictive biosignatures and interactions. We also propose to use the stable expectation scores of interactions and their related biosignatures to analyze the negative and positive interaction effects. This may provide useful information for clinicians and researchers to find the significant interaction effects of different biosignatures. Our approach sheds light on how to improve the MCI-to-AD prediction performance using biosignature interactions.
In this study, we focus on weak hierarchical interaction model. We plan to study the strong hierarchical interaction model in the future. In addition, further analysis is needed to provide deeper biological interpretations of the biosignature interactions. We also plan to examine the effectiveness of the hierarchical interaction model on predicting tasks of other common comorbidities, such as cardiovascular risk factors disease and depression, family history of dementia, prior head trauma etc.