Machine learning based multi-modal prediction of future decline toward Alzheimer’s disease: An empirical study

  • Batuhan K. Karaman,

    Roles Conceptualization, Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft

    Affiliations School of Electrical and Computer Engineering, Cornell University and Cornell Tech, New York, NY, United States of America, Department of Radiology, Weill Cornell Medicine, New York, NY, United States of America

  • Elizabeth C. Mormino,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, United States of America

  • Mert R. Sabuncu,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Resources, Writing – original draft

    Affiliations School of Electrical and Computer Engineering, Cornell University and Cornell Tech, New York, NY, United States of America, Department of Radiology, Weill Cornell Medicine, New York, NY, United States of America

  • for the Alzheimer’s Disease Neuroimaging Initiative

    Membership of the Alzheimer’s Disease Neuroimaging Initiative is provided in the Acknowledgments.


Alzheimer’s disease (AD) is a neurodegenerative condition that progresses over decades. Early detection of individuals at high risk of future progression toward AD is likely to be of critical significance for the successful treatment and/or prevention of this devastating disease. In this paper, we present an empirical study to characterize how predictable an individual subject’s future AD trajectory is, several years in advance, based on rich multi-modal data, and using modern deep learning methods. Crucially, the machine learning strategy we propose can handle different future time horizons and can be trained with heterogeneous data that exhibit missingness and non-uniform follow-up visit times. Our experiments demonstrate that our strategy yields predictions that are more accurate than a model trained on a single time horizon (e.g., 3 years), which is common practice in prior literature. We also provide a comparison between linear and nonlinear models, verifying the well-established insight that the latter can offer a boost in performance. Our results also confirm that predicting future decline for cognitively normal (CN) individuals is more challenging than for individuals with mild cognitive impairment (MCI). Intriguingly, however, we discover that prediction accuracy decreases with increasing time horizon for CN subjects, but the trend is in the opposite direction for MCI subjects. Additionally, we quantify the contribution of different data types in prediction, which yields novel insights into the utility of different biomarkers. We find that molecular biomarkers are not as helpful for CN individuals as they are for MCI individuals, whereas magnetic resonance imaging biomarkers (hippocampus volume, specifically) offer a significant boost in prediction accuracy for CN individuals. Finally, we show how our model’s prediction reveals the evolution of individual-level progression risk over a five-year time horizon. Our code is available at


Alzheimer’s Disease (AD) is the most common type of dementia among the elderly population, accounting for nearly 70% of all dementia patients [1] and ranking as the seventh leading cause of death in the United States [2]. Many mechanisms in the development of AD have been uncovered by decades of experimental and clinical studies [3–5], but the jigsaw is still unsolved. In the realm of AD, public and private databases [6–10] serve as important resources for the application of machine learning algorithms that help characterize disease heterogeneity [11], guide therapy [12], and develop and evaluate potential treatments [13, 14]. An area where machine learning can play a crucial role is the prediction of future clinical decline at the individual level, which can inform clinical and other life decisions.

Clinically, in the context of Alzheimer’s disease, individuals are classically grouped into one of the following three stages: cognitively normal (CN), mild cognitive impairment (MCI), and AD dementia. MCI is considered a high-risk, transitionary stage between healthy aging and dementia. Future clinical decline toward dementia is considered to be more predictable in those with MCI than in CN individuals [15, 16]. As our results further corroborate, CN-to-MCI conversion is inherently a more difficult prediction problem than MCI-to-AD conversion. The vast majority of the published studies dealing with individual-level future decline predictions with machine learning focus on early detection of MCI-to-AD conversion [17].

There is growing literature showing factors, including certain biomarkers, that predict progression from CN to MCI [18, 19]. This work has converged to suggest that the risk associated with specific factors is relatively small, and requires a long follow-up to observe the effect. Furthermore, papers tackling the CN-to-MCI conversion prediction problem with machine learning have been relatively limited [20, 21]. Thus, there is a need to understand what combination of variables can yield accurate individual-level predictions; and whether the significance of the variables changes as a function of disease stage (e.g., CN or MCI at baseline).

Many prior studies have primarily focused on building models that predict future conversion within a specific time horizon, e.g., three years [22, 23]. Although some of these studies test their modeling strategy for variable follow-up years, they do this by training new models for each time horizon. Studies that utilize survival (event-time) models can, in theory, make predictions for any future time-point [24], yet, they need to make strong assumptions about the evolution of the underlying hazard function, which might potentially limit performance.

In this work, we rely on the neural network (deep learning) framework, which gives us the flexibility to explore the effect of various modeling choices, namely nonlinearity and predicting arbitrary time horizons, while holding other design parameters constant. We implement three different classifiers, two of which are trained to predict the clinical status (CN, MCI, or AD) at a single time point in the future. The first of these two models is linear and called the “Linear Single-year Model (LSM)”. The other one is nonlinear and referred to as the “Nonlinear Single-year Model (NSM)”. In the third model, we employ a nonlinear architecture and modify it to make it capable of predicting the clinical status at any time point in the future. We refer to this as the “Nonlinear Multi-year Model (NMM)”. Comparing these models allows us to deduce the predictive importance of nonlinear models and of accounting for different time horizons. For instance, our results verify that a linear model (LSM) can yield high-quality predictions for the relatively easy MCI-to-AD conversion prediction problem, whereas higher capacity nonlinear models are needed for making more accurate predictions about the more challenging task of predicting CN-to-MCI conversion.

Our analyses further offer some novel insights. In CN individuals, predicting who will convert to MCI within a year is easier than predicting for a 5-year time window. For individuals with MCI, however, the situation is reversed. It is harder to predict who will progress to AD in the shorter term. We train our models to handle arbitrary missingness patterns in the input data, which in turn, allows us to interrogate the predictive value of different data types. For example, our results suggest that the molecular biomarkers we consider in our study are very helpful in the MCI stage, but not as much in CN individuals. On the other hand, structural magnetic resonance imaging (MRI) biomarkers (hippocampus volume, specifically) offer a significant performance boost for CN-to-MCI conversion but not MCI-to-AD conversion. Finally, we present our model’s prediction of future conversion risk as a continuous function of time, which reveals different trajectories.

Materials and methods

In this section, we first present the dataset, which was derived from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database [25]. We then discuss important implementation details that were critical in handling missingness and unbalanced classes at different time points.


All participants used in this work are from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. ADNI aims to evaluate the structure and function of the brain across different disease states and uses clinical measures and biomarkers to monitor disease progression. We select the participants who did not have clinical AD at the baseline (screening) visit and had at least one follow-up diagnosis within the next five years (n = 1404). We excluded the CN baseline participants who converted to AD in the next five years because they are very few (n = 6), the CN baseline participants who converted to a later stage but reverted to an earlier stage (n = 23), and the MCI baseline participants who either were diagnosed as CN in a later follow-up year (n = 87) or converted to AD and then reverted (n = 18), since these subjects might have been diagnosed incorrectly at some point. Including these individuals did not alter our main conclusions. After the exclusions, we are left with 1404 participants. We note that all individuals were either CN or MCI at baseline. Table 1 lists summary statistics for the participants, including sex, age, number of years of education completed, count of Apolipoprotein E4 (APOE4) alleles, Clinical Dementia Rating (CDR), and Mini Mental State Examination (MMSE) scores at baseline.

Table 1. Summary statistics of the participants at baseline.

A critical aspect of the data, as is common in many real-world longitudinal studies, is that there are missing follow-up visits, with imperfect timings, and many subjects drop out of the study before the planned completion. Table 2 shows the number of available subjects in each diagnostic group for annual follow-up visits. We note that Table 2 can be used to infer the number of subjects progressing from one stage to another during follow-up years of interest. In Table 2, and in all subsequent analyses, any subject who progressed (from CN to MCI, or from MCI to AD) before dropping out of the study was considered to remain MCI or AD until year 5. Non-converter subjects without a visit in a certain follow-up year are not used for either training or testing in that follow-up year. Some subject groups have different follow-up visit schedules. For example, in ADNI-2 and ADNI-3, CN baseline participants are only clinically evaluated every other follow-up year. Therefore years 1 and 3 have fewer CN diagnoses than years 2 and 4, respectively. ADNI includes a limited number of subjects who have been monitored for more than five years. However, due to the very limited number of visits beyond 5 years, we excluded these timepoints from our analyses.
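The label convention described above (converters keep their converted status through year 5, while non-converters without a visit are left out for that year) can be illustrated with a minimal sketch. The function name `annual_labels` and the dictionary-based visit format are hypothetical and chosen only for illustration; they are not taken from the study's code.

```python
from typing import Dict, List, Optional

# Ordering of clinical stages, used to detect progression past the baseline stage.
STAGE_ORDER = {"CN": 0, "MCI": 1, "AD": 2}


def annual_labels(baseline: str, visits: Dict[int, str], horizon: int = 5) -> List[Optional[str]]:
    """Return one label per follow-up year (1..horizon).

    `visits` maps a follow-up year to the diagnosis observed in that year.
    A value of None in the output means the subject is not usable for that year.
    """
    labels: List[Optional[str]] = []
    converted_to: Optional[str] = None
    for year in range(1, horizon + 1):
        dx = visits.get(year)
        if dx is not None and STAGE_ORDER[dx] > STAGE_ORDER[baseline]:
            converted_to = dx
        if converted_to is not None:
            labels.append(converted_to)  # converter: status is carried forward
        else:
            labels.append(dx)  # non-converter: None when the visit is missing
    return labels
```

For example, a CN subject seen as CN in year 1 and MCI in year 2 who then drops out would be labeled MCI for years 2 through 5 under this sketch.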

Table 2. The number of available subjects in each diagnostic group for annual follow-up visits.

Input features

We use the clinical data and biomarkers collected at baseline as our input features. Clinical data includes subject demographics (age, gender, number of years of education completed, ethnicity, race, marital status), genotype (number of APOE4 alleles), clinical assessments (clinical dementia rating, or CDR; Activities of Daily Living, or FAQ; Everyday Cognition, or ECog), cognitive assessments (Mini-Mental State Exam, or MMSE; Alzheimer’s Disease Assessment Scale, or ADAS-Cog; Montreal Cognitive Assessment, or MoCA; Rey Auditory Verbal Learning Test Trials 1–6; Logical Memory Delayed Recall; Trail Making Test Part B; Digit Symbol Substitution; and the Digit and Trails B versions of the Preclinical Alzheimer’s Cognitive Composite score [26, 27]), and the baseline diagnosis (CN or MCI). The biomarkers are Cerebrospinal Fluid (CSF) measurements [28, 29] (Amyloid-Beta 1–42; Total Tau, or T-Tau; Phosphorylated Tau, or P-Tau), Magnetic Resonance Imaging (MRI) volume measurements [30, 31] (Ventricles; Hippocampus; WholeBrain; Entorhinal; Fusiform; MidTemp; Intracranial Volume, or ICV; all computed using the FreeSurfer software [32–51]), and Positron Emission Tomography (PET) standardized uptake value ratio (SUVR) scores (for the following tracers: Fluorine-18-Fluorodeoxyglucose, or FDG; Florbetapir, or AV45; Pittsburgh Compound, or PIB) [52, 53]. We note that CSF, FDG, and PIB biomarkers are referred to as molecular biomarkers. Furthermore, we employ single, global PET SUVR measurements instead of regional values.

The regional volume measurements derived from the MRI scans were computed, quality controlled and publicly made available by UCSF researchers. In this pipeline, the images are processed to implement the following steps: Talairach transform computation, intensity normalization, skull stripping, creation of the white-matter and pial surfaces, segmentation of the gray and white matter volumetric regions of interest and creation of the cortical parcellation as described in [54]. Note that we input ICV as a separate feature, instead of normalizing other volumetric MRI measurements with it.

The ADNI study consists of multiple phases (1, GO, 2, and 3). Each phase implemented a slightly different data acquisition protocol. Additionally, as mentioned above, follow-up data collection was also heterogeneous, with missing visits and non-uniform visit intervals. The degree of missingness for the different baseline data modalities is shown in Table 3 for the 1404 participants we use. Rather than dropping the subjects with incomplete baseline data, we substitute placeholder values for their missing features and keep them in our dataset. Our substitution procedure consists of two parts. First, we record the binary missingness mask for the feature set. Each participant has their own binary missingness mask indicating which variables are observed for that particular individual. Then, following prior work [55], we perform mode substitution for missing categorical variables (sex, ethnicity, race, marital status, APOE4), and mean substitution for missing numerical variables. It is true that when the class labels are unbalanced, these substituted values will be biased by the values in the majority class. However, we would like to emphasize that we consider these substituted values as dummy placeholders. In other words, we use mode/mean values solely to make sure that the substituted values are within an appropriate range and thus the numerical optimization is not compromised. By concatenating the missingness mask to the feature vector, we expect the model to learn to treat these placeholders appropriately. To prevent any information leakage, we substitute the mode and mean values computed in the non-missing portion of the training set as placeholders for the missing values in training, validation, and test sets. We compute a single mode/mean value for each feature, weighting all non-missing values in the training set equally.
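The two-part substitution procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: the function names, the dictionary-per-participant layout, the use of `None` for missing entries, and the convention that a mask value of 1 marks a missing feature are all hypothetical.

```python
from collections import Counter
from statistics import mean


def fit_placeholders(train_rows, categorical):
    """Compute one placeholder per feature from non-missing TRAINING values only.

    Categorical features get the mode, numerical features get the mean.
    """
    placeholders = {}
    for name in train_rows[0]:
        observed = [row[name] for row in train_rows if row[name] is not None]
        if name in categorical:
            placeholders[name] = Counter(observed).most_common(1)[0][0]
        else:
            placeholders[name] = mean(observed)
    return placeholders


def impute_with_mask(row, placeholders):
    """Return (filled_row, mask); mask[name] is 1 where the value was missing."""
    filled = {k: (placeholders[k] if v is None else v) for k, v in row.items()}
    mask = {k: int(v is None) for k, v in row.items()}
    return filled, mask
```

The same `placeholders` dictionary, fitted once on the training portion, would then be applied unchanged to validation and test rows, which mirrors the leakage precaution described in the text.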

Table 3. The degree of missingness (%) in different baseline data modalities for two patient groups.

The categorical variables except the baseline diagnosis are one-hot encoded (i.e., are represented with dummy variables encoding presence) and numerical variables are z-score normalized in the last step of feature processing. We note that the z-score normalization is first performed on training data and mean and variance values are saved to be used for validation and test data later on, which is similar to what we do in the second step of the imputation procedure.
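As a small illustration of fitting the normalization on training data and reusing the saved statistics elsewhere, consider the sketch below. The helper names are hypothetical, and the use of the population standard deviation is an assumption; the text does not specify which estimator was used.

```python
from statistics import mean, pstdev


def fit_zscore(train_values):
    """Estimate mean and standard deviation from training values only."""
    return mean(train_values), pstdev(train_values)


def apply_zscore(values, mu, sigma):
    """Normalize any split (train, validation, or test) with the saved statistics."""
    return [(v - mu) / sigma for v in values]
```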

We have six discrete (categorical) variables, which are sex (encoded as a one-hot vector for either female or male), ethnicity (encoded as a one-hot vector for either Hispanic/Latino or not Hispanic/Latino), race (encoded as a one-hot vector for one of the following: Asian, Black, Hawaiian/Other PI, Indian/Alaskan, More than one, White), marital status (encoded as a one-hot vector for one of the following: divorced, married, never married, widowed), number of APOE4 alleles (encoded as 0, 1 or 2 copies of the E4 allele), and baseline diagnosis (a scalar that is 0 for CN and 1 for MCI). Real-valued variables are the number of years of education completed, clinical test scores, cognitive assessments, and biomarker values. All real-valued (numerical) features are scalars except for the clinical assessment of ECog, and the cognitive assessments of ADAS-Cog, and Rey Auditory Verbal Learning Test Trials 1–6, which are vectors with dimensionalities 14, 3, and 4, respectively. In total, we have 45 real-valued features. We note that we also compute a binary missingness mask for the input feature set. In the mask, there is no entry for the baseline diagnosis, since that variable has no missingness. Therefore, the binary missingness mask has a dimensionality of 50. Concatenating the categorical features, numerical features, and binary missingness mask yields a feature vector of length 113.
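The dimensionality bookkeeping above can be checked with a few lines of arithmetic. The one-hot widths for sex, ethnicity, race, and marital status follow the category lists in the text; treating the APOE4 count as a width-3 one-hot vector is an assumption made here because it makes the totals agree with the stated length of 113.

```python
# Width of each categorical block in the feature vector.
# The APOE4 width of 3 (one slot per allele count 0, 1, 2) is an assumption.
categorical_widths = {
    "sex": 2,
    "ethnicity": 2,
    "race": 6,
    "marital_status": 4,
    "apoe4": 3,
    "baseline_dx": 1,  # scalar: 0 for CN, 1 for MCI
}

n_numerical = 45  # real-valued entries, counting each element of vector-valued scores

# One mask entry per numerical entry and per categorical variable,
# except the baseline diagnosis, which is never missing.
mask_width = n_numerical + (len(categorical_widths) - 1)

feature_length = sum(categorical_widths.values()) + n_numerical + mask_width
print(mask_width, feature_length)
```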


We are interested in the prediction of an individual’s future diagnostic status (CN, MCI, or AD) based on the input features at a baseline visit. A large number of studies have looked at this question as a classification problem, often at a single follow-up time, e.g., three years after the baseline. However, this formulation has two drawbacks. First, many subjects might drop out of the study before that follow-up time, which means these subjects cannot be used for training. Secondly, this approach groups together subjects who convert after the intended follow-up time, with those who are stable through the study. To distinguish these two subject groups, one would need to train a new model from scratch corresponding to a different follow-up time. An alternative approach that addresses these issues involves survival modeling [56]. However, these methods require strong assumptions about the underlying hazard function that can constrain the model’s performance.

All our models follow the neural network architecture template depicted in Fig 1, which was designed based on optimizing empirical performance on validation data in a single split. The output is a length-3 probability vector, computed by a soft-max layer, that corresponds to the individual’s CN, MCI, and AD probabilities at the future time point. In the single-year models, the prediction is for a fixed follow-up time and thus time-to-follow-up is not an input feature. Therefore, the single-year models have an input layer width of 113, and the input layer width of the multi-year model is 114. We train a separate single-year model for each time horizon. The linear single-year model, LSM, is made up of linear (fully connected) layers, whereas its nonlinear counterpart, NSM, contains nonlinear activation functions, which are elementwise rectified linear units, or ReLUs, between linear layers. The most flexible model, the nonlinear multi-year model, or NMM, accepts the time-to-follow-up (in months) as an input feature and computes the output corresponding to that input value. Thus, we train a single NMM for all follow-up times. The NMM can be viewed as a family of models, parameterized by the follow-up time. We note that all three models have roughly the same number of learnable parameters.
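To make the distinction between the three models concrete, the following is a minimal pure-Python sketch of the forward pass. It is not the authors' code: the helper names, the random Gaussian initialization, and feeding the follow-up time as raw months appended to the feature vector are assumptions. The hidden-layer widths are those reported in the experimental details.

```python
import math
import random

# Hidden widths as reported in the experimental details of the paper.
HIDDEN_WIDTHS = [128] * 3 + [64] * 5 + [32] * 2


def linear(x, W, b):
    """Fully connected layer: W @ x + b for plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def init_layers(widths, seed=0):
    """Randomly initialized (W, b) pairs; the initialization scheme is an assumption."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers


def forward(x, layers, nonlinear=True):
    """nonlinear=False gives an LSM-style stack; nonlinear=True adds ReLUs (NSM/NMM)."""
    h = x
    for W, b in layers[:-1]:
        h = linear(h, W, b)
        if nonlinear:
            h = relu(h)
    W, b = layers[-1]
    return softmax(linear(h, W, b))  # probabilities for CN, MCI, AD


def nmm_predict(features, months, layers):
    """Multi-year model: the time-to-follow-up is one extra input (113 + 1 = 114)."""
    return forward(features + [float(months)], layers)
```

With this layout, a single set of NMM weights can be queried at any follow-up time by changing only the `months` argument, whereas the single-year variants would need one set of weights per horizon.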

Fig 1. Feed-forward, fully-connected neural network architecture.

Nonlinear models have rectified linear units (ReLU) between layers. The final layer implements a soft-max. $L_l$: number of neurons in layer $l$. Input features include the following. Demog.: Demographics. Clinical Assess.: Clinical Assessments. Cog. Assess.: Cognitive Assessments. Baseline DX: Baseline Diagnosis. prob.: probability.

We also experimented with two alternative models. The alternative single-year linear model was a standard linear regression model, implemented as a single fully connected layer neural network, with an L2 penalty (weight decay) on the weights (coefficients). This is equivalent to a ridge regression approach. As the results presented in the Supplementary Material demonstrate, this model performs no better than the LSM model described above. The second alternative is a slight modification of the NMM model, where the input time-to-follow-up feature is encoded as the closest annual visit time. This model (results in S1 and S2 Tables) performed very close to the NMM model we present here.

Loss function

We use categorical cross-entropy loss to train our neural networks:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \langle y_i, \log \hat{y}_i \rangle \qquad (1)$$

where $N$ is the number of training datapoints, $\langle \cdot, \cdot \rangle$ is the inner product operator, $y_i$ is the one-hot encoded vector for the ground truth label of sample $i$ (i.e., is a dummy variable vector encoding presence), and $\hat{y}_i$ is the probability vector of the sample point $i$ that is computed by the classifier. The expression in Eq (1) represents the average of the losses across the entire training dataset, which implies that each sample has the same weight. This is undesirable in unbalanced problems because the contribution of the majority class to the loss function will be higher than that of the minority class.
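As a worked illustration of the unweighted categorical cross-entropy described above, the sketch below averages the negative log-probability assigned to the true class. The function name and the small clipping constant `eps` (added to avoid taking the logarithm of zero) are assumptions of this sketch.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Average of -<y_i, log(p_i)> over all samples.

    y_true: list of one-hot vectors; y_prob: list of predicted probability vectors.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total -= sum(y_k * math.log(max(p_k, eps)) for y_k, p_k in zip(y, p))
    return total / len(y_true)
```

Because every sample contributes equally to the average, a class with many samples dominates the sum, which is the imbalance concern raised in the text.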

As can be seen in Table 2, there are two types of imbalance we need to consider.

  1. The distribution of class labels varies significantly over the years. The number of CN baseline participants who convert to MCI is smaller than the number of non-converters over the five-year period, but the relative difference shrinks in later years. For participants who are MCI at baseline the situation is more drastic. Those who convert to AD represent a small minority in the early years, yet MCI-to-AD converters are the majority at the 5-year mark.
  2. The number of available clinical labels decreases with each follow-up year, since individuals drop out of the longitudinal study.

Naively using the loss term in Eq (1) would encourage certain types of errors. For example, the model would care less about accurately classifying CN-to-MCI converters, particularly in earlier years. This would affect all three of our models. The second imbalance factor, on the other hand, would exclusively impact the performance of NMM. With the loss in Eq (1), NMM would pay less attention to later follow-up years than earlier ones which means the performance of NMM would suffer in longer time horizons. We note that the LSM and NSM approaches are not affected by this because a separate model is trained for each follow-up year.

There are well-established ways of addressing imbalance issues. Under-sampling the majority class, over-sampling the minority class, and using re-weighted loss functions are the most popular options. In this work, we employ a loss re-weighting scheme. In this approach, the model is penalized more for an error in the minority class than an error in the majority class, using sample-level weights. The loss function we use in this work is

$$\mathcal{L}_w = -\frac{1}{N} \sum_{i=1}^{N} w_i \langle y_i, \log \hat{y}_i \rangle \qquad (2)$$

where $w_i$ is the weight of the sample point $i$. We propose a weighting scheme that accounts for both sources of imbalance we discussed above. Although converter CN baseline participants and non-converter MCI baseline participants belong to the same ground-truth class, we do not weigh those samples equally as they represent different prediction scenarios. Therefore, for a given follow-up year, we consider four possible categories of participants: CN baseline non-converters, CN baseline converters, MCI baseline non-converters, and MCI baseline converters. In the first step, we compute the weights for each category as one over the size of each group. In step 2, we scale these weights so that the total weight of each follow-up year is equal. By doing so, we are addressing the second imbalance.
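The two-step weighting scheme can be sketched as follows. The tuple layout `(year, baseline, converted)` and the choice to normalize each follow-up year to a total weight of exactly 1 are assumptions of this illustration; the text only states that the yearly totals are made equal.

```python
from collections import Counter


def sample_weights(samples):
    """Compute one weight per labeled visit.

    samples: list of (year, baseline, converted) tuples, e.g. (1, "CN", False).
    """
    group_sizes = Counter(samples)
    # Step 1: inverse size of each (year, baseline, converter) category.
    raw = [1.0 / group_sizes[s] for s in samples]
    # Step 2: rescale so that every follow-up year carries the same total weight.
    year_totals = Counter()
    for (year, _, _), w in zip(samples, raw):
        year_totals[year] += w
    return [w / year_totals[year] for (year, _, _), w in zip(samples, raw)]
```

In a toy year with three CN non-converters and one CN converter, the single converter ends up with three times the weight of each non-converter, so both categories contribute equally to the loss for that year.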

In the single-year models, the ground-truth clinical status, e.g., for one year after the baseline, was the diagnosis made at the visit corresponding to that follow-up time. We note that the timing of this visit is typically not exact and can deviate by several months. For example, a planned 1-year follow-up could have occurred around 15 months after the baseline visit. Thus, the ground-truth labels can be viewed as noisy. On the other hand, for the NMM model, since the follow-up time is not fixed and is treated as an input variable (coded in months), the ground-truth label can be viewed as more accurate. That said, as we described above, we implemented a version of the NMM model that accepts the rounded annualized visit times as input (see Supplementary Material) and we observed no meaningful difference in results.

Experimental details

We implemented a randomized, diagnosis stratified 80–20 split of the data into train-test sets. We repeated this 80–20 split 200 times and all presented results are averaged over these repeats. For each 80–20 split, we also conducted a 5-fold cross-validation on the train sets, where the validation loss was used for early stopping. For each cross-validation fold and modeling choice, we trained 5 different models with different random initializations. Thus, for each test case, the final prediction is computed as the average of the 25 model predictions (5 cross-validation and 5 random initializations each). For the NMM model, we ensure that all longitudinal follow-up data for a participant is in the same partition. All three of our models use the same data splits.
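Two ingredients of this protocol, keeping all of a participant's follow-up data in one partition and averaging the predictions of the 25 trained models, can be sketched as below. The function names are hypothetical, and stratifying by baseline diagnosis is an assumption about what "diagnosis stratified" refers to.

```python
import random


def split_subjects(subject_dx, test_fraction=0.2, seed=0):
    """Stratified split at the SUBJECT level.

    subject_dx: dict mapping subject id -> baseline diagnosis.
    Returns (train_ids, test_ids); every visit of a subject follows its id.
    """
    rng = random.Random(seed)
    train, test = set(), set()
    for dx in sorted(set(subject_dx.values())):
        ids = sorted(s for s, d in subject_dx.items() if d == dx)
        rng.shuffle(ids)
        n_test = round(test_fraction * len(ids))
        test.update(ids[:n_test])
        train.update(ids[n_test:])
    return train, test


def ensemble_average(predictions):
    """Elementwise mean of probability vectors, e.g. from 5 folds x 5 initializations."""
    n = len(predictions)
    return [sum(p[k] for p in predictions) / n for k in range(len(predictions[0]))]
```

Repeating `split_subjects` with different seeds would give the repeated random splits; splitting by subject id rather than by visit is what prevents one participant's follow-ups from appearing in both partitions.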

The model architecture and hyperparameter values are fixed for NMM and single-year models, with the only difference being the absence of the Δt input neuron in LSM and NSM. These choices were manually determined based on inspecting the validation loss in a single split. The architecture, illustrated in Fig 1, has 3 hidden layers with a width of 128, followed by 5 hidden layers of width 64 and 2 hidden layers of width 32. We perform early stopping during training based on the validation loss. We employ an L2 penalty loss on the weights and biases, with a weight of $10^{-6}$ in each hidden layer. We implement dropout after each hidden layer with a rate of 0.2. Our optimizer is Adam with a learning rate of $10^{-5}$. We use the softmax activation function at the output layer.
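The layer widths listed above determine the number of learnable parameters. The short sketch below counts weights and biases for a fully connected stack, assuming every layer has a bias term and that the output layer has three units (one per diagnostic class); the function names are hypothetical.

```python
def hidden_widths():
    """3 layers of width 128, then 5 of width 64, then 2 of width 32."""
    return [128] * 3 + [64] * 5 + [32] * 2


def count_parameters(n_inputs, n_outputs=3):
    """Weights plus biases of a fully connected network with the widths above."""
    widths = [n_inputs] + hidden_widths() + [n_outputs]
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))


print(count_parameters(114), count_parameters(113))
```

Under these assumptions, the multi-year model (114 inputs) has only 128 more parameters than a single-year model (113 inputs), which is consistent with the statement that the models have roughly the same number of learnable parameters.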

In order to explore the influence of the network architecture on results, we conduct two more experiments where we use the same training strategy, activation functions and hyperparameters as NMM; however, we modify the architecture that is illustrated in Fig 1. The first experiment uses an elementary 3-layer nonlinear architecture with fewer parameters than NMM. The second experiment is a computationally expensive practice where we optimize the nonlinear architecture via a grid search over depth and width hyperparameters in each one of the 200 train/test splits. In this experiment, each test set has its optimal architecture, identified by the hyperparameter values that yield the best performance on a validation set. Results of both experiments are presented in S1 Fig. We note that the overall trend and pattern of the NMM’s results hold consistent across these architectural choices. The 3-layer model’s performance is slightly worse and the optimized architecture results are the best, as expected. The 11-layer model we present in Fig 1 under-performs slightly compared to the optimized architectures. We note that the 11-layer model was manually designed to optimize performance in a single train/validation split.

We analyze prediction accuracy in two different patient groups: CN at baseline and MCI at baseline. Therefore, each result we show has two parts, one corresponding to the CN-to-MCI conversion task and the other to the MCI-to-AD conversion task. In each task, clinical conversion is considered a positive event. For example, for the CN-to-MCI conversion task, a true positive sample refers to the subject progressing to MCI and the model predicting this correctly. Accordingly, the true positive rate is defined as the ratio of true positive samples against all converter subjects, and the false positive rate is the ratio of false positives against all stable subjects.
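The definitions of true and false positive rates given above translate directly into code. In this sketch, `is_converter` holds the ground truth (True for a subject who progressed) and `predicted_converter` holds the thresholded model decision; both names are hypothetical.

```python
def tpr_fpr(is_converter, predicted_converter):
    """Return (true positive rate, false positive rate).

    TPR = true positives / all converters.
    FPR = false positives / all stable (non-converter) subjects.
    """
    pairs = list(zip(is_converter, predicted_converter))
    tp = sum(1 for y, p in pairs if y and p)
    fp = sum(1 for y, p in pairs if not y and p)
    n_pos = sum(1 for y in is_converter if y)
    n_neg = len(is_converter) - n_pos
    return tp / n_pos, fp / n_neg
```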

Due to the heavily unbalanced nature of the data, which can be seen in Table 2, we use the receiver operating characteristic (ROC) curve to inspect the performance of our models. Although there are no subjects who convert from CN to AD or from MCI to CN in our dataset, we do not implement any mechanism to prevent our models from making such predictions. Therefore, both CN-to-MCI conversion and MCI-to-AD conversion tasks are multi-class problems for our models. There are two different types of ROC analyses for multi-class problems: one-versus-one analysis and one-versus-rest analysis. As we demonstrate with our results, our models capture the disease progression dynamics sufficiently well that both analyses give nearly identical ROC curves. In other words, the predicted AD probability for CN baseline subjects and the predicted CN probability for MCI baseline subjects are close to 0. Thus, we only share one-vs-rest results where the positive class for CN-to-MCI conversion is MCI and for MCI-to-AD conversion it is AD.
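A one-vs-rest ROC AUC for a three-class probability output can be computed from the predicted probability of the positive class alone. The sketch below uses the rank-based (pairwise comparison) definition of the AUC; the function names and the class ordering (CN, MCI, AD) in the probability vector are assumptions.

```python
def roc_auc(scores, is_positive):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_vectors, labels, positive_class):
    """Score each subject by the predicted probability of `positive_class`."""
    idx = {"CN": 0, "MCI": 1, "AD": 2}[positive_class]
    scores = [p[idx] for p in prob_vectors]
    return roc_auc(scores, [label == positive_class for label in labels])
```

For CN baseline subjects, `positive_class` would be "MCI"; for MCI baseline subjects, it would be "AD".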

The area under the ROC curve (ROC AUC) is a scalar that summarizes the overall performance of a classifier. Our data has a time horizon of five follow-up years and we evaluate against annual diagnoses. For each of our three models, we compute an ROC AUC value corresponding to each follow-up year and each baseline group (CN baseline and MCI baseline). Therefore, each model has five ROC AUC values associated with each baseline group. To statistically compare the ROC AUC values achieved by two different models, we implement a pairwise permutation testing strategy, yielding a p-value for the null hypothesis that the two models’ predictions are indistinguishable. Our test statistic is the difference between the mean ROC AUC values (averaged over the annual follow-ups) for the two models in a given 80–20 train-test split. We then average this over all 200 random splits of our data. To create the null distribution of the test statistic, we randomly permute ($10^5$ times) the two models when computing the ROC AUC difference for each split. Finally, the normalized rank of the observed (unpermuted) test statistic value among all sorted (permuted) test statistic values yields the p-value, which we denote with ρ.
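The pairwise permutation test can be sketched as follows. Swapping the two models within a split is equivalent to flipping the sign of that split's AUC difference. The function name, the two-sided comparison on absolute values, and the smaller default number of permutations are assumptions of this sketch rather than details taken from the text.

```python
import random


def permutation_p_value(auc_a, auc_b, n_perm=10_000, seed=0):
    """Paired permutation test on per-split mean ROC AUC values.

    auc_a, auc_b: one value per train-test split for each model,
    already averaged over the annual follow-ups.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(auc_a, auc_b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        # Randomly swap the two models in each split (sign flip of the difference).
        permuted = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        if abs(permuted) >= abs(observed):
            hits += 1
    return hits / n_perm
```

When the two models give identical AUC values in every split, every permuted statistic equals the observed one and the function returns 1; a consistent difference across many splits yields a value near 0.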


Impact of modeling choices

In CN-to-MCI conversion, we observe that there is a substantial difference between the linear and nonlinear models. For example, for the 1-year follow-up, LSM yields 83.88% ROC AUC, whereas its nonlinear counterpart, NSM, achieves 88.73%. This difference remains stable over all follow-up years and is statistically significant (ρ < 0.0001). The multi-year training strategy, on the other hand, further boosts prediction accuracy. For instance, for the 1-year follow-up, NMM achieves an ROC AUC of 90.40%. The difference with the NSM model is consistent over the follow-up years and statistically significant (ρ = 0.0001). Finally, we note that for CN-to-MCI conversion, all models tend to achieve worse performance as the time-horizon increases. For instance, the best-performing NMM model suffers more than a 6% drop in ROC AUC between 1- and 5-year follow-up predictions. This result suggests that it is easier to predict who will convert from CN to MCI in the relatively short term, say within a year, than in the longer term, say within 5 years.

We notice that the performance of LSM fluctuates as a function of the time horizon. There are two local minima, one at 2- and another at 4-year follow-up. This is likely because those two years include a higher percentage of CN subjects, due to the study design of ADNI 2 and 3, as can be seen in Table 2. We see that this affects the performance of the nonlinear single-year model, NSM, too. However, for the NMM the issue is mitigated, which is likely because the multi-year model can leverage the data from the other follow-up years to “smooth out” its predictions.

In MCI-to-AD conversion, there is an overall diminished difference between the performance of the three models. For the single-year models, the linear and nonlinear counterparts are statistically indistinguishable (ρ = 0.3735). The multi-year model, on the other hand, offers a statistically significant (ρ = 0.0004 against LSM, ρ = 0.0014 against NSM), yet subtle boost in ROC AUC, specifically for 1- and 2-year follow-ups. In the remaining follow-up years, all three models achieve essentially the same performance level. The most striking observation from the MCI-to-AD conversion results is that prediction accuracy improves for later years, and there is a very consistent increase in ROC AUC values across all modeling choices. This indicates that it is relatively easier to predict who will convert from MCI to AD in the 4–5 year horizon compared to the 1–2 year horizon. Fig 2 shows corresponding ROC curves of NMM for each follow-up year and each patient group.

Fig 2. ROC curves of NMM for CN-to-MCI and MCI-to-AD conversion in five-year time horizon.

Displayed are averages of 200 train-test splits.
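For readers less familiar with the metric, the area under an ROC curve equals the probability that a randomly chosen converter receives a higher predicted score than a randomly chosen non-converter. The sketch below computes it directly from that definition; the function name is hypothetical, and a library routine would normally be used instead.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for converters, 0 for non-converters.
    scores: predicted conversion probabilities. Ties count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The pairwise loop is quadratic in the number of subjects, which is acceptable for a small test split but not for large cohorts.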

Contribution of different biomarkers

As mentioned above, our models are capable of handling missing values in the input. This allows us to inspect the contribution of different data types to prediction accuracy. We perform this analysis on our best performing model, NMM, where we focus only on test participants with complete baseline data and systematically mask each input feature, treating it as missing.

Our baseline scenario is where only clinical data (CD) is available. Fig 4 shows the difference in ROC AUC (Δ ROC AUC) achieved with the utilization of additional biomarkers: FDG PET (a single global marker of sugar metabolism), CSF (global markers of tau and amyloid burden), AV45 PET (a single global marker of brain amyloid load), and MRI volumetric measurements (markers of brain atrophy). For MRI, we consider two scenarios. In the first scenario, we only use the value of hippocampus volume, normalized by the intracranial volume (ICV) (CD+ICV normalized Hippocampus size in Fig 4). In the second scenario, we use seven MRI-derived AD-associated biomarkers (CD+MRI in Fig 4). As a reference, we also show the results for including all available biomarkers in these test subjects that have complete baseline data (CD+All Biomarkers in Fig 4). We were not able to quantify the contribution of PIB PET, as only a very limited number of participants have PIB PET scans.
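The masking procedure can be pictured with a small sketch. Everything in it is hypothetical: the feature names, the grouping, and the use of None as the missing-value marker are placeholders chosen for illustration, and the way the actual model encodes missingness is not shown here.

```python
# Hypothetical feature groups; the real feature lists are longer.
FEATURE_GROUPS = {
    "CD": ["age", "mmse"],
    "FDG": ["fdg"],
    "CSF": ["abeta", "tau"],
    "AV45": ["av45"],
    "HIPPO": ["hippocampus_icv"],
    "MRI": ["hippocampus_icv", "ventricles", "entorhinal"],
}


def mask_features(sample, keep_groups):
    """Return a copy of `sample` where every feature outside `keep_groups` is missing."""
    kept = {name for group in keep_groups for name in FEATURE_GROUPS[group]}
    # Assumption: None stands in for whatever missing-value code the model expects.
    return {name: (value if name in kept else None) for name, value in sample.items()}


# Example: the CD+FDG scenario for one test participant with complete data.
participant = {
    "age": 72.0, "mmse": 29.0, "fdg": 1.25, "abeta": 900.0, "tau": 250.0,
    "av45": 1.1, "hippocampus_icv": 0.005, "ventricles": 0.02, "entorhinal": 0.003,
}
cd_plus_fdg = mask_features(participant, ["CD", "FDG"])
```

Δ ROC AUC for a scenario is then the ROC AUC obtained on the masked inputs of that scenario minus the ROC AUC obtained with `mask_features(sample, ["CD"])`.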

In CN-to-MCI conversion, molecular biomarkers (FDG, CSF, and AV45), by themselves, do not significantly improve performance over the baseline CD-only scenario, particularly beyond the 1-year follow-up (ρ = 0.1898 for CD+FDG, ρ = 0.2082 for CD+CSF, ρ = 0.3001 for CD+AV45). However, we observe a substantial accuracy boost when MRI data are available (ρ < 0.0001), much of which can be attributed to the hippocampus volume (ρ < 0.0001). All biomarkers combined achieve the highest ROC AUC values (ρ < 0.0001). The performance gain grows over the years, suggesting that additional biomarkers are more useful for making longer-term predictions.

Overall, the performance gain offered by additional biomarkers is relatively smaller for the easier MCI-to-AD conversion problem. Here, MRI markers add around 1% ROC AUC to the CD-only baseline. FDG consistently yields a greater boost than the MRI biomarkers in each follow-up year, which is in contrast to what we observe in CN-to-MCI conversion. Crucially, we find that hippocampus volume does not provide a statistically significant performance boost (ρ = 0.2412), while FDG and MRI markers improve the model performance subtly but significantly (ρ < 0.0001 for CD+FDG, ρ = 0.0024 for CD+MRI). Beyond year 1, CSF consistently outperforms AV45 (ρ < 0.0010 for CD+CSF, ρ = 0.0253 for CD+AV45), where the latter yields a boost on par with MRI. This highlights the potential importance of tau markers, particularly in the MCI stage. Overall, however, a striking observation is that the model that has access to all the biomarkers is substantially more accurate than a model with a single biomarker type.

Disease progression risk predictions

Even though we consider the problem as a three-label classification task for a given follow-up time, the underlying process can be viewed as a continuous evolution of MCI and AD dementia risk [57]. Using our NMM model, we can compute a prediction for arbitrary time horizons for the test subjects and interpret the output probabilities as a longitudinal estimation of risk. The softmax outputs of the MCI channel for CN baseline participants and AD channel for the MCI baseline participants are shown in Figs 5 and 6, respectively. We average these values over test subjects who have the same conversion time profile.
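A minimal sketch of how such averaged risk values could be obtained at one queried time horizon is given below. The class ordering (CN, MCI, AD) for the three output channels, the function names, and the record layout are assumptions made for the example.

```python
import math
from collections import defaultdict


def softmax(logits):
    """Numerically stable softmax over a list of raw model outputs."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_risk_by_profile(records, channel):
    """Average the softmax probability of one channel within each profile.

    records: iterable of (conversion_profile, logits) pairs for one horizon.
    channel: index of the class of interest (assumed order: 0=CN, 1=MCI, 2=AD).
    """
    grouped = defaultdict(list)
    for profile, logits in records:
        grouped[profile].append(softmax(logits)[channel])
    return {profile: sum(v) / len(v) for profile, v in grouped.items()}
```

Repeating the call for a grid of time horizons yields one averaged risk curve per conversion profile.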

For individuals who remain stable CN throughout the 5-year follow-up period, we observe that NMM’s predicted MCI probability is consistently below 0.5. Intriguingly, for those stable subjects who were last observed earlier, the predicted MCI probabilities tend to be higher. In fact, for stable CN subjects last seen before the end of year 2, average predicted MCI probabilities exceed 0.5 around the year 4 mark. We emphasize that the model has no access to follow-up information, as the only input is baseline data. For subjects who convert to MCI at year 1, average predicted MCI probabilities exceed 0.5 before the first annual follow-up visit. Similarly, for those who convert around the second year, the average predicted MCI probabilities exceed 0.5 between years one and two. One notable exception is the group of individuals who progress to MCI at the third-year visit. In this group, the NMM prediction is that MCI conversion will happen, on average, at around the 5-year mark.

For the MCI baseline subjects, we observe similar patterns. For the stable subjects, the predicted AD probabilities remain under 0.5 until the last follow-up visit. For MCI-to-AD converters, the average predicted AD probability exceeds 0.5 before the AD diagnosis, except for the subjects who convert at the 5-year follow-up, where the average predicted AD probability is slightly below 0.5 at the 5-year mark. On the other hand, when we examine the timing of the average predicted conversion, it seems to be less accurate than for CN baseline subjects. In most scenarios (e.g. conversion at 2, 3, and 4 years), the average predicted AD probability exceeds 0.5 before the corresponding time interval. This suggests that the NMM tends to predict an earlier MCI-to-AD conversion than observed.
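The notion of a predicted conversion time used in this comparison, namely the time at which the average predicted probability first reaches 0.5, can be made concrete with a short sketch. The linear interpolation between queried time points and the function name are assumptions of the example, not a description of how the figures were produced.

```python
def first_crossing_time(times, risks, threshold=0.5):
    """Earliest time at which the risk curve reaches `threshold`.

    times: increasing time points (e.g. years since baseline).
    risks: predicted conversion probabilities at those time points.
    Returns None if the threshold is never reached.
    """
    if risks[0] >= threshold:
        return times[0]
    for i in range(1, len(times)):
        if risks[i] >= threshold:
            t0, t1 = times[i - 1], times[i]
            r0, r1 = risks[i - 1], risks[i]
            # Assumption: linear interpolation between neighboring queries.
            return t0 + (threshold - r0) * (t1 - t0) / (r1 - r0)
    return None
```

Comparing this value with the observed diagnosis time of a group indicates whether the model anticipates conversion earlier or later than it was recorded.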


Discussion

In this work, we present an empirical study to characterize how predictable an individual subject’s future AD-associated clinical trajectory is, several years in advance, based on rich multi-modal data, and using modern deep learning methods. We present a novel machine learning strategy that can handle variable follow-up time queries, missingness patterns, and unbalanced class labels in the data, to make accurate predictions about the future decline in CN and MCI baseline participants.

Comparing the prediction accuracy for CN-to-MCI and MCI-to-AD conversions in Fig 3, our results verify that the CN-to-MCI conversion prediction is a harder task than the MCI-to-AD conversion prediction. On the other hand, we also confirm that more sophisticated modeling, such as a nonlinear multi-year (NMM) architecture, offers a larger boost for the harder CN-to-MCI conversion prediction task. This verifies that there is a bigger gap in performance between what a relatively simple model can achieve and the upper bound of what is achievable (also known as the Bayes-optimal performance) in the harder problem of CN-to-MCI conversion.

Fig 3. Predictive performance of different models for different follow-up years.

ROC AUC values are averaged across 200 80–20 data splits. Error bars indicate the standard error across these splits. LSM, Linear Single-year Model; NSM, Nonlinear Single-year Model; NMM, Nonlinear Multi-year Model.
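As a small worked example of the summary statistics in this caption, the mean and standard error over splits can be computed as below. The use of the sample standard deviation (n − 1 denominator) is an assumption, and the function name is hypothetical.

```python
import math
import statistics


def mean_and_standard_error(values):
    """Mean and standard error of the mean for per-split ROC AUC values."""
    mean = statistics.fmean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean, standard_error
```

Because random train-test splits of the same dataset overlap, the splits are not independent, so the standard error is best read as a descriptive measure of spread.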

Five years is a relatively short time window for studying CN-to-MCI conversion. On the other hand, in many real-world clinical scenarios, 5 years is a useful horizon to consider. Moreover, we note that at year 5, around 30% of the baseline CN subjects who remained in the study had converted to MCI, as we show in Table 2. Our analysis demonstrates that the prediction of CN-to-MCI conversion gets harder for distant time horizons, and we achieve higher accuracy for shorter time frames. This insight might be useful in detecting those CN subjects who might be on the cusp of developing MCI.

Despite the missingness in the data, Fig 4 suggests that NMM does not rely solely on a single modality. Additional biomarkers, in general, do not make the prediction performance worse. This finding is consistent with the fact that multi-modal data, such as different MRI sequences and various PET tracers, are often combined in the literature for predicting MCI-to-AD conversion [58–60]. However, our results also demonstrate that the predictive value of each additional biomarker can vary. For example, for CN baseline participants, although there is a substantial accuracy increase with the use of MRI, molecular biomarkers (CSF, FDG, and AV45) do not offer a significant boost beyond the first-year horizon. For the prediction of MCI-to-AD conversion, however, the situation is different: molecular biomarkers offer a significant boost. Furthermore, using the different MRI biomarkers together seems to be much more helpful for predicting MCI-to-AD conversion than relying on a single MRI biomarker, namely the ICV-normalized hippocampal volume. These results highlight the importance of characterizing the diagnostic and predictive utility of different data types at different stages of the disease process.

Fig 4. Δ ROC AUC values obtained with NMM by the addition of various biomarker combinations to the clinical data (participant demographics, clinical assessments, and cognitive assessments).

ROC AUC values are averaged across 200 80–20 data splits. Error bars indicate the standard error across these splits. +: Used together. CD, Clinical data; AV45, Florbetapir PET; CSF, Cerebrospinal Fluid; FDG, Fluorine-18-Fluorodeoxyglucose PET; MRI, Magnetic Resonance Imaging; ICV, Intracranial Volume.

One interpretation of the patterns of results we present in this study might be that amyloid or tau-associated biomarker changes have a relatively longer time course than MRI-derived measurements, such as hippocampal volume. Furthermore, MRI markers may be less specific and reflect a multitude of effects that result in atrophy, particularly at later ages. Thus, MRI might predict more proximal decline from CN to MCI, but its utility will be less during the MCI stage, where tau/amyloid markers might offer some specific insights into the Alzheimer’s pathology dynamics that will play out over the next several years.

The conversion risk predictions that we show in Figs 5 and 6 suggest that NMM captures the continuous disease dynamics. However, NMM’s predictions are not always exactly aligned with the timing of events. This issue can be related to various biases in subject recruitment and follow-up in the ADNI [61]. For example, the data suffer from a “temporal bias” [62] that is caused by the fact that baseline visits are not distributed uniformly over latent disease stages. These shortcomings require further investigations, likely demanding novel methodological approaches that can address the selection and temporal biases in the data and possibly exploiting other cohorts, as in [63].

Fig 5. Conversion risk predictions of NMM for CN baseline participants with different ground truth disease trajectories.

Blue line is the average MCI conversion risk with 68% confidence. Red dots represent the observed diagnosis time (x-coordinate) and the observed diagnosis (y-coordinate) of the participants with the corresponding trajectory. Grey dots are for reference.

Fig 6. Conversion risk predictions of NMM for MCI baseline participants with different ground truth disease trajectories.

Blue line is the average AD conversion risk with 68% confidence. Red dots represent the observed diagnosis time (x-coordinate) and the observed diagnosis (y-coordinate) of the participants with the corresponding trajectory. Grey dots are for reference.


Conclusion

We have presented a machine learning approach that uses participants’ multimodal baseline data with arbitrary missingness, to predict their future diagnostic status at any time point. We have demonstrated that our model can capture disease progression dynamics and produce future conversion predictions that are highly accurate. Our analyses allow us to dissect the impact of modeling choices and input data types. We found that molecular biomarkers are more useful for predicting MCI-to-AD conversion than CN-to-MCI conversion. Our results show that MRI features are essential for both types of predictions, yet different types of MRI-derived measurements can be useful in different stages.

Supporting information

S1 Table. Performance of each model in terms of AUC ROC for CN baseline participants.

Data format is mean ± standard error. LSM is a standard linear ridge regression model that is an alternative implementation of LSM (Linear Single-year Model). NMM is a slight modification of NMM (Nonlinear Multi-year Model), where the input time-to-follow-up (Δt) feature is encoded as the closest annual visit time. NSM: Nonlinear Single-year Model.


S2 Table. Performance of each model in terms of AUC ROC for MCI baseline participants.

See caption of S1 Table.


S1 Fig. Predictive performance of NMMs with different architectures.

ROC AUC values are averaged across 200 80–20 data splits. Error bars indicate the standard error across these splits. NMM, Nonlinear Multi-year Model with the architecture shown in Fig 1; NMM (3-layer), Nonlinear Multi-year Model with a three-layer architecture; NMM (optimized), Nonlinear Multi-year Model with optimized architectures for each test set. Details of NMM (3-layer) and NMM (optimized) can be found in S1 Text.


S1 Text. Details of NMM (3-layer) and NMM (optimized).

We use the same hyperparameters and activation functions for NMM (3-layer) and NMM (optimized) as for NMM. NMM (3-layer) has an architecture consisting of 2 hidden layers with a width of 128 and an output layer of width 3. NMM (optimized) architectures for each test split are searched over a 3 × 3 grid, characterized by two parameters: width (W) and depth (D). W represents the number of neurons in the first hidden layer, and it can be 64, 128, or 256. D represents the depth of the architecture in terms of equally wide blocks in Fig 1, i.e., a D of 1 means the architecture has 3 hidden layers of width W; a D of 2 means the architecture has 3 hidden layers of width W, followed by 5 hidden layers of width W/2; and a D of 3 means the architecture has 3 hidden layers of width W, followed by 5 hidden layers of width W/2, followed by 2 hidden layers of width W/4. All architectures have an output layer with a width of 3. The best architecture is chosen by monitoring the validation loss in one of the train/validation splits.
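The 3 × 3 search grid described above can be enumerated with a few lines of Python. The sketch only lists layer widths under the block structure stated in the text; the function and variable names are hypothetical, and no training or model construction is shown.

```python
from itertools import product

WIDTHS = [64, 128, 256]   # W: neurons in the first hidden layer
DEPTHS = [1, 2, 3]        # D: number of equally wide blocks


def layer_widths(width, depth):
    """List the widths of all layers, hidden layers first, output layer last."""
    if depth not in (1, 2, 3):
        raise ValueError("depth must be 1, 2, or 3")
    # Block structure from the text: 3 layers of W, 5 of W/2, 2 of W/4.
    blocks = [(3, width), (5, width // 2), (2, width // 4)]
    layers = []
    for count, block_width in blocks[:depth]:
        layers.extend([block_width] * count)
    return layers + [3]  # output layer of width 3


# All nine candidate architectures, keyed by (W, D).
GRID = {(w, d): layer_widths(w, d) for w, d in product(WIDTHS, DEPTHS)}
```

For example, `GRID[(128, 2)]` lists three layers of width 128, five of width 64, and the output layer of width 3.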



Acknowledgments

The authors would like to acknowledge the anonymous reviewers for their valuable revision of the manuscript.

Data used in preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database. Applications for ADNI data use can be submitted through the ADNI website. Others would be able to access the data in the same manner as the authors. The authors did not have any special access privileges that others would not have. The investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. Michael Weiner serves as the principal investigator for ADNI. A complete listing of ADNI investigators and their affiliations can be found below.

Michael Weiner4, Paul Aisen5, Ronald Petersen6, Clifford R. Jack Jr.6, William Jagust7, John Q. Trojanowki8, Arthur W. Toga9, Laurel Beckett10, Robert C. Green11, Andrew J. Saykin12, John Morris13, Leslie M. Shaw14, Enchi Liu15, Tom Montine16, Ronald G. Thomas5, Michael Donohue5, Sarah Walter5, Devon Gessert5, Tamie Sather5, Gus Jiminez5, Danielle Harvey10, Michael Donohue5, Matthew Bernstein6, Nick Fox17, Paul Thompson18, Norbert Schuff19, Charles DeCArli10, Bret Borowski6, Jeff Gunter6, Matt Senjem6, Prashanthi Vemuri6, David Jones6, Kejal Kantarci6, Chad Ward6, Robert A. Koeppe20, Norm Foster21, Eric M. Reiman22, Kewei Chen22, Chet Mathis23, Susan Landau7, Nigel J. Cairns13, Erin Householder13, Lisa Taylor Reinwald13, Virginia Lee24, Magdalena Korecka24, Michal Figurski24, Karen Crawford9, Scott Neu9, Tatiana M. Foroud12, Steven Potkin25, Li Shen12, Faber Kelley12, Sungeun Kim12, Kwangsik Nho12, Zaven Kachaturian26, Richard Frank27, Peter J. Snyder28, Susan Molchan29, Jeffrey Kaye30, Joseph Quinn30, Betty Lind30, Raina Carter30, Sara Dolen30, Lon S. Schneider31, Sonia Pawluczyk31, Mauricio Beccera31, Liberty Teodoro31, Bryan M. Spann31, James Brewer32, Helen Vanderswag32, Adam Fleisher22, Judith L. Heidebrink20, Joanne L. Lord20, Ronald Petersen6, Sara S. Mason6, Colleen S. Albers6, David Knopman6, Kris Johnson6, Rachelle S. Doody33, Javier Villanueva Meyer33, Munir Chowdhury33, Susan Rountree33, Mimi Dang33, Yaakov Stern34, Lawrence S. Honig34, Karen L. Bell34, Beau Ances35, John C. Morris35, Maria Carroll35, Sue Leon35, Erin Householder13, Mark A. Mintun35, Stacy Schneider35, Angela OliverNG36, Randall Griffith36, David Clark36, David Geldmacher36, John Brockington36, Erik Roberson36, Hillel Grossman37, Effie Mitsis37, Leyla de Toledo-Morrell38, Raj C. Shah38, Ranjan Duara39, Daniel Varon39, Maria T. Greig39, Peggy Roberts39, Marilyn Albert40, Chiadi Onyike40, Daniel D’Agostino II40, Stephanie Kielb40, James E. Galvin41, Dana M. 
Pogorelec41, Brittany Cerbone41, Christina A. Michel41, Henry Rusinek41, Mony J. de Leon41, Lidia Glodzik41, Susan De Santi41, P. Murali Doraiswamy42, Jeffrey R. Petrella42, Terence Z. Wong42, Steven E. Arnold14, Jason H. Karlawish14, David Wolk14, Charles D. Smith43, Greg Jicha43, Peter Hardy43, Partha Sinha43, Elizabeth Oates43, Gary Conrad43, Oscar L. Lopez23, MaryAnn Oakley23, Donna M. Simpson23, Anton P. Porsteinsson44, Bonnie S. Goldstein44, Kim Martin44, Kelly M. Makino44, M. Saleem Ismail44, Connie Brand44, Ruth A. Mulnard45, Gaby Thai45, Catherine Mc Adams Ortiz45, Kyle Womack46, Dana Mathews46, Mary Quiceno46, Ramon Diaz Arrastia46, Richard King46, Myron Weiner46, Kristen Martin Cook46, Michael DeVous46, Allan I. Levey47, James J. Lah47, Janet S. Cellar47, Jeffrey M. Burns48, Heather S. Anderson48, Russell H. Swerdlow48, Liana Apostolova49, Kathleen Tingus49, Ellen Woo49, Daniel H. S. Silverman49, Po H. Lu49, George Bartzokis49, Neill R. Graff Radford50, Francine Parfitt50, Tracy Kendall50, Heather Johnson50, Martin R. Farlow12, Ann Marie Hake12, Brandy R. Matthews12, Scott Herring12, Cynthia Hunt12, Christopher H. van Dyck51, Richard E. Carson51, Martha G. MacAvoy51, Howard Chertkow52, Howard Bergman52, Chris Hosein52, Sandra Black53, Bojana Stefanovic53, Curtis Caldwell53, Ging Yuek Robin Hsiung54, Howard Feldman54, Benita Mudge54, Michele Assaly Past54, Andrew Kertesz55, John Rogers55, Dick Trost55, Charles Bernick56, Donna Munic56, Diana Kerwin57, Marek Marsel Mesulam57, Kristine Lipowski57, Chuang Kuo Wu57, Nancy Johnson57, Carl Sadowsky58, Walter Martinez58, Teresa Villena58, Raymond Scott Turner59, Kathleen Johnson59, Brigid Reynolds59, Reisa A. Sperling60, Keith A. Johnson60, Gad Marshall60, Meghan Frey60, Jerome Yesavage61, Joy L. Taylor61, Barton Lane61, Allyson Rosen61, Jared Tinklenberg61, Marwan N. Sabbagh62, Christine M. Belden62, Sandra A. Jacobson62, Sherye A. Sirrel62, Neil Kowall63, Ronald Killiany63, Andrew E. 
Budson63, Alexander Norbash63, Patricia Lynn Johnson63, Thomas O. Obisesan64, Saba Wolday64, Joanne Allard64, Alan Lerner65, Paula Ogrocki65, Leon Hudson65, Evan Fletcher66, Owen Carmichael66, John Olichney66, Charles DeCarli66, Smita Kittur67, Michael Borrie68, T. Y. Lee68, Rob Bartha68, Sterling Johnson69, Sanjay Asthana69, Cynthia M. Carlsson69, Steven G. Potkin70, Adrian Preda70, Dana Nguyen70, Pierre Tariot22, Adam Fleisher22, Stephanie Reeder22, Vernice Bates71, Horacio Capote71, Michelle Rainka71, Douglas W. Scharre72, Maria Kataki72, Anahita Adeli72, Earl A. Zimmerman73, Dzintra Celmins73, Alice D. Brown73, Godfrey D. Pearlson74, Karen Blank74, Karen Anderson74, Robert B. Santulli75, Tamar J. Kitzmiller75, Eben S. Schwartz75, Kaycee M. SinkS76, Jeff D. Williamson76, Pradeep Garg76, Franklin Watkins76, Brian R. Ott77, Henry Querfurth77, Geoffrey Tremont77, Stephen Salloway78, Paul Malloy78, Stephen Correia78, Howard J. Rosen4, Bruce L. Miller4, Jacobo Mintzer79, Kenneth Spicer79, David Bachman79, Elizabether Finger80, Stephen Pasternak80, Irina Rachinsky80, John Rogers55, Andrew Kertesz55, Dick Drost80, Nunzio Pomara81, Raymundo Hernando81, Antero Sarrael81, Susan K. Schultz82, Laura L. Boles Ponto82, Hyungsub Shim82, Karen Elizabeth Smith82, Norman Relkin83, Gloria Chaing83, Lisa Raudin83, Amanda Smith84, Kristin Fargher84, Balebail Ashok Raj84

4 UC San Francisco, San Francisco, CA, USA. 5 UC San Diego, San Diego, CA, USA. 6 Mayo Clinic, Rochester, NY, USA. 7 UC Berkeley, Berkeley, CA, USA. 8 U Pennsylvania, Pennsylvania, CA, USA. 9 USC, Los Angeles, CA, USA. 10 UC Davis, Davis, CA, USA. 11 Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA. 12 Indiana University, Bloomington, IN, USA. 13 Washington University St. Louis, St. Louis, MO, USA. 14 University of Pennsylvania, Philadelphia, PA, USA. 15 Janssen Alzheimer Immunotherapy, South San Francisco, CA, USA. 16 University of Washington, Seattle, WA, USA. 17 University of London, London, UK. 18 USC School of Medicine, Los Angeles, CA, USA. 19 UCSF MRI, San Francisco, CA, USA. 20 University of Michigan, Ann Arbor, MI, USA. 21 University of Utah, Salt Lake City, UT, USA. 22 Banner Alzheimer’s Institute, Phoenix, AZ, USA. 23 University of Pittsburgh, Pittsburgh, PA, USA. 24 UPenn School of Medicine, Philadelphia, PA, USA. 25 UC Irvine, Newport Beach, CA, USA. 26 Khachaturian, Radebaugh & Associates, Inc and Alzheimer’s Association’s Ronald and Nancy Reagan’s Research Institute, Chicago, IL, USA. 27 General Electric, Boston, MA, USA. 28 Brown University, Providence, RI, USA. 29 National Institute on Aging/National Institutes of Health, Bethesda, MD, USA. 30 Oregon Health and Science University, Portland, OR, USA. 31 University of Southern California, Los Angeles, CA, USA. 32 University of California San Diego, San Diego, CA, USA. 33 Baylor College of Medicine, Houston, TX, USA. 34 Columbia University Medical Center, New York, NY, USA. 35 Washington University, St. Louis, MO, USA. 36 University of Alabama Birmingham, Birmingham, MO, USA. 37 Mount Sinai School of Medicine, New York, NY, USA. 38 Rush University Medical Center, Chicago, IL, USA. 39 Wien Center, Vienna, Austria. 40 Johns Hopkins University, Baltimore, MD, USA. 41 New York University, New York, NY, USA. 42 Duke University Medical Center, Durham, NC, USA. 
43 University of Kentucky, city of Lexington, NC, USA. 44 University of Rochester Medical Center, Rochester, NY, USA. 45 University of California, Irvine, CA, USA. 46 University of Texas Southwestern Medical School, Dallas, TX, USA. 47 Emory University, Atlanta, GA, USA. 48 University of Kansas, Medical Center, Lawrence, KS, USA. 49 University of California, Los Angeles, CA, USA. 50 Mayo Clinic, Jacksonville, FL, USA. 51 Yale University School of Medicine, New Haven, CT, USA. 52 McGill Univ., Montreal Jewish General Hospital, Montreal, WI, USA. 53 Sunnybrook Health Sciences, Toronto, ON, Canada. 54 U.B.C. Clinic for AD & Related Disorders, British Columbia, BC, Canada. 55 Cognitive Neurology St. Joseph’s, Toronto, ON, Canada. 56 Cleveland Clinic Lou Ruvo Center for Brain Health, Las Vegas, NV, USA. 57 Northwestern University, Evanston, IL, USA. 58 Premiere Research Inst Palm Beach Neurology, West Palm Beach, FL, USA. 59 Georgetown University Medical Center, Washington, DC, USA. 60 Brigham and Women’s Hospital, Boston, MA, USA. 61 Stanford University, Santa Clara County, CA, USA. 62 Banner Sun Health Research Institute, Sun City, AZ, USA. 63 Boston University, Boston, MA, USA. 64 Howard University, Washington, DC, USA. 65 Case Western Reserve University, Cleveland, OH, USA. 66 University of California, Davis Sacramento, CA, USA. 67 Neurological Care of CNY, New York, NY, USA. 68 Parkwood Hospital, Parkwood, CA, USA. 69 University of Wisconsin, Madison, WI, USA. 70 University of California, Irvine BIC, Irvine, CA, USA. 71 Dent Neurologic Institute, Amherst, MA, USA. 72 Ohio State University, Columbus, OH, USA. 73 Albany Medical College, Albany, NY, USA. 74 Hartford Hosp, Olin Neuropsychiatry Research Center, Hartford, CT, USA. 75 Dartmouth Hitchcock Medical Center, Albany, NY, USA. 76 Wake Forest University Health Sciences, Winston-Salem, NC, USA. 77 Rhode Island Hospital, Rhode Island, USA. 78 Butler Hospital, Providence, RI, USA. 
79 Medical University South Carolina, Charleston, SC, USA. 80 St. Joseph’s Health Care, Toronto, Canada. 81 Nathan Kline Institute, Orangeburg, SC, USA. 82 University of Iowa College of Medicine, Iowa City, IA, USA. 83 Cornell University, Ithaca, NY, USA. 84 University of South Florida, USF Health Byrd Alzheimer’s Institute, Tampa, FL, USA.


References

1. World Health Organization. Dementia; 2021. Available from:
2. Centers for Disease Control and Prevention. Leading causes of death; 2022. Available from:
3. James BD, Bennett DA. Causes and Patterns of Dementia: An Update in the Era of Redefining Alzheimer’s Disease. Annual Review of Public Health. 2019;40(1):65–84. pmid:30642228
4. Breijyeh Z, Karaman R. Comprehensive Review on Alzheimer’s Disease: Causes and Treatment. Molecules. 2020;25(24). pmid:33302541
5. Munoz DG, Feldman H. Causes of Alzheimer’s disease. CMAJ. 2000;162(1):65–72. pmid:11216203
6. LaMontagne PJ, Benzinger TL, Morris JC, Keefe S, Hornbeck R, Xiong C, et al. OASIS-3: Longitudinal Neuroimaging, Clinical, and Cognitive Dataset for Normal Aging and Alzheimer Disease. medRxiv. 2019.
7. Malone IB, Cash D, Ridgway GR, MacManus DG, Ourselin S, Fox NC, et al. MIRIAD—Public release of a multiple time point Alzheimer’s MR imaging dataset. NeuroImage. 2013;70:33–36. pmid:23274184
8. Birkenbihl C, Westwood S, Shi L, Nevado-Holgado A, Westman E, Lovestone S, et al. ANMerge: A Comprehensive and Accessible Alzheimer’s Disease Patient-Level Dataset. Journal of Alzheimer’s Disease. 2021;79:423–431. pmid:33285634
9. Ellis KA, Bush AI, Darby D, De Fazio D, Foster J, Hudson P, et al. The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer’s disease. International Psychogeriatrics. 2009;21(4):672–687. pmid:19470201
10. Beekly DL, Ramos EM, van Belle G, Deitrich W, Clark AD, Jacka ME, et al. The National Alzheimer’s Coordinating Center (NACC) Database: An Alzheimer Disease Database. Alzheimer Disease & Associated Disorders. 2004;18:270–277. pmid:15592144
11. Zhang X, Mormino EC, Sun N, Sperling RA, Sabuncu MR, Yeo BT, et al. Bayesian model reveals latent atrophy factors with dissociable cognitive trajectories in Alzheimer’s disease. Proceedings of the National Academy of Sciences. 2016;113(42):E6535–E6544. pmid:27702899
12. Kivipelto M, Mangialasche F, Ngandu T. Lifestyle interventions to prevent cognitive impairment, dementia and Alzheimer disease. Nature Reviews Neurology. 2018;14:653–666. pmid:30291317
13. Mangialasche F, Solomon A, Winblad B, Mecocci P, Kivipelto M. Alzheimer’s disease: clinical trials and drug development. The Lancet Neurology. 2010;9(7):702–716. pmid:20610346
14. Cummings J, Lee G, Ritter A, Sabbagh M, Zhong K. Alzheimer’s disease drug development pipeline: 2019. Alzheimer’s & Dementia: Translational Research & Clinical Interventions. 2019;5:272–293. pmid:31334330
15. Rosenberg PB, Mielke MM, Appleby BS, Oh ES, Geda YE, Lyketsos CG. The Association of Neuropsychiatric Symptoms in MCI with Incident Dementia and Alzheimer Disease. The American Journal of Geriatric Psychiatry. 2013;21(7):685–695. pmid:23567400
16. Feldman H, Scheltens P, Scarpini E, Hermann N, Mesenbrink P, Mancione L, et al. Behavioral symptoms in mild cognitive impairment. Neurology. 2004;62(7):1199–1201. pmid:15079026
17. Grueso S, Viejo-Sobera R. Machine learning methods for predicting progression from mild cognitive impairment to Alzheimer’s disease dementia: a systematic review. Alzheimer’s Research & Therapy. 2021;13. pmid:34583745
18. Chen Y, Denny KG, Harvey D, Farias ST, Mungas D, DeCarli C, et al. Progression from normal cognition to mild cognitive impairment in a diverse clinic-based and community-based elderly cohort. Alzheimer’s & Dementia. 2017;13:399–405.
19. Peavy GM, Jacobson MW, Salmon DP, Gamst AC, Patterson TL, Goldman S, et al. The Influence of Chronic Stress on Dementia-related Diagnostic Change in Older Adults. Alzheimer Disease & Associated Disorders. 2012;26:260–266. pmid:22037597
20. Popuri K, Balachandar R, Alpert K, Lu D, Bhalla M, Mackenzie IR, et al. Development and validation of a novel dementia of Alzheimer’s type (DAT) score based on metabolism FDG-PET imaging. NeuroImage: Clinical. 2018;18:802–813. pmid:29876266
21. Yee E, Popuri K, Beg MF. Quantifying brain metabolism from FDG–PET images into a probability of Alzheimer’s dementia score. Human Brain Mapping. 2019. pmid:31507022
22. Rathore S, Habes M, Iftikhar MA, Shacklett A, Davatzikos C. A review on neuroimaging-based classification studies and associated feature extraction methods for Alzheimer’s disease and its prodromal stages. NeuroImage. 2017;155:530–548. pmid:28414186
23. Ocasio E, Duong TQ. Deep learning prediction of mild cognitive impairment conversion to Alzheimer’s disease at 3 years after diagnosis using longitudinal and whole-brain 3D MRI. PeerJ Computer Science. 2021;7:e560. pmid:34141888
24. Pavisic IM, Nicholas JM, O’Connor A, Rice H, Lu K, Fox NC, et al. Disease duration in autosomal dominant familial Alzheimer disease. Neurology Genetics. 2020;6(5). pmid:33225064
25. Mueller SG, Weiner MW, Thal LJ, Petersen RC, Jack CR, Jagust W, et al. Ways toward an early diagnosis in Alzheimer’s disease: The Alzheimer’s Disease Neuroimaging Initiative (ADNI). Alzheimer’s & Dementia. 2005;1:55–66.
26. Donohue MC, Sperling RA, Salmon DP, Rentz DM, Raman R, Thomas RG, et al. The Preclinical Alzheimer Cognitive Composite. JAMA Neurology. 2014;71:961. pmid:24886908
27. Donohue MC, Sperling RA, Petersen R, Sun CK, Weiner MW, Aisen PS, et al. Association Between Elevated Brain Amyloid and Subsequent Cognitive Decline Among Cognitively Normal Persons. JAMA. 2017;317:2305–2316. pmid:28609533
28. Olsson A, Vanderstichele H, Andreasen N, De Meyer G, Wallin A, Holmberg B, et al. Simultaneous measurement of beta-amyloid(1-42), total tau, and phosphorylated tau (Thr181) in cerebrospinal fluid by the xMAP technology. Clinical Chemistry. 2005;51:336–345. pmid:15563479
29. Jellinger KA, Janetzky B, Attems J, Kienzl E. Biomarkers for early diagnosis of Alzheimer disease: ‘ALZheimer ASsociated gene’- a new blood biomarker? Journal of Cellular and Molecular Medicine. 2008;12:1094–1117. pmid:18363842
30. Jack CR, Bernstein MA, Fox NC, Thompson P, Alexander G, Harvey D, et al. The Alzheimer’s disease neuroimaging initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging. 2008;27:685–691. pmid:18302232
31. Jack CR, Barnes J, Bernstein MA, Borowski BJ, Brewer J, Clegg S, et al. Magnetic resonance imaging in Alzheimer’s Disease Neuroimaging Initiative 2. Alzheimer’s & Dementia. 2015;11:740–756. pmid:26194310
32. Desikan RS, Ségonne F, Fischl B, Quinn BT, Dickerson BC, Blacker D, et al. An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage. 2006;31(3):968–980. pmid:16530430
33. Fischl B, van der Kouwe A, Destrieux C, Halgren E, Ségonne F, Salat DH, et al. Automatically Parcellating the Human Cerebral Cortex. Cerebral Cortex. 2004;14(1):11–22. pmid:14654453
34. Fischl B, Salat DH, van der Kouwe AJW, Makris N, Ségonne F, Quinn BT, et al. Sequence-independent segmentation of magnetic resonance images. NeuroImage. 2004;23(Supplement 1):S69–S84. pmid:15501102
35. Fischl B, Dale AM. Measuring the thickness of the human cerebral cortex from magnetic resonance images. Proceedings of the National Academy of Sciences of the United States of America. 2000;97(20):11050–11055. pmid:10984517
36. Fischl B, Liu A, Dale AM. Automated manifold surgery: constructing geometrically accurate and topologically correct models of the human cerebral cortex. IEEE Medical Imaging. 2001;20(1):70–80. pmid:11293693
37. Fischl B, Salat DH, Busa E, Albert M, Dieterich M, Haselgrove C, et al. Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron. 2002;33:341–355. pmid:11832223
38. Fischl B, Sereno MI, Tootell RBH, Dale AM. High-resolution intersubject averaging and a coordinate system for the cortical surface. Human Brain Mapping. 1999;8(4):272–284. pmid:10619420
39. Jovicich J, Czanner S, Greve D, Haley E, van der Kouwe A, Gollub R, et al. Reliability in multi-site structural MRI studies: Effects of gradient non-linearity correction on phantom and human data. NeuroImage. 2006;30(2):436–443. pmid:16300968
40. Kuperberg GR, Broome M, McGuire PK, David AS, Eddy M, Ozawa F, et al. Regionally localized thinning of the cerebral cortex in Schizophrenia. Archives of General Psychiatry. 2003;60:878–888. pmid:12963669
41. Rosas HD, Liu AK, Hersch S, Glessner M, Ferrante RJ, Salat DH, et al. Regional and progressive thinning of the cortical ribbon in Huntington’s disease. Neurology. 2002;58(5):695–701. pmid:11889230
  42. 42. Salat D, Buckner RL, Snyder AZ, Greve DN, Desikan RS, Busa E, et al. Thinning of the cerebral cortex in aging. Cerebral Cortex. 2004;14:721–730. pmid:15054051
  43. 43. Segonne F, Dale AM, Busa E, Glessner M, Salat D, Hahn HK, et al. A hybrid approach to the skull stripping problem in MRI. NeuroImage. 2004;22(3):1060–1075. pmid:15219578
  44. 44. Dale A, Fischl B, Sereno MI. Cortical Surface-Based Analysis: I. Segmentation and Surface Reconstruction. NeuroImage. 1999;9(2):179–194. pmid:9931268
  45. 45. Fischl B, Sereno MI, Dale A. Cortical Surface-Based Analysis: II: Inflation, Flattening, and a Surface-Based Coordinate System. NeuroImage. 1999;9(2):195–207. pmid:9931269
  46. 46. Han X, Jovicich J, Salat D, van der Kouwe A, Quinn B, Czanner S, et al. Reliability of MRI-derived measurements of human cerebral cortical thickness: The effects of field strength, scanner upgrade and manufacturer. NeuroImage. 2006;32(1):180–194. pmid:16651008
  47. 47. Sled JG, Zijdenbos AP, Evans AC. A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans Med Imaging. 1998;17:87–97. pmid:9617910
  48. 48. Segonne F, Pacheco J, Fischl B. Geometrically accurate topology-correction of cortical surfaces using nonseparating loops. IEEE Trans Med Imaging. 2007;26:518–529. pmid:17427739
  49. 49. Reuter M, Rosas HD, Fischl B. Highly Accurate Inverse Consistent Registration: A Robust Approach. NeuroImage. 2010;53(4):1181–1196. pmid:20637289
  50. 50. Reuter M, Fischl B. Avoiding Asymmetry-Induced Bias in Longitudinal Image Processing. NeuroImage. 2011;57(1):19–21. pmid:21376812
  51. 51. Reuter M, Schmansky NJ, Rosas HD, Fischl B. Within-Subject Template Estimation for Unbiased Longitudinal Image Analysis. NeuroImage. 2012;61(4):1402–1418. pmid:22430496
  52. 52. Jagust WJ, Bandy D, Chen K, Foster NL, Landau SM, Mathis CA, et al. The Alzheimer’s Disease Neuroimaging Initiative positron emission tomography core. Alzheimer’s & Dementia. 2010;6:221–229.
  53. 53. Jagust WJ, Landau SM, Koeppe RA, Reiman EM, Chen K, Mathis CA, et al. The Alzheimer’s Disease Neuroimaging Initiative 2 PET Core: 2015. Alzheimer’s & Dementia. 2015;11:757–771.
  54. 54. Hartig M, Truran-Sacrey D, Raptentsetsang S, Simonson A, Mezher A, Schuff N, et al. UCSF FreeSurfer Methods; 2014. Available from:
  55. 55. Campos S, Pizarro L, Valle C, Gray KR, Rueckert D, Allende H. Evaluating Imputation Techniques for Missing Data in ADNI: A Patient Classification Study. In: Pardo A, Kittler J, editors. Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Cham: Springer International Publishing; 2015. p. 3–10.
  56. 56. Wu Y, Zhang X, He Y, Cui J, Ge X, Han H, et al. Predicting Alzheimer’s disease based on survival data and longitudinally measured performance on cognitive and functional scales. Psychiatry Research. 2020;291:113201. pmid:32559670
  57. 57. Li D, Iddi S, Aisen PS, Thompson WK, Donohue MC. The relative efficiency of time-to-progression and continuous measures of cognition in presymptomatic Alzheimer’s disease. Alzheimer’s & Dementia: Translational Research & Clinical Interventions. 2019;5:308–318. pmid:31367671
  58. 58. Lin W, Tong T, Gao Q, Guo D, Du X, Yang Y, et al. Convolutional Neural Networks-Based MRI Image Analysis for the Alzheimer’s Disease Prediction From Mild Cognitive Impairment. Frontiers in Neuroscience. 2018;12. pmid:30455622
  59. 59. Pagani M, Nobili F, Morbelli S, Arnaldi D, Giuliani A, Öberg J, et al. Early identification of MCI converting to AD: a FDG PET study. European Journal of Nuclear Medicine and Molecular Imaging. 2017;44:2042–2052. pmid:28664464
  60. 60. Nozadi SH, Kadoury S. Classification of Alzheimer’s and MCI Patients from Semantically Parcelled PET Images: A Comparison between AV45 and FDG-PET. International Journal of Biomedical Imaging. 2018;2018:1–13. pmid:29736165
  61. 61. Mendelson AF, Zuluaga MA, Lorenzi M, Hutton BF, Ourselin S. Selection bias in the reported performances of AD classification pipelines. NeuroImage: Clinical. 2017;14:400–416. pmid:28271040
  62. 62. Yuan W, Beaulieu-Jones BK, Yu KH, Lipnick SL, Palmer N, Loscalzo J, et al. Temporal bias in case-control design: preventing reliable predictions of the future. Nature Communications. 2021;12.
  63. 63. Shishegar R, Cox T, Rolls D, Bourgeat P, Doré V, Lamb F, et al. Using imputation to provide harmonized longitudinal measures of cognition across AIBL and ADNI. Scientific Reports. 2021;11:23788. pmid:34893624