Deep learning to predict rapid progression of Alzheimer’s disease from pooled clinical trials: A retrospective study

The rate of progression of Alzheimer's disease (AD) differs dramatically between patients. Identifying the most rapid progressors (RPs) is critical because, when their numbers differ between treated and control groups, the outcome is distorted, making it impossible to tell whether the treatment was beneficial. Much recent effort, then, has gone into identifying RPs. We pooled de-identified placebo-arm data of three randomized controlled trials (RCTs), EXPEDITION, EXPEDITION 2, and EXPEDITION 3, provided by Eli Lilly and Company. After processing, the data included 1603 mild-to-moderate AD patients with 80 weeks of longitudinal observations on neurocognitive health, brain volumes, and amyloid-beta (Aβ) levels. RPs were defined by changes in four neurocognitive/functional health measures. We built deep learning models using recurrent neural networks with attention mechanisms to predict RPs by week 80 based on varying observation periods from baseline (e.g., 12, 28 weeks). Feature importance scores for RP prediction were computed, and temporal feature trajectories were compared between RPs and non-RPs. Our evaluation and analysis focused on models trained with 28 weeks of observation. The models achieved robust internal-validation areas under the receiver operating characteristic curve (AUROCs) ranging from 0.80 (95% CI 0.79–0.82) to 0.82 (0.81–0.83), and areas under the precision-recall curve (AUPRCs) from 0.34 (0.32–0.36) to 0.46 (0.44–0.49). External-validation AUROCs ranged from 0.75 (0.70–0.81) to 0.83 (0.82–0.84) and AUPRCs from 0.27 (0.25–0.29) to 0.45 (0.43–0.48). Aβ plasma levels, regional brain volumetry, and neurocognitive health emerged as important factors for the model prediction. In addition, the trajectories were stratified between predicted RPs and non-RPs based on factors such as ventricular volumes and neurocognitive domains.
Our findings will greatly aid clinical trialists in designing tests for new medications, representing a key step toward identifying effective new AD therapies.


Introduction
Alzheimer's disease (AD) is the sixth leading cause of death in the United States, affecting six million Americans aged 65 and older [1]. Potential therapeutic targets have been sought, leading to various pharmaceutical interventions tested in randomized clinical trials (RCTs). Rapid progressors (RPs) are a subset of AD patients experiencing faster rates of cognitive decline than average AD patients within a defined period of time. Including RPs in RCTs presents challenges in accurately detecting treatment effects on the primary efficacy outcomes [2,3], potentially affecting AD clinical trials like EMERGE and ENGAGE, which investigated aducanumab [4]. Identifying RPs before planning clinical trials, and creating tailored trials, can enable focused therapy development.
Predicting rapid progression, however, is challenging, as no universally agreed definition exists [5,6]. For instance, the FDA recently recognized a definition for the CDR-SB that requires a change of over eight points within 18 months [7], but the justification is not well described.
The placebo arms of randomized clinical trial (RCT) data can complement prior studies of AD progression rates. In this study, we hypothesized that longitudinal changes during the initial months of observation could be used to extrapolate subsequent progression rates. We leveraged longitudinal observations from the placebo arms of three RCTs enrolling mild or mild-to-moderate AD patients (EXPEDITION [18], EXPEDITION 2 [18], and EXPEDITION 3 [19]) to predict RPs. We developed deep recurrent neural networks with attention mechanisms to flexibly learn from multimodal information and predicted rapid progression on four commonly used tests of neurocognition and/or functional impairment: the 14-item cognitive subscale of the Alzheimer's Disease Assessment Scale (ADAS-Cog14) [20,21], the Alzheimer's Disease Cooperative Study Activities of Daily Living Inventory (ADCS-ADL) [22], the Clinical Dementia Rating Scale-Sum of Boxes (CDR-SB) [23,24], and the Mini-Mental State Examination (MMSE) [25]. We used early observations (the initial 28 weeks, based on available data) to predict RPs at week 80, analyzed important features, and plotted AD progressive trajectories to capture significant predictors aligned with clinical knowledge. Our work contributes to the field by (1) building models based on placebo-arm RCT data to predict four definitions of RPs; (2) accurately predicting RPs using early observations (28 weeks) in RCTs; (3) identifying the most salient features for our models to predict RPs; and (4) stratifying AD progressive trajectories using data-driven methods.

Ethical approval
The UTHealth Institutional Review Board and the Committee for the Protection of Human Subjects (CPHS) reviewed this study, classifying it as "non-human subjects research" due to the use of de-identified retrospective data. Therefore, the study was approved with waivers of HIPAA authorization and informed consent.

Overview
We employed deep learning models to predict RPs at week 80 using pooled RCT data, including demographics, comorbidities, baseline conditions, and the initial observations from baseline (Fig 1). We performed internal and external validation, analyzed relative feature importance, and visualized the average changes of important variables between true-positive (RP) and true-negative (non-RP) subjects over time to evaluate the models and substantiate clinical insights. The code for building, training, and evaluating the models can be found on GitHub (https://github.com/xiaotian0328/AD-Rapid-Progression-Prediction).

Variables
Since the cause of AD is unclear and many possible factors may contribute to its development, we opted to include as many variables as were relevant to predict RPs. We employed deep learning models to decide on significant variables instead of assuming their importance a priori. Hence, our models integrated 151 variables (S1 Table), such as demographics, comorbidities, neurocognitive measurements, imaging results, and amyloid-beta (Aβ) ratios. The baseline conditions included 22 demographic and 42 comorbid variables, while 87 variables had longitudinal measurements from baseline to 80 weeks. The longitudinal variables included (1) 73 neurocognition, neuropsychiatric symptom, and daily functioning measurements; (2) 6 brain regional volume measurements on MRI (e.g., volumes of hippocampi, entorhinal cortices, ventricles, and whole brain); (3) 1 amyloid PET imaging variable for the composite summary standard uptake value ratios (SUVRs); (4) 1 amyloid-beta (Aβ) plasma level ratio variable; and (5) 6 vital signs and anthropometry variables. The longitudinal variables contained both baseline measurements and subsequent measurements at later weeks (i.e., weeks 4, 12, 28, 40, 52, 64, and 80). We aggregated components of the three neurocognitive tests into 6 cognitive domains: complex attention, executive function, language, learning and memory, perceptual-motor function, and social cognition (S2 Table).
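As an illustration, the aggregation step can be sketched as a simple lookup-and-sum. The item names and the item-to-domain mapping below are hypothetical placeholders; the actual assignments are specified in S2 Table.

```python
# Sketch: aggregate individual test components into cognitive domains.
# The item-to-domain mapping is illustrative only; the real assignments
# are given in S2 Table of the paper.
ITEM_TO_DOMAIN = {
    "adas_word_recall": "learning_and_memory",   # hypothetical item keys
    "adas_naming": "language",
    "mmse_attention_calc": "complex_attention",
    "mmse_recall": "learning_and_memory",
}

def aggregate_domains(item_scores):
    """Sum component scores into domain-level scores."""
    domains = {}
    for item, score in item_scores.items():
        domain = ITEM_TO_DOMAIN.get(item)
        if domain is not None:
            domains[domain] = domains.get(domain, 0.0) + score
    return domains

scores = {"adas_word_recall": 4.0, "mmse_recall": 2.0, "adas_naming": 1.0}
print(aggregate_domains(scores))
```

The resulting domain scores then enter the models as ordinary longitudinal features.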

Data preprocessing
Since the three trials shared similar eligibility criteria and objectives, we pooled and preprocessed the clinical data collectively. We reviewed the respective protocols [18,19] and case report forms (CRFs) to confirm data collection similarities. Variable names were manually matched and different units were harmonized before longitudinal observations were processed.
Upon meeting the inclusion criteria, patients were enrolled and initial visits (week 0) were conducted. Variable values from the first encounters (screening or week 0) were set as baseline conditions, including time-variant features such as cognitive tests, MRI results, etc. Regular visits continued at least every four weeks for 80 weeks, and we observed longitudinal biomarker changes in brain volumetry and the plasma Aβ1-42/Aβ1-40 ratio, as well as other markers such as cognitive/psychological scores. Longitudinal variable measurements were collected at weeks 0, 4, 12, 28, 40, 52, 64, and 80 (referred to as "observation periods" and cumulative; week 12 includes data from weeks 0, 4, and 12; week 28 includes data from weeks 0, 4, 12, and 28; etc.). The intervals between observation periods were prespecified in the structured protocols of the EXPEDITION trials from which our data were sourced. These observation periods were designed with clinical relevance, as longer and later intervals reflect the expected gradual progression of AD. Hence, some time points were unavailable. We normalized the volumetric measurements of distinct brain regions by total brain volume before applying a min-max normalization to the brain volumes within each trial, to mitigate potential discrepancies arising from the varying methodologies employed for volume quantification [27]. For time-invariant variables (i.e., demographics, comorbidities, etc.), missing values were imputed using Multiple Imputation by Chained Equations (MICE) [28] for demographics and baseline conditions, and by the most common class for comorbidities. Missing values in longitudinal variables were also imputed by MICE, assuming missingness at random. We avoided temporal data leakage by imputing data only within the observed periods according to the specific observation period for each model. S1 Appendix and S1 Fig report details regarding how we evaluated and selected from several imputation strategies. Finally, we standardized numeric variables to zero mean and unit variance.
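The imputation and normalization steps above can be sketched with scikit-learn, whose `IterativeImputer` implements MICE-style chained-equation imputation. The toy matrix and dimensions are illustrative, not the study's actual variables.

```python
# Sketch of the preprocessing pipeline described above (scikit-learn).
# Toy data only; the real pipeline operates per trial and per variable group.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X[rng.random(X.shape) < 0.1] = np.nan            # introduce ~10% missingness

# 1) MICE-style imputation under a missing-at-random assumption.
X_imp = IterativeImputer(random_state=0).fit_transform(X)

# 2) Min-max normalization within a trial (here: the whole toy matrix),
#    as applied to the brain-volume variables.
X_mm = MinMaxScaler().fit_transform(X_imp)

# 3) Standardize numeric variables to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X_imp)

print(np.isnan(X_imp).sum(), X_mm.min(), X_mm.max())
```

In the actual study, imputation was restricted to the observed period of each model to avoid temporal leakage.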

Observation periods
We trained prediction models with data from the first T weeks (T = 4, 12, 28, 40, 52, 64) (Fig 1), comparing them to models using baseline measurements only. The models used varying observation periods and longitudinal variables (see "Data preprocessing") to determine cognitive progression.

Prediction model
We developed rapid progression prediction models using the first N observations (corresponding to the first T weeks) of the longitudinal data, combining the gated recurrent unit (GRU) [29] and an attention mechanism for the longitudinal inputs (Fig 1). The GRU is a gated mechanism used in recurrent neural networks, with an update gate and a reset gate that control the flow of information from the previous hidden state and the current inputs. In addition, the attention mechanism can learn to assign larger weights to more informative time steps (S1 Appendix, S3 Fig). Demographics, comorbidities, baseline values, and changes from baseline formed the n-th sequential input. For comparison, we implemented simple two-layer multilayer perceptrons (MLPs) as baseline models, using only demographics, comorbidities, and baseline measurements of the longitudinal variables. S1 Appendix provides more details about the models. We chose to train and test our models using the 28-week observation period, as it is the earliest time point with sufficient data to predict RPs. This allows us to effectively capture early-stage trial data and accurately predict RPs at week 80.
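A minimal sketch of this architecture, assuming standard GRU update equations and additive attention pooling, is shown below in plain NumPy. The dimensions and random weights are illustrative only; the authors' actual implementation is in the linked GitHub repository.

```python
# Sketch: single-layer GRU over N sequential inputs, additive attention
# pooling over the hidden states, and a sigmoid output for RP probability.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, N = 8, 16, 4                      # input dim, hidden dim, time steps

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# GRU parameters (update gate z, reset gate r, candidate state h~)
Wz, Uz = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wr, Ur = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
Wh, Uh = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)             # update gate
    r = sigmoid(Wr @ x + Ur @ h)             # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h)) # candidate state
    return (1 - z) * h + z * h_tilde

# Additive attention over the hidden states, then a linear output layer.
Wa, va = rng.normal(size=(d_h, d_h)), rng.normal(size=d_h)
w_out = rng.normal(size=d_h)

def predict(seq):                            # seq: (N, d_in)
    h, states = np.zeros(d_h), []
    for x in seq:
        h = gru_step(x, h)
        states.append(h)
    H = np.stack(states)                     # (N, d_h)
    scores = np.tanh(H @ Wa.T) @ va          # one score per time step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # attention weights, sum to 1
    context = alpha @ H                      # weighted sum of hidden states
    return sigmoid(w_out @ context), alpha

p, alpha = predict(rng.normal(size=(N, d_in)))
```

The attention weights `alpha` are what allow the trained model to emphasize the more informative visits when producing the RP probability `p`.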

Evaluation
We evaluated the models using internal and external hold-out validation combined with 10-fold cross-validation. We split the pooled data (n = 1603) into a training set for 10-fold cross-validation (n = 1282, 80%) and a hold-out test set (n = 321, 20%). For external validation, we set aside the EXPEDITION 3 trial as the hold-out test set (n = 844), considering its different inclusion criteria, and treated the pooled EXPEDITION and EXPEDITION 2 trials (n = 759) as the training set for 10-fold cross-validation. We then split the training sets in both internal and external validation into 10 folds, each time reserving one fold for validation to select the best models and using the remaining nine folds for model training.
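The two validation schemes can be sketched with standard scikit-learn splitting utilities; the arrays and trial labels below are toy stand-ins for the pooled data.

```python
# Sketch of the internal and external validation splits described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

rng = np.random.default_rng(0)
n = 1603
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.10).astype(int)               # ~10% positive rate
trial = rng.choice(["EXP", "EXP2", "EXP3"], size=n)  # illustrative labels

# Internal validation: 80/20 split of the pooled data, stratified by label.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# External validation: hold out one trial entirely as the test set.
ext_test = trial == "EXP3"
X_ext_tr, y_ext_tr = X[~ext_test], y[~ext_test]

# 10-fold CV on the training portion; one fold per round is used for
# validation/model selection, the other nine for training.
folds = list(StratifiedKFold(n_splits=10, shuffle=True,
                             random_state=0).split(X_tr, y_tr))
```

Stratifying by the RP label keeps the ~10% positive rate roughly constant across folds, which matters for so imbalanced an outcome.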
We assessed prediction accuracy using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) (Fig 2). The AUROC measures the probability that the predicted scores for ground-truth positive samples are larger than those for negative samples, with 0.5 as the baseline corresponding to a random classifier. The AUPRC measures the trade-off between precision (fraction of true positives among predicted positives) and recall (fraction of true positives among ground-truth positives); a higher AUPRC indicates the model can identify most of the positive samples without classifying too many negative samples as positive. The baseline value for the AUPRC is the positive rate of the data, approximately 10% in our case. We reported the averaged AUROCs and AUPRCs over 10 folds for all trained models and calculated their 95% confidence intervals via 1.96 × SD/√n, where SD is the standard deviation of a metric over the 10 folds and n = 10 is the number of folds. To further ensure that our probabilistic predictions align well with the actual outcome distributions and that our models are reliable for clinical use, we calibrated the model predictions and report the calibration method and results in S1 Appendix and S4 Fig.
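These metrics and the confidence-interval formula can be sketched as follows; the toy labels and fold-level scores are illustrative, not the study's results.

```python
# Sketch: AUROC/AUPRC via scikit-learn, plus the 95% CI used above,
# 1.96 * SD / sqrt(n), over n = 10 cross-validation folds.
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)
y_true = (rng.random(321) < 0.10).astype(int)        # ~10% positive rate
y_score = 0.3 * y_true + rng.random(321) * 0.7       # weakly informative scores

auroc = roc_auc_score(y_true, y_score)
auprc = average_precision_score(y_true, y_score)     # baseline ~= positive rate

def mean_ci95(fold_metrics):
    """Mean and 95% CI (1.96 * SD / sqrt(n)) over per-fold metrics."""
    m = np.asarray(fold_metrics, dtype=float)
    half = 1.96 * m.std() / np.sqrt(len(m))
    return m.mean(), m.mean() - half, m.mean() + half

mean, lo, hi = mean_ci95([0.80, 0.82, 0.81, 0.79, 0.83,
                          0.80, 0.82, 0.81, 0.80, 0.82])
```

Reporting both metrics matters here because with ~10% positives a high AUROC can coexist with a modest AUPRC.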
To further evaluate our models' binary prediction performance, we also display the confusion matrices and F1 scores for all models (Fig 3). The confusion matrix displays the numbers of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP) in a 2×2 matrix, while the F1 score is the harmonic mean of precision and recall, calculated as F1 = 2 × precision × recall / (precision + recall). As we ran 10-fold cross-validation, we selected the classification threshold for each fold to trade off sensitivity and specificity on the validation set. Specifically, a threshold was selected when sensitivity exceeded specificity by a margin (set to 0.15 in our case), ensuring good sensitivity with reasonable specificity. We then had 10 binary predictions on the hold-out test set based on the selected thresholds, and the final binary prediction for each model was determined by the majority over the 10 predictions. The confusion matrices and F1 scores were computed based on the final predictions, with confusion matrices normalized over rows to show specificities in the upper left boxes and sensitivities in the lower right boxes (Fig 3).
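The thresholding and majority-voting procedure can be sketched as follows, with toy scores; `pick_threshold` is an illustrative helper name, not from the authors' code.

```python
# Sketch: per fold, pick the threshold where sensitivity exceeds
# specificity by a 0.15 margin on the validation set, then majority-vote
# the 10 folds' binary predictions on the held-out test set.
import numpy as np
from sklearn.metrics import f1_score, roc_curve

rng = np.random.default_rng(0)

def pick_threshold(y_val, s_val, margin=0.15):
    fpr, tpr, thr = roc_curve(y_val, s_val)
    spec = 1 - fpr
    ok = tpr >= spec + margin          # sensitivity >= specificity + margin
    return thr[ok][0] if ok.any() else 0.5

y_test = (rng.random(200) < 0.10).astype(int)
fold_preds = []
for _ in range(10):                    # one threshold per CV fold
    y_val = (rng.random(200) < 0.10).astype(int)
    s_val = 0.4 * y_val + rng.random(200) * 0.6
    t = pick_threshold(y_val, s_val)
    s_test = 0.4 * y_test + rng.random(200) * 0.6
    fold_preds.append((s_test >= t).astype(int))

# Final prediction = majority vote over the 10 folds' binary predictions.
y_pred = (np.mean(fold_preds, axis=0) >= 0.5).astype(int)
f1 = f1_score(y_test, y_pred)
```

The margin-based rule deliberately favors sensitivity, which is the sensible trade-off when the goal is to catch rapid progressors before trial enrollment.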

Interpretation
We calculated feature contributions to RP prediction using SHapley Additive exPlanations (SHAP) [30] values (Fig 4, S1 Appendix) and grouped features into seven modalities: demographics; comorbidities and medications; plasma levels of Aβ; brain regional volumes; cognitive tests (ADAS-Cog14, CDR, MMSE); quality of life (ADCS-ADL, EQ-5D, QoL-AD); and neuropsychiatric symptoms (NPI). SHAP assigns an importance score to each variable, indicating how much it contributes to the model's prediction for a specific instance. We report the mean absolute SHAP values over 10 folds and calculated the 95% confidence intervals as for the AUROCs and AUPRCs. We also investigated modality- and feature-level interpretations using the 28-week period and visualized patient progression for important features, comparing predicted-positive (RP) and predicted-negative (non-RP) samples, with ground-truth positives and ground-truth negatives as references (Fig 5, S1 Appendix). The thresholds used to determine positive and negative samples were selected by the method described in the "Evaluation" section.
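The modality-level aggregation can be sketched as follows. The per-feature mean |SHAP| values and the feature-to-modality assignments below are illustrative placeholders; the real values come from applying the shap package to the trained models.

```python
# Sketch: roll per-feature mean |SHAP| values up into the seven modality
# scores. All numbers and assignments are toy placeholders.
feature_to_modality = {                  # hypothetical assignments
    "age": "demographics",
    "diabetes": "comorbidities_medications",
    "abeta_42_40_ratio": "plasma_abeta",
    "ventricle_volume": "brain_volumes",
    "mmse_recall": "cognitive_tests",
    "adcs_adl_total": "quality_of_life",
    "npi_total": "neuropsychiatric",
}
# mean |SHAP| per feature, averaged over 10 folds (toy numbers)
mean_abs_shap = {
    "age": 0.01, "diabetes": 0.002, "abeta_42_40_ratio": 0.08,
    "ventricle_volume": 0.05, "mmse_recall": 0.06,
    "adcs_adl_total": 0.03, "npi_total": 0.015,
}

modality_importance = {}
for feat, val in mean_abs_shap.items():
    mod = feature_to_modality[feat]
    modality_importance[mod] = modality_importance.get(mod, 0.0) + val

top = max(modality_importance, key=modality_importance.get)
```

In the study, this modality ranking is what identified the plasma Aβ ratio, brain volumetry, and neurocognitive measures as the dominant signal groups.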

Data
After preprocessing and removing 494 patients with missing observations, we finalized a pooled sample of 1,603 patients from EXPEDITION (n = 368), EXPEDITION 2 (n = 391), and EXPEDITION 3 (n = 844). Patient characteristics, lab values, and scans are summarized in Table 1.

Rapid progressors
As noted under Methods, RP thresholds were defined as the greatest 10th percentile of changes in cognitive scores from baseline to week 80: ADAS-Cog14 total change > 22 (n = 147, 9.2%), ADCS-ADL total change < -24 (n = 152, 9.5%), CDR-SB total change > 6 (n = 150, 9.4%), or MMSE total change < -10 (n = 145, 9.0%) (S3 Table). After careful consideration of all relevant data per participant, 305 of 1603 patients were classified as rapid progressors by at least one of the four definitions. Under a Gaussian mixture model, the 10% thresholds lie on the tail of the major Gaussian component, with a probability density close to 0 (S1 Fig).
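This worst-decile labeling rule can be sketched as follows for one scale; the simulated score changes are illustrative only. For ADAS-Cog14 an increase means decline, so RPs fall above the 90th percentile of change.

```python
# Sketch of the RP labeling rule: greatest 10% of baseline-to-week-80
# change on a scale where higher change means faster decline. Toy data.
import numpy as np

rng = np.random.default_rng(0)
adas_change = rng.normal(loc=8, scale=7, size=1603)   # week 80 minus baseline

threshold = np.percentile(adas_change, 90)            # greatest 10% of changes
is_rp = adas_change > threshold

print(threshold, is_rp.mean())                        # positive rate ~ 10%
```

For scales where decline means a decreasing score (ADCS-ADL, MMSE), the same rule applies with the 10th percentile and a `<` comparison.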
Prediction accuracies for the four RP definitions varied. Models predicting ADAS-Cog14-defined RPs using week 28 data had the highest AUROCs and better AUPRCs than the other definitions. In contrast, models using only baseline data or data up to week 12 performed better when RP was defined by the change in the MMSE. However, observations up to weeks 4 or 12 did not provide enough information for an accurate prediction, as AUROCs and AUPRCs at these time points were below a threshold deemed sufficient for clinical utility. Clinically, observations at week 40 and beyond are too late in the RCT to be meaningful. Therefore, the following analysis and discussion focus on observation periods up to 28 weeks.
In addition, our models reached F1 scores ranging from 0.29 to 0.56 overall on the internal hold-out test set, while the models using data up to week 28 had F1 scores from 0.35 to 0.41 (Fig 3). The confusion matrices of the 28-week models show sensitivities ranging from 0.74 to 0.88 and specificities ranging from 0.7 to 0.8, indicating good stratification between RPs and non-RPs.
We visualized the SHAP values by modalities, and the important variables within these modalities, that contribute to the models' predictions using week 28 data (Fig 4). The most important modalities were the plasma Aβ1-42/Aβ1-40 ratio, brain volumetry, and neurocognitive measures. Comorbidities and medications were the least important. Dysfunction in important cognitive domains such as complex attention, learning and memory, and language contributed to the prediction of RPs (Fig 4B, 4E, 4H, 4K), with most of the important functions being those measured by the MMSE. Brain ventricle volume and left hippocampal volume were among the top three most important imaging biomarkers in all the models (Fig 4C, 4F, 4I, 4L), while brain ventricle volume was the most important volumetric feature for the ADCS-ADL, CDR-SB, and MMSE models.
We selected the three most common and important cognitive domains (complex attention calculated from ADAS-Cog14, language calculated from MMSE, and learning and memory calculated from MMSE), the brain volumetry measurements (left hippocampus and brain ventricle), and the plasma Aβ1-42/Aβ1-40 ratio for each model, and report the means and confidence intervals of the longitudinal changes in those features from baseline to week 80 (Fig 5). We focused on the divergence between predicted positives (predicted by the model as RP) and predicted negatives (predicted by the model as non-RP), with ground-truth positives and ground-truth negatives as references. Fig 5, then, shows an ideal match between the predicted-negative and ground-truth-negative groups. However, the predicted-positive groups are generally underfitted to the ground-truth-positive groups. Nonetheless, Fig 5 still displays an overall clear stratification between the predicted positive and negative groups.

Discussion
We developed models to identify rapid progressors (RPs) using pooled data from completed Phase III RCTs of mild-to-moderate AD.The purpose of these models was to predict rapid symptom progression in patients.We found that the plasma Aβ 1-42 /Aβ 1-40 ratio, brain volumetry, and specific neurocognitive measurements were the most important features within and across models.Our models highlight key features that support previous evidence.
Our longitudinal models achieved higher accuracies with longer observation periods. The 12-week model had accuracy similar to the baseline model, suggesting that 12 weeks might be insufficient for identifying RPs. Longitudinal models with 28-week or longer observation periods had higher accuracies, with the potential for even greater accuracy with more observations. We also observed that the magnitude of variable changes captured progression better than absolute values at any given time point [31].
Figs 4 and 5 show that the three most important predictors of RP were the plasma Aβ1-42/Aβ1-40 ratio, brain volumetry, and specific areas of cognitive decline (e.g., complex attention, learning and memory, language), corroborating the National Institute on Aging-Alzheimer's Association's ATN (amyloid, tau, and neurodegeneration) classification [32]. As other studies have shown, plasma levels of Aβ correlate with the presence of CSF Aβ [33], and the Aβ1-42/Aβ1-40 plasma level ratio is a robust measurement that correlates with the presence of amyloid plaques, an important indicator for AD diagnosis [34]. The clinical presentation of AD progression is reflected in cognitive decline, while brain atrophy reflects neurodegeneration, as indicated by changes in brain volumetry [35].
Plasma Aβ levels had an impact on predicting RPs (Fig 4A, 4D, 4G, 4J), with predicted-positive RPs showing increased fluctuation in the longitudinal progression of the plasma Aβ1-42/Aβ1-40 ratio (Fig 5U–5X). This indicates its potential utility in differentiating rapidly progressing AD patients from those with a slower progression.
An additional biomarker identified by the models as important for RP prediction was changes in the volume of the hippocampi, atrophy of which is closely associated with cognitive impairment [36].Furthermore, the most important brain volume predictor for RP in ADCS-ADL, CDR-SB, and MMSE models was the trajectory of ventricular volumes, which is consistent with the compensatory enlargement of the ventricles caused by the atrophy and volume loss of brain tissue in neurodegenerative disorders such as AD (i.e., hydrocephalus ex vacuo) [37].The trajectories of brain volumetry (Fig 5M -5T) were suggestive of slightly lower hippocampal volumes and greater ventricular volumes among predicted positive RPs compared to predicted negative RPs, but no significant divergence between these groups was observed (even between groups of ground-truth positive and ground-truth negative RPs), which warrants further investigation.
Lastly, we evaluated the importance of neurocognitive domains for the four cognitive test definitions (Fig 4B , 4E, 4H, 4K) and aimed to identify the most crucial domains and tests for early detection of RPs.A comprehensive analysis of these domains and tests will be presented in a follow-up study.
Overall, AD progression prediction has been challenging. The TADPOLE challenge [17], a community effort using Alzheimer's Disease Neuroimaging Initiative (ADNI) data, aimed to develop a model predicting AD progression; however, none of the algorithms proved sufficient for prediction. A recent publication [38] employed machine learning techniques on ADNI and clinical trial datasets to predict longitudinal progression using the CDR-SB; however, the CDR is more labor-intensive than other cognitive tests. Our study shows that the MMSE is, for the most part, more consistent in AUROCs and AUPRCs, and supports the assertion that other tests, or even a combination of tests, may be more sensitive at detecting and predicting progression.
Our study also developed an attention-based GRU that integrated a wide range of multimodal temporal data and harnessed the advantages of pooled RCT data to achieve better AUROCs and AUPRCs, providing a framework for clinicians and clinical trialists to screen patients for appropriate drug trials. Moreover, by identifying RPs and distinguishing them from non-RPs, we can enhance the balance between placebo and treated groups, allowing for more accurate assessments of treatment efficacy and potentially improving the development of targeted interventions for this unique patient population.
Limitations of this study include the inability to consider certain biological variables, due to the complexities and uncertainties surrounding AD's pathoetiologies [5], as well as defining RPs based on clinical neuropsychological tests and structural imaging instead of biological or neurocognitive evidence. Differences in clinical trial inclusion and exclusion criteria could also influence the study population. For example, EXPEDITION and EXPEDITION 2 did not require CSF collection or amyloid PET for study eligibility, and as a result, some patients may have had other etiologies for their dementia. This is important because we do not rely on population statistics and trends but rather on individual patient-level RCT data. Additionally, the systematic difference in data distribution by source may limit the generalizability of our model to other RCTs with similar inclusion and exclusion criteria; it is possible that clinical trials have a batch effect based on eligibility criteria and study sites. Future studies will need to utilize datasets based on clinical (e.g., EMR) rather than RCT data.
To mitigate reproducibility issues, we performed external validation (with data from EXPEDITION 3 as the external hold-out test set). Moreover, the analysis used in our model is intended to capture components of standard protocols for clinical trials, in order to ensure generalizability to other clinical trials' data. The main objective of this study was to develop a predictive model using the clinical trial gold standard of available data, placebo arms, that can be extrapolated to data sources outside RCTs.
Furthermore, we propose potential future directions worth exploring. (1) Survival analysis could be explored by treating RP as a time-to-event variable, implementing a dynamic prediction model that can be updated over time [39], and evaluating the models by time-dependent AUROC. (2) More imputation methods could be explored for our baseline and longitudinal data, such as the data-driven missing value imputation approach [40], HyperImpute [41], or integrating the data imputation module into the prediction models [16]. (3) Adapted supervised machine learning algorithms [42,43] could be investigated and compared to our neural network models in terms of performance and efficiency for longitudinal classification. (4) The model could be tuned in the future to include and assess genetic components that are predictive of rapid progression early on [44,45]. (5) To capture more variation in the data, we could include more cohorts of AD trials or clinical data (e.g., ADNI datasets) for internal and external validation, as well as implement multi-modal, multi-task deep learning models on different cohorts. Additionally, more advanced deep learning architectures, such as autoencoders and transformers, could be used to capture inherent variations in the tabular data.

Conclusion
Our research represents a breakthrough in predicting the rapid progression of AD.By analyzing pooled placebo-arm data from three randomized controlled trials, we developed advanced deep learning models capable of accurately identifying rapidly progressive AD cases.
A critical discovery of our research was the identification of key biomarkers and neurocognitive health indicators that are instrumental in predicting RPs.These include the plasma level Aβ 1-42 /Aβ 1-40 ratio, regional brain volumetry, and specific areas of cognitive decline.These findings are important in the context of AD therapy development, as they might empower researchers to design clinical trials that more effectively account for the diverse ways in which AD symptoms can manifest and progress in patients.
In summary, our research represents a significant advancement in understanding and predicting the rapid progression of AD.The utilization of deep learning models derived from comprehensive data across multiple trials has enabled us to identify crucial factors in AD progression.This enhances our ability to predict which patients are more likely to experience rapid progression.The implications of these findings are profound, paving the way for more targeted and effective clinical trials.This, in turn, promotes the development of AD therapies that are better suited to address the varied trajectories of symptom progression in patients, ultimately leading to more effective treatment strategies for this complex and very distressing disorder.

Fig 1. Study overview for RP prediction. Our cohort consisted of 2097 patients from 3 randomized controlled trials: EXPEDITION, EXPEDITION 2, and EXPEDITION 3. After pooling the trials and performing data processing, there were 1603 patients in the cohort. We compared the results of models using four different definitions of RP, which were based on changes in scores on ADAS-Cog14, ADCS-ADL, MMSE, and CDR-SB from baseline to week 80. We compared various observation periods from baseline up to week 80. We used the GRU with attention mechanism to learn the likelihood of RP from longitudinal and multimodal observations. After training and evaluating the models, we calculated feature importance to determine important modalities and features for prediction of RPs, and we plotted the trajectories of selected variables among RPs. CDR-SB = Clinical Dementia Rating Scale-Sum of Boxes; ADAS-Cog14 = 14-item cognitive subscale of the Alzheimer's Disease Assessment Scale; MMSE = Mini-Mental State Examination; ADCS-ADL = Alzheimer's Disease Cooperative Study-Activities of Daily Living Inventory; GRU = Gated Recurrent Unit; FC = Fully Connected layer; RP = Rapid Progression; SHAP = SHapley Additive exPlanations. https://doi.org/10.1371/journal.pdig.0000479.g001

Fig 2. AUROCs and AUPRCs of RP prediction models using different definitions and observation periods (including baselines). Results include two validation modes: internal (trained and tested with pooled data) and external (trained with EXPEDITION and EXPEDITION 2 and tested with EXPEDITION 3). We trained the models with 10-fold cross-validation, tested the trained models on the hold-out datasets for both evaluation settings, and report mean AUROCs and AUPRCs over the 10 folds. The 95% confidence intervals were calculated by 1.96 × SD/√n, where SD is the standard deviation over 10 folds and n = 10 is the number of folds. https://doi.org/10.1371/journal.pdig.0000479.g002

Fig 3. Confusion matrices and F1 scores of RP prediction models using different definitions and observation periods. Results include longitudinal models with internal validation. Confusion matrices are normalized over rows, and corresponding F1 scores are reported under the matrices. Each normalized confusion matrix displays specificity in the upper left box and sensitivity in the lower right box. https://doi.org/10.1371/journal.pdig.0000479.g003