Regression techniques employing feature selection to predict clinical outcomes in stroke

It is not fully clear which measurable factors can reliably predict chronic stroke patients’ recovery of motor ability. In this analysis, we investigate the impact of patient demographic characteristics, movement features, and a three-week upper-extremity intervention on the post-treatment change in two widely used clinical outcomes—the Upper Extremity portion of the Fugl-Meyer and the Wolf Motor Function Test. Models based on LASSO, which in validation tests account for 65% and 86% of the variability in Fugl-Meyer and Wolf, respectively, were used to identify the set of salient demographic and movement features. We found that age, affected limb, and several measures describing the patient’s ability to efficiently direct motions with a single burst of speed were the most consequential in predicting clinical recovery. On the other hand, the upper-extremity intervention was not a significant predictor of recovery. Beyond a simple prognostic tool, these results suggest that focusing therapy on the more important features is likely to improve recovery. Such validation-intensive methods are a novel approach to determining the relative importance of patient-specific metrics and may help guide the design of customized therapy.


Introduction
Recovering from stroke is a highly variable process [1] that is difficult to influence or predict. There are many clinical assessments to evaluate the state of a patient and gauge his or her long-term prognosis. Some assessments are sufficiently reliable [2], though there is no widely accepted gold standard [3,4]. In practice, a battery of clinical evaluations are conducted, each used to assess a different aspect of a patient's condition. Common assessment areas include: (1) motor ability, such as Fugl-Meyer [5]; (2) functional performance, such as Wolf Motor Function Test [6]; and (3) self-reported motor activity, as in the case of the Motor Activity Log [7] and the Functional Independence Measure [8]. There is no consolidated outcome measure PLOS   Patients reach bimanually to two targets, pseudorandomly placed at one of four possible locations in the workspace. Patients return to a central "Home" position after every center-out reach. Patient's wrists are represented by red spheres. Their task is to get the red spheres inside the yellow targets at the same time. (B) Patients underwent six treatment sessions over two weeks. They were evaluated prior to and immediately after training, as well as one week post-training. Our goal is to use the initial assessment, clinical, and demographic information Table 1 to predict the change in outcome measures between the baseline assessment and the final (retention) assessment.

Regression performance
Of the regression models we tested, we found that Least absolute shrinkage and selection operator (LASSO) [25] models performed best. We also employed elastic nets [26], which generalize the LASSO method, Random Forests [27], and Least Angle Regression (LARS) [28] to simultaneously conduct a sensitivity analysis of our choice of LASSO penalty and establish benchmarks for predictive ability. We relied on both the root mean square error (RMSE; LASSO's success may be unsurprising, because it has the advantage of being able to narrow down the high-dimensional feature space to identify important features in cross-validation, and demonstrate the impact of those features on clinical outcomes. Prediction using LASSO was comparable to both the performances of LARS and a range of elastic net models with We decided to further examine LASSO models more closely to determine the smallest subset of features that can produce the same high performance as the other methods (which tend to use many more features). Our models predicted the change in patients' Wolf Motor Function Test (WMFT) with better coefficient of determination (mean ± sd: 86.07% ± 5.26%) than the change in Upper Extremity Fugl Meyer (65.34% ± 17.45%). Interestingly, first-order LASSO models performed better for predicting WMFT change, while second-order models (using the base 51 features, their interactions, and quadratic terms) performed better in predicting UEFM change.
The poorer prediction performance with UEFM may be partly due to its coarse, discrete nature. That is, the continuous nature of the model prediction is more precise than the discretely reported UEFM. This may inflate the resulting RMSE values. While categorization of UEFM and the subsequent use of logistic and multinomial models may offer a remedy, there are no clear guidelines for establishing thresholds for discretizing the measures.
We found that for WMFT, first-order models mostly performed better than the more complex second-order ones S2 Fig. This is likely due to the increased likelihood of over-fitting in the second-order case, leading to poorer performance under cross-validation. The pairwise interactions and second-order terms are also likely to magnify the multicollinearity problem. These issues are well known to degrade performance of the LASSO algorithm [25,29]. The lack of an advantage to using second-order models leads us to conclude that first-order LASSO models are sufficient for making predictions in the WMFT case. We are less confident in recommending this for the UEFM predictions because of lower predictability. An added advantage of first-order models is that they are easier to interpret and understand, and relationships between predictive features and the outcomes can be more readily translated into actionable clinical interventions.
In contrast, Random Forests exhibited poor performance with very low R 2 (for both firstand second-order models). R 2 was < 2.24% and < 4.68% for UEFM and WMFT, respectively. These R 2 values were consistent with high RMSE (� 5 UEFM and � 6.2 WMFT). This is likely due to either collinearity or sparsity (or both). The large 51-feature input space makes this problem implicitly sparse and the sparsity of the outcomes can be observed in Fig 2, where several extreme values are represented by only a single subject. Cross validation removed points that often led to complete exclusion of some regions of the input/output spaces, worsening predictions. Collinearity by itself has little effect on a Random Forests [30,31] because repeated random re-sampling can discriminate between collinear features; however, collinearity is likely to compound the effects of sparsity.
Another issue is that Random Forests tend to under-perform when the proportion of features that should have been selected (also called consequential covariates) is small ( [32], section 15.3.4). Poor performance is not unreasonable if only some features are pertinent to clinical outcome, and the likelihood of randomly selecting any consequential feature at each split in a decision tree is lower. However, our lack of a priori knowledge of salient features motivated this analysis. Nevertheless, because of poor prediction from Random Forests, our subsequent analytic approach below focuses only on the LASSO approach.
It is important to note that UEFM and WMFT measure different aspects of movement difficulties. Not only do UEFM and WMFT have an inverse relationship, UEFM measures motor ability while WMFT measures function. Concretely, where UEFM evaluates how well patients perform fundamental actions such as bending their elbow, WMFT measures the time it takes patients to effectively perform functional tasks such as grasping and transporting an object. In the sample of patients involved in this study, while participants with higher impairment levels (low UEFM) needed more time to complete functional tasks (high WMFT) (S6 Fig).

Feature importance
To identify a reduced feature space we were interested in the relative importance of predictor features. We used 4-fold cross-validation enumerate how often a feature was selected (feature sufficiency), and feature omission allowed us to measure the impact of removing a feature on model prediction ability (feature necessity). The features selected most frequently by LASSO to predict UEFM change (red diamonds on Fig 3) were age (younger patients improved more), height (taller patients improved more), and affected arm (non-dominant arm improved more). Next on the list were several movement features related to either speed or stability: variance of speed ratio (higher variance improved more), number of speed peaks (fewer speed peaks improved more), and maximum speed (lower speeds did better). While These results are consistent with previous research. It is known that cognitive performance declines with age [33]. Since height correlates with arm length [34], we posit that taller patients  had an easier time reaching their virtual targets. Our results also agree with the effect of handedness on stroke recovery discussed in [35]. Higher variability leads to more comprehensive and often better learning [36]. Stroke survivors had fewer submovements as they improved [37].
Another way to gauge importance was to see how model fit was influenced by excluding a feature (blue on Fig 3). The model fit median R 2 (over the 100 cross-validation runs) was most negatively impacted when either age, height or dominant side were removed, consistent with rank results above. Interestingly, removing some features resulted in changes in R 2 that spanned wide ranges and even had median improvement (such as when maximum trial time and sex were removed), suggesting that it was better to exclude these features from consideration. With this amount of variable data, no concrete statements can be made on the importance of these features on UEFM change.
Feature importances were more distinguishable when we inspected WMFT changes, with all of the top 10 selected by LASSO in nearly 100% of the cross-validation runs (Fig 4, Red). These were initial WMFT score (severely impaired improved more, which had the strongest correlation with WMFT change, r = 0.78), affected hemisphere (left side affected improved more), max path length ratio (higher ratios improved more), variance in number of speed peaks (more consistent improved more), mean max speed (this time higher speeds improved more), arrest period ratio (less time moving improved more), and age (younger improved more). It is also important to point out that mean maximum speed was next in correlation strength (r = 0.24).
That change in WMFT scores was best predicted by patients' initial performance shows that the Wolf Motor test itself is a robust, consistent measure of functional recovery. The other features deemed important to the prediction indicate possible interventions to improve these WMFT changes.
These results were also supported by feature exclusions. Predictions were most negatively impacted when initial WMFT was excluded, as well as the affected hemisphere (Fig 4, blue). Because our evaluations were on the cross-validation data, the model sometimes improved when excluding a feature. This was particularly true with max and variance of the speed ratio (Fig 4). Unlike UEFM, the effects on the WMFT R 2 had a smaller variance over cross-validation runs. The high variance and improvement of the mode upon their exclusion are key indicators that more data is necessary before conclusive statements regarding their importance may be made. The full list of feature ranks for predicting WMFT change is shown in the Supporting Information section S4 Fig. In any case, features nearly always selected by LASSO were deemed essential to prediction, while features never selected were deemed ineffectual. In the central region (diamonds roughly between 0.3 and 0.7 in Figs 3 and 4) were features the LASSO model could not reliably determine were important to the model. Generally, selection of these features was contingent on which data were available to the model at the time they are considered by LASSO.
As mentioned at the beginning of this section, each of these ranking methods, though powerful, has weaknesses. When shrinking the list of features, the LASSO algorithm elects to keep the first strongly predictive features it comes across, and shrinks all features highly correlated with those to zero, as their impact is negligible given the first features are already in the model. Therefore, features chosen using this method should be treated as motifs, whereby each is interpreted not as the raw quantity it represents, but as a thematic property of a patient's condition or ability to move that is helpful in predicting the outcome.
On the other hand, measuring the consequence of excluding an individual feature is not always accurate, because we cannot control for how other correlated features compensate for the drop in R 2 . One or more other factors may compensate, resulting in a small or negligible  Fig 3, proportion of cross-validations each feature was selected is shown in red. The blue points and boxplots show the effect of excluding each feature and rerunning the LASSO models with cross-validation. A patient's initial WMFT score and whether their left side was affected by the stroke are the two features whose removal most negatively impacts the prediction. Conversely, removing information about the patient's mass, stroke type and location was most helpful to the model, improving the adjusted R 2 . Notable among the top ten features is mean max speed, which showed a strong correlation with the outcome, indicating patients who were faster on the first day improved more on the WMFT scale. effect on R 2 . It is reasonable to expect that, for at least some of the features, the change in R 2 reported using our method was small in spite of the importance of that feature.
Random Forests provide an estimate of feature importance that is robust against high feature correlations and not vulnerable to some of the weaknesses of LASSO for feature ranking. We were unable to use that algorithm to rank features, however, due to small number of samples. As more clinical data appears as rehabilitation science matures, such tree methods may better inform our understanding of outcome predictions and feature importance.
It is important to note that our intervention (whether the patient received Error-Augmentation treatment) was not deemed useful or important in predicting our clinical outcome scores. Surprisingly, this is in spite of showing a significant benefit to the treatment type in this randomized, controlled clinical study. If large set of features used here demonstrates that other (perhaps superfluous and uncontrolled) features are much better predictors of outcome, one questions the meaning of the classic clinical test. Truly effective interventions should appear as consistent predictors. We posit here that our validation-intensive methods can verify if detected effects are confounded by other factors.
One might be concerned about whether we would have used all the movement data across the 20 trials undertaken by each patient rather than construct our features from summary statistics, where information might be lost or hidden. However, we believe there is an advantages to using summary statistics because they can robustly resist the spurious influences of random measurement error, while also allowing for easy interpretation. Our chosen breadth of different types of summary statistics (mean, maximum, and variance) led to effective prediction models that correlate with the clinical interpretation of central tendencies, best/worst performance, and consistency.
Another concern is that only main effects and not interactions were chosen from our data. However, our fundamental goal was to identify single factors that were related to outcomes, therefore we focused our feature ranking analysis on the linear models. Interaction models not only require more data, they also are more difficult to interpret in a predictive model with no a priori selected primary exposure variable.
What is most important is the implications these results might have for patients. Younger, taller, non-dominant-affected arm individuals were more likely to improve their abilities (UEFM). Younger, more severely impaired, left arm affected individuals were more likely to improve their function (WMFT). What is also important is what these results might suggest for altering treatment strategies. While fewer speed peaks and lower maximum speeds were loosely related to ability as measured by UEFM gains, UEFM models were less successful and therefore not as reliable as the WMFT models. For the WMFT predictions, the set of highly ranked features included movement speed. Speed also had a strong correlation with outcome and suggests that interventions focusing on speed might improve prognosis. It remains to be seen whether a therapy that encourages faster movements might lead to better functional recovery.

Conclusion
Changes in motor ability (UEFM) and motor function (WMFT) can both be predicted by our models. Change in WMFT is easier to predict since it is a continuous measure. Both changes in UEFM and WMFT can be linked to specific movement features as well as patient demographics and clinical characteristics. Our validation approach also allowed us to measure the certainty of our findings. Since we are unable to affect demographics or clinical characteristics, features that we can influence during rehabilitation are the most critical. This work suggests that speed would be a good first target for further study.

Ethics statement
This work was approved by the University of Illinois at Chicago's Institutional Review Board and Northwestern University's Institutional Review Board. This work conforms to the Declaration of Helsinki for research involving human subjects. All participants provided written consent to participate in the study using consent procedures approved by both Institutional Review Boards.

Patient selection & initial evaluation
We enrolled twenty-six chronic hemiparetic stroke survivors in our study. Participants had mild to moderate impairment, determined by their intake Fugl-Meyer scores (range 25-49) and were selected according to the criteria outlined in [38,39]. Patients' stereoscopic vision was tested using the Stereo Fly Test. Their reaching abilities were then evaluated before starting the study under similar conditions. Patients were instructed to reach with both arms in parallel, without crossing the midline, to two targets in a three-dimensional virtual reality environment ( Fig 1A). Each subject underwent baseline and post-intervention evaluations consisting of a battery of clinical assessments performed by a therapist, followed by 20 bimanual reaches (trials) in the virtual reality environment, each to one of four target locations chosen pseudorandomly.

Intervention
Patients were block-randomized controlling for age and impairment as closely as possible. Both patient groups trained for two weeks using a Phantom 1 3.0 robot arm. The control group received no intervention and used a passive robot arm, while the treatment group experienced disturbance to their paretic arm, in the form of visual and haptic Error Augmentation (EA) [38,40]. All patients were evaluated again immediately after the end of training, with final evaluation taking place one week later to assess longer-term recovery effects. Our main outcome measures were changes in the patients' clinical scores, as evaluated by a therapist, between the first evaluation (prior to beginning the study) and the final evaluation three weeks later. Specifically, our clinical outcomes were the patients' Upper Extremity Fugl-Meyer (UEFM), which measures motor ability, and the Wolf Motor Function Test (WMFT), which measures completion time for functional tasks. This protocol is summarized in Fig 1B.

Construction of the feature set
We gathered a total of 51 features from two sources, either demographic/physiological characteristics, and descriptive statistics of movement (Table 1). Demographic and physiological features were denoted Z il , i = 1, . . ., N, l = 1, . . ., q, where q is the number of variables, were noted at baseline. A battery of p measured movements were observed in T = 20 trials for each i-th subject. These measurements, X ijk , j = 1, . . ., p, k = 1, . . ., T, in T = 20, were used to compute baseline summary features across trials for each study participant: The mean ( � X ij� ), maximum (M ij� ), and variance(V ij� ) movement features. A descriptive summary of demographic and moment variables is provided in Table 1.
Features were primarily based on common metrics or were reported in previous research [18,41,42]; a few of the features were newly explored in this work. These included: (a) performance-related measures evaluating error, speed, and reaction time, (b) descriptive features such as hand path length and trial time, and (c) patients' demographic and clinical characteristics such as height, weight, stroke location, affected side, and initial clinical scores. Ultimately, our first-order feature set contained 51 features, and used summaries of the movement features across the 20 evaluation reaches (mean, maximum, and variance of each feature).
Most of the features we used (as detailed in Table 1) are fairly straightforward. A few are, however, somewhat ambiguous. For trial k, we defined speed ratio as the speed of the first launch divided by the maximum speed, while path length ratio is the distance traveled by the subject's arm divided by the straight line distance from the home position to the target. We defined reaction time as the time between the appearance of the target and the subject crossing our pre-defined threshold of 0.06ms −1 . Mean Arrest Period Ratio (MAPR) is the time the subject spent below 10% of their maximum speed for that trial divided by the total trial time. Finally, Percentage of Movement in Target Direction (PMTD) is defined as the proportion of the distance traveled during a trial in the effective direction to the target, defined formally as the sum of the projections of the distance traveled between two time samples onto the straight line path to the target, divided by the total hand path length.

Predictive models
Movement features and patient characteristics were used to predict change in clinical outcomes: Upper Extremity Fugl-Meyer (UEFM) and Wolf Motor Function Test (WMFT). Since the number of possible predictive features is larger than the number of observations (patients), the most likely models to succeed used algorithms that shrink the number of features to avoid overfitting. These models included Elastic Net [26]. Elastic Net employs penalized linear regression with a parameter (0 � α � 1) that balances l 1 and l 2 norm penalties (more on that below). On one end (α = 0) there is ridge regression [43], purely penalizing the l 2 norm of the coefficients in the model, and on the other (α = 1) is LASSO (Least Absolute Shrinkage and Selection Operator) [25], purely penalizing the l 1 norm of the coefficients. We employed Least Angle Regression (LARS) [28]), a less greedy forward feature selection algorithm. Finally, we tried an algorithm that utilizes decision trees to make predictions (namely Random Forests [27]).
Our first-order prediction models used 51 features (many more for second-order case) to predict changes in clinical scores for 26 stroke survivors. Since this is an overdetermined problem that would be guaranteed to cause overfitting, the most likely algorithms to succeed would have to reduce the number of features used by the model. One such algorithm that we implemented was Elastic Nets, represented by the following formula: where . . . ; Z iq �, p � = 3p + q, t 1 and t 2 are regularization terms related to the penalty, placing an upper limit on the sum of the first and second norms of predictor coefficients, and β j is the coefficient of the j-th feature. In Lagrangian form: where λ 1 and λ 2 are the penalty for the sum of l 1 and l 2 norms of the coefficients, respectively. We tested a range of α values from 0 to 1 with increments of 0.1-where α = 0 corresponds to a penalty purely based on the l 2 norm of the coefficients (ridge regression) and α = 1 corresponds to a penalty based purely on the l 1 norm (LASSO regression). LASSO drives a lot of the predictor coefficients to zero and simplifies the resulting model [25] but sometimes over-regularizes. On the other hand, ridge regression keeps all coefficients in the model and drives the less useful ones close to zero, but is more difficult to interpret since it does not remove any features from the model [43]. Intermediate α values attempt to balance removing predictors from the model and driving their coefficients close to zero. Least Angle Regression (LAR) adds coefficients to the model in a stepwise manner starting with the most correlated with the outcome being modeled, then adding coefficients in order of correlation with the residual from the previous step. LAR may behave in a more stable fashion than regularized regression in some cases but is also highly affected by noise [ [28], discussion by Weisberg].
The chosen model in the case of regularized regression and LAR was the maximum λ where mean cross-validation error was within 1 standard error of its minimum value, as in [25].
Random Forests constructs an ensemble of decision trees, each based on a randomly chosen subset of the observations and features (therefore providing validation via random sampling, and ensuring the trees do not overfit). Each decision tree is constructed to minimize the mean square error at each split, and contributes a "vote" to the ensemble. The value predicted by the random forest for a new observation is either a majority vote (classification problems) or a mean predicted value (regression problems).
For Random Forests, we built large ensembles (50,000 regression trees) with 100 repeats of 4-fold cross validation. The large number of trees was used to ensure the algorithm has adequate usage of each feature to assess its importance, especially in the case of second-order models (> 1300 predictors).
Estimation was performed under cross-validated to avoid over-fitting and reduce the influence of outliers. Over-fitting was a concern because of the limited sample size and the eventual inclusion of second-order terms comprised of quadratic and pairwise interaction variables. We evaluated both first-and second-order models by looking at their variance explained (measured using adjusted R 2 ) and prediction error distribution (RMSE). The process was identical for predicting both UEFM and WMFT changes. We repeated 4-fold cross-validation 100 times to obtain a range for each prediction quality metric, and ensuring we took into account different data splits.

Ranking the features
Since LASSO regression has no built-in method for ranking model predictors and reducing the dimensionality of the input feature space needed to predict UEFM and WMFT changes, we devised two methods to unpack our LASSO models and determine the relative importance of features instrumental to clinical outcome prediction.
First, we examined the shrunk feature set resulting from each of our 100 4-fold cross-validation repeats. We used the proportion of these repeats each feature was selected as our main measure of that feature's importance for prediction. This gave each feature a rank from 0 (always shrunk, unimportant for prediction) to 1 (never shrunk, essential for prediction).
Second, we excluded individual features and calculated the difference in prediction R 2 when the feature was present and the R 2 ðÀ jÞ when the feature was removed. We again used 4-fold cross-validation with 100 repeats. To avoid making a distributional assumption for the change in the coefficient of variation, D j ¼ R 2 À R 2 ðÀ jÞ , we used its median as a measure of relative importance. Features whose removal resulted in larger Δ j were deemed more important to the model. Random forests are a powerful method [27] for ranking features in high-dimensional data by their relative importance in predicting an outcome variable. However, we saw only limited success in their ability to identify important features in this data set. This is primarily due to their failure to effectively predict changes in clinical outcomes, likely due to the small yet highly sparse data.
Supporting information S1 Fig. Cross-validation folds and repeats were balanced when considering individual features. There were no obvious outliers when examining each feature's mean for each fold during cross-validation. Cross-validation means averaged to zero across all folds/repeats. This is more complex when considering second-order models, but the basic sampling in our crossvalidation was balanced. Proportion of cross-validations each feature was selected is shown in red. The blue points and boxplots show the effect of excluding each feature and rerunning the LASSO models with cross-validation. A patient's initial WMFT score and whether their left side was affected by the stroke are the two features whose removal most negatively impacts the prediction. Conversely, removing information about the patient's mass, stroke type and location was most helpful to the model, improving the adjusted R 2 . Notable among the top ten features is mean max speed, which showed a strong correlation with the outcome, indicating patients who were faster on the first day improved more on the WMFT scale. UEFM was more sensitive to patients with relatively higher functional ability, while WMFT was more sensitive to those with lower functional ability. WMFT scores plateaued for patients showing larger UEFM changes. This relationship between UEFM and WMFT may explain our observation that slower speed predicted better recovery for UEFM while higher speeds were predictive of faster WMFT times. Changes in clinical scores were not statistically significant after the intervention. (TIF)