Computer-aided identification of degenerative neuromuscular diseases based on gait dynamics and ensemble decision tree classifiers

This study proposes a reliable computer-aided framework to identify gait fluctuations associated with a wide range of degenerative neuromuscular disease (DNDs) and health conditions. Investigated DNDs included amyotrophic lateral sclerosis (ALS), Parkinson’s disease (PD), and Huntington’s disease (HD). We further performed a statistical and classification comparison elucidating the discriminative capability of different gait signals, including vertical ground reaction force (VGRF), stride duration, stance duration, and swing duration. Feature representation of these gait signals was based on statistical amplitude quantification using the root mean square (RMS), variance, kurtosis, and skewness metrics. We investigated various decision tree (DT) based ensemble methods such as bagging, adaptive boosting (AdaBoost), random under-sampling boosting (RUSBoost), and random subspace to tackle the challenge of multi-class classification. Experimental results showed that AdaBoost ensembling provided a 6.49%, 0.78%, 2.31%, and 2.72% prediction rate improvement for the VGRF, stride, stance, and swing signals, respectively. The proposed approach achieved the highest classification accuracy of 99.17%, sensitivity of 98.23%, and specificity of 99.43%, using the VGRF-based features and the adaptive boosting classification model. This work demonstrates the effective capability of using simple gait fluctuation analysis and machine learning approaches to detect DNDs. Computer-aided analysis of gait fluctuations provides a promising advent to enhance clinical diagnosis of DNDs.


Introduction
Human motion is controlled by the neuromuscular system, which comprises all muscles, sensory neurons, and motor neurons [1]. Degenerative neuromuscular disease (DNDs) arises from the degeneration or progressive loss of the function in efferent or afferent nerves. Efferent nerves are responsible for controlling voluntary muscles, while afferent nerves communicate sensory information back to the brain and the central nervous system [2]. Examples of conditions simultaneously. In accordance, a limited number of recent studies tackled multiclass classification of DNDs. In [24], Beyrami et al. used a wide range of statistical and entropy features alongside a non-negative least squares (NNLS) classifier. This approach was applied to raw short-length VGRF signals only. Lin C-W et al. [25] investigated recurrence plot and principal component analysis to transform time-domain VGRF signals into images. These images were inputted as features to a CNN model for classification. On the contrary, Alaska et al. [19] compared the performance of several classification models, namely artificial neural network (ANN), KNN, linear SVM, and RF. In the feature transformation process, extracted temporal and spectral features included independent reconstruction components, approximate entropy, standard deviation, minimum, maximum, and mean values, and the ratio of peak-magnitude to root-mean-square. According to previous studies, deep learning-based models tend to exhibit a highly auspicious performance when classifying DNDs in both binary and multiclass contexts. In most cases, complicated preprocessing and feature engineering techniques were also used. Compared to traditional machine learning and pattern recognition methods, training and validating a reliable deep learning architecture requires significant computational resources. Typically, this process is iterative, involves multiple model parameters, and entails specialized graphical processing units. The lack of sufficient, high-quality, and comprehensive clinical data is also considered amongst the main limitations. To exploit the value of automated disease detection systems in resource-constrained settings where only small datasets and low-cost hardware devices are available, simplistic computational approaches to characterize and classify gait patterns are worth investigation.
To address the shortcomings of previous works, this study proposes a simple yet reliable computer-aided framework that simultaneously detects a wide range of DNDs based on gait dynamics. Our primary objective is to perform a comparative performance investigation for different combinations of spatiotemporal gait patterns and ensemble classification methods. To this end, we first proposed a new approach to derive spatiotemporal gait cycle time series from VGRF signals. This approach was applied to derive parameters such as stride duration, stance duration, and swing duration. Feature characterization of the VGRF signals and the spatiotemporal gait signals was based on the statistical descriptors of root mean square (RMS), variance, skewness, and kurtosis. These descriptors were applied to raw short-length signals to maximize data availability and support the proposed framework's computational efficiency. Finally, we compared the performance of various DT ensemble models based on the concepts of bagging, adaptive boosting (AdaBoost), random under-sampling boosting (RUSBoost), and random subspace. Fig 1 illustrates the DNDs detection framework employed in this study.
This paper is organized as follows. Section 2 provides a complete description of the proposed framework and the adopted methodology in this study. Section 3 presents some statistical observations on the features extracted for various disease conditions and gait signals. Moreover, it compares the performance of the ensemble classification models as applied to each of the investigated gait signals. In section 4, an in-depth discussion of the results obtained compared to other recent studies in the literature is provided and the methodological limitations of this work are highlighted. Finally, section 5 concludes this paper.

Dataset description
In this study, we used the publicly available Physionet database for neurodegenerative diseases gait patterns [26,27]. This dataset comprised of a total of 48 recordings spanning three different disease conditions: amyotrophic sclerosis (13 patients), Huntington's disease (20 patients), and Parkinson's disease (15 patients). The dataset also included 16 healthy control subjects. Table 1 provides a characteristic and demographic summary of the subjects involved. The raw VGRF gait signals, representing the force measured under each foot separately, were recorded using eight distinct distributed force sensors under each foot (2 channels, left and right). The VGRF signals were recorded at a sampling rate of 300Hz. All the subjects were instructed to walk continuously at their average pace along a 77m hallway for 5min. When the hallway end was reached, the subjects had to turn around and walk in the opposite direction. Before data preprocessing, the data points corresponding to the first and last 15s were eliminated to reduce artifacts caused by movement start or end, as recommended by the previous work of Hausdorffet al. [26,28]. Extreme spike values lead by the end of hallway turn-backs were corrected using a median filter [21,29]. To maximize the number of available training instances,

PLOS ONE
we segmented the 5min signal recordings into multiple 30s windows without overlapping. Excessively noisy data windows were identified and discarded by manual inspection. Each window was then considered as an independent signal sample in the feature extraction and classifier training-validation process. Table 1 shows the final number of signal samples associated with each disease category after preprocessing.

Extraction of spatiotemporal gait parameter signals.
In addition to the raw VGRF signals, other spatiotemporal gait parameters, such as stride duration, swing duration, and stance duration, were derived for each left and right foot independently. According to the GAITRite reference system, gait events are defined based on changes in foot-floor contact patterns. A gait cycle starts with a stance phase, during which the foot remains in contact with the ground [30,31]. Thus, the stance duration parameter refers to the time elapsed between a heel-strike action and a subsequent toe-off action. Following is the swing phase corresponding to the stage where the foot is off-ground; the corresponding swing duration parameter bounds the toe-off action and the next gait cycle's heel-strike action. The stride combines both the stance and swing phases and corresponds to a complete gait cycle. The stride duration parameter estimates the time length of a single gait cycle marked by two successive heel-strike actions. [32,33] Fig 2 illustrates the stride, stance, and swing phases on the VGRF signal marked by the heel-strike and heel-off events. In order to facilitate the identification of heel-strike and toe-off points, the VGRF signals were first approximated as bilevel waveforms using the histogram methods described in [34]. At first, each VGRF signal was realized as a random variable, and the underlying probability distribution was non-parametrically constructed by binning the signal to a uniform-bin-width histogram. The appropriate histogram range and number of bins were adaptively determined for each signal. Let A be a VGRF signal with a maximum amplitude A max , a minimum amplitude A min , the histogram range A R was calculated using: The optimal bin-width was determined using Scott's normal reference rule [35]: whereŝ is the standard deviation of the signal and n is the total number of time samples. Accordingly, the total number of equal-sized bins was found as: The constructed histogram was further divided into two sub-histograms, a lower state histogram H L with L bins and an upper state histogram H U with U bins, according to the following criteria: where i low is the lowest index and i high is the highest index in the main histogram. The lower and upper state levels were then estimated as the mode of H L and H U , respectively. Finally, to identify the gait events of interest, a 10% reference was set above the lower level estimated from H L . For a lower bilevel S L and an upper bilevel S U , the 10% reference level was set as: The heel strike point was estimated as the time instant when the positive-going transition of the VGRF signal crosses the 10% reference. Similarly, the toe-off point was estimated as the time instant when the negative-going transition crosses the same 10% reference level. illustrates the estimated upper and lower histogram bilevels, the 10% reference level, and the corresponding heel-strike and toe-off points for a sample VGRF signal.

Feature extraction
The features extracted in this study included the RMS, variance, kurtosis, and skewness. These linear features provide a simple way to statistically quantify temporal changes in the amplitude, structure, and regularity of the gait signals, thus, making them an ideal option for computeraided diagnostic tools and real-time disease detection applications.
The root-mean-square statistic (RMS) is defined as the square root of the arithmetic means of the squared of a signal A: where N is the number of time samples making up the signals A. The variance (var) in statistics measures the spreadness of the signal's amplitude around its mean and is mathematically defined as: where μ s is the mean of A given by: The skewness (Sk) is used as a measure of amplitude asymmetry around the mean and can be computed as: The kurtosis (Ku) measures the degree to which the signal distribution is prone to outliers and is calculated as: The statistical temporal features were then extracted independently from each sample signal, i.e., left and right raw VGRF signals or gait parameter signals (stride, stance, and swing). The final feature vectors were formed by concatenating the statistical metrics extracted from each signal type separately. Accordingly, four distinct feature vectors, each is of size 1 × 8, were considered for classification.

Classification models
Decision Tree (DT) is a popular supervised machine learning algorithm and is amongst the most simplistic and intelligible predictive modeling approaches. As its name suggests, a DT can be thought of as a tree with root nodes, internal leaf nodes, and branches. The root nodes represent the features, the leaf nodes represent the class labels, and the branches represent the conjunctions connecting features to their class labels. The model performance depends on how well the tree is constructed from the training data. In this work, the classification and regression tree (CART) algorithm was employed to construct the DT models at the training stage [36,37]. The Gini's diversity index was employed as the root node split criterion [38].
Different DT ensemble variations were also employed for classification, namely bagging, AdaBoost, RUSBoost, and random subspace. All investigated models were implemented following their binary realizations, and the multi-class classification problem was handled through a one-versus-all error-correcting output code ensembling. In this approach, the multi-class classification decision is made by combining the predictions of multiple base classifiers. Each base classifier performs a single binary classification task targeted towards detecting a single class from the rest [39]. The mathematical formulation of these classification methods is detailed in [40][41][42][43][44].
Before model training, a 10% sample subset was randomly selected from the overall dataset for tuning the classifiers' parameters. Hyperparameter tuning was done via Bayesian optimization with a cross-validation loss cost function. Table 2 summarizes the parameters selected for each classification model and feature vector after optimization. The complete training and validation analysis was performed via Matlab software (R2020a, Natick, Massachusetts, USA).

Classification performance evaluation
To get a robust estimation of the overall classification performance, the models were trained and tested using 10-folds cross-validation. To account for data imbalance, the folds were divided using an equi-stratified approach. The folds had the same number of samples (without repetition) with a class distribution following the overall dataset. The performance evaluation metrics included accuracy, sensitivity, specificity, F1-score, and Cohen's kappa coefficient (κ). Provided below are the confusion matrix-based definitions for each of these metrics: where the true positives (TP) and the true negatives (TN) represent the count of correctly classified audio signals, while the false positives (FP) and false negatives (FN) represent the number of signals incorrectly classified. Po is the relative agreement between raters, and it is equivalent to the classification accuracy, while Pe is the hypothetical probability of agreement by chance and can be calculated as [45]:

Features distribution
A one-way analysis of variance (ANOVA) was conducted to assess the significance of each statistical feature derived from each gait signal separately. The compared ANOVA levels corresponded to the disease conditions, namely control, ALS, PD, and HD. Since the feature distributions were non-normal, as revealed by the Kolmogorov-Smirnov test, a non-parametric Kruskal-Wallis ANOVA was employed. Moreover, the post-hoc Dunn-Sidák approach was used to perform pairwise comparisons between disease conditions. The tests were performed with a 95% confidence interval to verify the statistical significance of the extracted features. It is worth noting that for each feature type, uniform sample size was maintained between disease levels. A 130 sample size was selected to match the ALS category having the smallest number of samples. Figs 4-7 visualize the features distribution for the VGRF, stride, stance, and swing signals, respectively. The P and chi-square (x 2 ) values on the plots represent the results of the Kruskal-Wallis test. The asterisks represent the pairwise comparison results between disease classes after applying Dunn-Sidák correction ( � :p � 0.05, �� : p � 0.01, ��� :p � 0.001). In general, the results positively confirmed statistical significance between different disease conditions. The investigated statistical features were highly sensitive to changes in gait dynamics between disease conditions, thus providing a promising outlook into using them for the classification analysis. Table 3 compares the performance of the investigated classification models as applied to the features derived from the VGRF, stride, stance, and swing signals independently. The tabulated values represent the average of the classification accuracy, sensitivity, specificity, F-score, and Cohen's kappa coefficient over validation folds.

VGRF-based features.
Using the statistical features derived from the left and right VGRF signals, the results show that the base DT model performed poorly compared to the other ensemble classifiers with an overall average accuracy of 93.13%, sensitivity of 98.23%, specificity of 99.43%, F1-Score of 85.97% and Cohen's kappa coefficient of 81.43%. Using the AdaBoost ensemble approach, the overall performance notably improved, providing an average classification accuracy of 99.17%. Using the same classification model, the sensitivity, specificity, F1-score, and Cohen's kappa coefficient metrics reached 99.17%, 98.23%, 99.43%, 98.28%, and 97.73%, respectively. Slightly lower classification accuracies were observed for the random subspace (97.30%), Bagging (96.57%), and Boosting (96.66%) ensembles. Table 3, the AdaBoost model provided the highest detection accuracy for the stride-based feature set at an overall average of 97.68%. The DT and random subspace models a relatively lower classification accuracy of 79.06%. The highest sensitivity (58.65%), specificity (86.26%), F1-Score (58.10%) and Cohen's kappa coefficient (44.60%) were also obtained by the AdaBoost classifier. On the contrary, the Bagging ensemble model provided the worst performance, as demonstrated by its classification accuracy of 78.53%. All other metrics dropped to 56.31%, 85.34%, 55.17%, and 41.09% for the sensitivity, specificity, F1-score, and Cohen's kappa coefficients, respectively.

Stance-based features.
The best classification performance for the stance-based feature set was attained using the AdaBoost classifier at an accuracy of 81.98%. Concurrently, the highest sensitivity of 63.54%, specificity of 87.81%, F1-score of 63.18% and Cohen's kappa coefficient of 51.18% were obtained using the same model. The Random substance ensemble

Swing-based features.
The swing-based features displayed a similar pattern to that obtained using the stride and stance features. The AdaBoost model provided a superior overall performance with an accuracy of 79.30%, sensitivity of 55.97%, specificity of 85.71%, F1-score of 56.19%, and Cohen's kappa coefficient of 42.51%. Following, the RUSBoost and random subspace ensembles demonstrated a slightly worse performance with overall overage accuracy of 77.73% and 77.71%, respectively. The worst performance among all classification models and feature sets was obtained using the bagging ensemble as evidenced by the obtained accuracy (77.20%), sensitivity (49.56%), specificity (84.08%), F1-score (47.42%), and Cohen's kappa coefficient (33.38%).
Considering that AdaBoost yielded the best improvement, it was used to gauge the effectiveness of detecting particular disease conditions. Fig 8 compares the class-specific results associated with the best performing AdaBoost model for each gait time-series signal. For the VGRF feature set, the highest class-specific accuracy was achieved for the HD classes at an average of 99.4%. The CON, ALS, and PD classes were associated with the classification accuracy of 98.8%. For the stride, stance, and swing feature sets, the highest accuracies, F1-scores, and Cohen's kappa coefficients were associated with the ALS class at the ranges of 83.75%-82.29%, 70.68%-67.68%, and 59.45%-55.49%, respectively.

Discussion
This work aimed to provide an efficient computer-assisted approach for identifying gait dynamics associated with healthy versus various DND conditions. To this end, we propose a simple yet effective framework incorporating two main stages: (1) extracting statistical temporal features from different types of gait signals and (2) and performing multi-class classification using supervised machine learning approaches. We investigated the efficiency of using ensemble learning systems, namely bagging, AdaBoost, RUSBoost, and random subspace. Moreover, we carried out a detailed statistical and classification comparison between the features extracted from different gait signals, namely left and right ground reaction force, stride, stance, and swing signals. The prospect of machine learning usually requires extensive data transformations to provide the best possible training set to the learner. Amongst the main aspects related to data transformation is feature extraction. Optimal feature extraction provides a better representation of patterns under investigation and improves the models' predictive performance. In our proposed framework, gait signals were characterized based on statistical features, including the RMS, variance, kurtosis, and skewness. One of the main strengths of the proposed framework is using a limited number of simple features to characterize gait dynamics. These features were derived directly from raw and short-length gait singles without applying extensive preprocessing or complex filtering or transformation techniques. Such characteristic adds to the computational efficiency of the proposed framework and facilitates its application in real-time settings. Despite the simplicity, our statistical analysis showed that these features positively represented characteristic variations between disease groups. Post-hoc comparisons revealed that the features derived from the raw VGRF signals corresponded to more significant pairwise group differences.
For DND detection, we employed four types of ensemble classifiers: bagging, AdaBoost, RUSBoost, and random subspace. In order to highlight the significance of ensembled predictions, we also considered the performance of the base decision tree model. In line with the statistical analysis results, the classification analysis showed that the models' predictive performance was influenced by variability in the gait feature. Evaluation of classification performance further emphasized that the VGRF-based feature set exhibited a notably higher predictive efficiency than the other three feature sets, regardless of the classification model used. Our target of achieving a high-performance detection framework was accomplished using the AdaBoost classifier in conjugation with the VGRF-based feature set, with an average classification accuracy of 99.17%. Correspondingly, the class-specific accuracies of 98.8%, 98.8%, 98.8%, and 99.4% were achieved for the control, ALS, PD, and HD groups, respectively. Similarly, using the features extracted from the gait parameter signals, the AdaBoost model generally provided superior performance. However, we obtained a lower overall accuracy of 81.98% for the stance-based feature set, 79.68% for the stride-based feature set, and 79.30% for the swing-based feature set.
Worth noting, the classification results provided empirical evidence suggesting that ensemble classifier systems are better performers than their constituent base models. Using the base decision tree model, the VGRF-based features set provided the best classification performance, but ultimately, all ensemble techniques improved classification results to varying degrees of success. The AdaBoost yielded the most considerable improvement in all metrics, with an improvement percentage of 6.49%, 13.74%, 4.26%, 14.32%, 20.02% for the accuracy, sensitivity, specificity, F1-score, and Cohen's kappa coefficient, respectively. Following Adaboost, random subspace performed best with a 4.48% increment in the accuracy, 9.14% in the sensitivity, and 2.89% in the specificity. Slightly lower performance improvements were associated with bagging and RUSBoost models. For the gait parameter signals, the AdaBoost model showed the most notable performance improvement. The associated percentage increase in detection accuracy, sensitivity, and specificity ranged between 0.78%-2.72%, 2.14%-8.93%, and 0.61%-1.64%, respectively.
The physionet gait database was used in a few recent studies to perform a multi-class classification of neurodegenerative diseases. Table 4 provides a comparative summary of these works. In agreement with our proposed framework, adopted literature approaches generally integrated a wide range of feature extraction methods with supervised machine learning classification. The feature extraction methods for the VGRF signals included statistical amplitude quantification, detrended fluctuation analysis, and fractal dimension. The features characterizing gait parameter signals were based on statistical amplitude quantification and recurrent analysis. For the classification task, a limited range of standard algorithms was explored, i.e., adaptive boosting trees, random forests (RF), support vector machines, and sparse non-negative least squares (NNLS) coding. Athisakthi et al. reported the highest accuracy for the parameter gait signal through using Wavelet transform-based statistical features and RF classifier (stride 91.75%, stance 93.74%, and swing 93.7%) [46]. However, the best overall accuracy of 98.45% was obtained through statistical characterization of VGRF signals alongside NNLS coding classification [24]. Thus, it can be noted that our proposed framework significantly improved the neurodegenerative disease recognition rate in comparison to the state-of-the-art methods in the literature. Potential advantages of such an accurate diagnostic system include aiding in smart long-term monitoring. This also supports clinicians and care providers with noninvasive and low-cost tools to aid in making diagnostic decisions. A possible explanation for the relatively lower accuracies obtained using the stride, stance, and swing parameters signals might be due to the small-length raw VGRF signals used to derive these signals. However, this approach was followed since the available dataset is not large enough to perform multiple fold validation.
There may exist several methodological limitations in this study. Patient-specific factors such as age, gender, and disease severity were incongruent between different disease groups. Inconsistencies in such subject-specific factors could have a direct effect on the classification model's predictive performance. The availability of a more comprehensive dataset is essential to investigate the impact of these factors and, therefore, support the generalization ability of the proposed framework.

Conclusion
This paper investigates the application of ensemble classification to identify different DNDs. Based on normal-paced gait fluctuations, healthy and disease conditions were characterized using spatiotemporal statistical features derived from VGRF signals. For the classification task, several ensemble classification approaches were investigated based on a base decision tree classifier. A data-driven hyperparameter tuning approach using Bayesian optimization was employed to select the most proper parameter for all classification methods. The obtained results demonstrated the promising capability of detecting common DNDs, with the highest overall classification rate of 99.17%. Thus, the proposed framework is applicable to aid in diagnostic decisions while considering computing hardware resource-restricted environments. This framework can be extended in future work to include other types of DNDs and spatiotemporal gait patterns. However, this requires further experimentation spanning a broader range of subjects and disease conditions. Moreover, the investigation of other feature extraction approaches and deep learning classification models is expected to improve classification performance.