Rett syndrome severity estimation with the BioStamp nPoint using interactions between heart rate variability and body movement

Rett syndrome, a rare genetic neurodevelopmental disorder in humans, does not have an effective cure. However, multiple therapies and medications exist to treat symptoms and improve patients' quality of life. As research continues to discover and evaluate new medications for Rett syndrome patients, there remains a lack of objective physiological and motor activity-based (physio-motor) biomarkers that enable the measurement of the effect of these medications on the change in patients' Rett syndrome severity. In our work, using a commercially available wearable chest patch, we recorded simultaneous electrocardiogram and three-axis acceleration from 20 patients with Rett syndrome, along with the corresponding Clinical Global Impression—Severity (CGI-S) score, which measures overall disease severity on a 7-point Likert scale. We derived physio-motor features from these recordings that captured heart rate variability, activity metrics, and the interactions between heart rate and activity. Further, we developed machine learning (ML) models to classify high-severity Rett patients versus low-severity Rett patients using the derived physio-motor features. For the best-trained model, we obtained a pooled area under the receiver operating characteristic curve equal to 0.92 via a leave-one-patient-out cross-validation approach. Finally, we computed the feature popularity scores for all the trained ML models and identified physio-motor biomarkers for Rett syndrome.


Introduction
Rett syndrome is a rare genetic neurodevelopmental disorder that occurs primarily in girls [1]. Though Rett syndrome is a clinical diagnosis, more than 95% of cases are caused by mutations in the gene encoding the methyl-CpG binding protein 2 (MECP2), a transcriptional regulator involved in chromatin remodeling and the modulation of RNA splicing [2]. Rett syndrome affects 1 in 10,000 females and is characterized by a period of apparently normal postnatal development followed by developmental delay and loss of acquired skills. Prior work has analyzed physiological time series by constructing network representations at multiple time scales and deriving network characteristics at each time scale; these network representations revealed more nuanced characteristics of the time series being analyzed. The goal of our study was two-fold. First, we wanted to develop machine learning (ML) classification models to classify patients with low-severity Rett syndrome (CGI-S ≤ 4) versus patients with high-severity Rett syndrome (CGI-S > 4) based on the objective measures attained from a wearable biosensor. Second, through the classification experiment, we wanted the trained models to provide us with important features (physio-motor biomarkers) that could help distinguish the two groups. Hence, we developed binary classifier models based on raw data recordings using metrics derived from the following feature sets: (1) HRV metrics, (2) actigraphy metrics, (3) MSTE features, and (4) MSNR features. We used the least absolute shrinkage and selection operator (LASSO) for model training and developed logistic regression models with the L1 penalty. We developed separate models for each of the four feature sets listed above and for all possible two-, three-, and four-way combinations of these feature sets. In total, we developed 15 binary-classification models for Rett syndrome severity classification. Finally, we listed the model features that were important for Rett syndrome severity classification. We illustrate the complete pipeline of our work in Fig 1.

Data collection
The dataset for this work was sourced from two Institutional Review Board (IRB)-approved studies: (1) the Triheptanoin clinical trial [27] and (2) the outcome measures and biomarkers development study [28]. The data were collected between January 2016 and December 2018, a three-year period. We used the body-worn patch BioStamp nPoint (MC10 Inc., Cambridge, MA, USA) [29] to record ECG and three-axis acceleration from all the participants. While some ECG records were captured at a sampling rate of 125 Hz, others were captured at 250 Hz; the three-axis acceleration records were correspondingly captured at sampling rates of 31.25 Hz and 62.5 Hz. These differences did not meaningfully influence the HRV and activity metrics we extracted [30]. We captured the ECG signal and the three-axis acceleration from the following four locations on the body: (1) medial chest, (2) left hypochondrium, (3) right hypochondrium, and (4) left pectoralis. Per the protocols, not all four locations were used for every participant; only a subset of these locations was used for each participant. In conjunction with the signal data obtained from the biosensors, caretaker and physician surveys were conducted to obtain symptom severity for all 20 patients enrolled in the study. Specifically, the CGI-S scores obtained through physician surveys were used to assign a binary label (low-severity vs. high-severity) to each patient-visit. A patient-visit was assigned to the low-severity category if CGI-S ≤ 4 and to the high-severity category if CGI-S > 4. For each patient-visit, we needed two consecutive days of signal data for feature extraction. Applying this filter yielded a total of 32 patient-visits with two consecutive days of signal data and the associated CGI-S label. Among the 32 patient-visits, 18 were high-severity visits corresponding to 10 unique patients and 14 were low-severity visits corresponding to 11 unique patients.
One patient had both low-severity and high-severity visits. We considered each patient-visit a data point and thus had a total of 32 data points with associated binary labels for model development and analysis. The racial, ethnic, and age breakdowns of the study subjects are illustrated in Fig 2A-2C, respectively. The low-severity and high-severity groups had a similar split of patients with respect to race, ethnicity, and age. We had 8 patients each in the low and high-severity groups whose race was "Caucasian" and 2 patients each in the two groups whose race was "Asian"; one patient's race was unknown. Further, we had 8 patients each in the low and high-severity groups whose ethnicity was "Non-Hispanic" and 2 patients each in the two groups whose ethnicity was "Hispanic"; one patient's ethnicity was unknown. Finally, the median ages at enrollment in the low and high-severity groups were 8.5 and 8 years, respectively. Study approval. This study was approved by the Emory Institutional Review Board (IRB00088492: Outcome Measures and Biomarkers Development for Rett Syndrome and Multisite development of standardized assessments for use in clinical trials). Written informed consent was obtained from the parents of the patients prior to participation.

Missing data
Considerable amounts of missing data were present in the dataset due to the following reasons: (1) device charging and data upload, (2) motion artifacts, and (3) in some cases, low compliance by the caretakers. The caretakers were trained in-clinic for data collection. Per protocol, caretakers collected approximately an hour of in-clinic data from the patients while learning how to use the devices. The caretakers were then asked to record data at home for up to seven days, in intervals of up to 24 hours at a time. The BioStamp sensor needed to be removed during bathing. Further, the device needed to be fully charged every 20 to 24 hours depending on the device age, and a full charge took up to two hours. To minimize data loss, the caretakers were advised to charge the devices while the patients were bathing. However, due to various factors such as forgetfulness and technological difficulties, caretakers could have exhibited lower compliance. To tackle the missing-data issue, we implemented three signal imputation techniques to improve data quality and increase the amount of data available for analysis:
1. Signal quality index-based ECG data imputation.
2. Data imputation for activity counts.
3. Stochastic surrogate data imputation.
Signal quality index-based ECG data imputation. The ECG signal recorded using the wearable patches contained sections of data corrupted by motion artifacts. To improve data quality, we sourced data from multiple sensors. As discussed earlier, we simultaneously captured ECG from up to four unique locations on the body. For a time-window t in which we had N (N ≤ 4) ECG signal snippets recorded from N locations, we chose the signal snippet with the highest Signal Quality Index (SQI). The SQI for ECG gives the percentage of beats that match when detected by multiple annotation generators with highly differing noise responses [31]; we refer the readers to Li et al. [32] for a detailed explanation of the SQI computation algorithm. If this SQI value was greater than 0.75, we used the snippet in our analysis; otherwise, we discarded all ECG data for the time-window t. By switching between the N signals for each time-window t to form a single 1-D ECG signal, we maximized the amount of good data available for the analysis.
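The per-window channel-switching logic can be sketched as follows. This is a minimal illustration, not the study's actual code; the function names and the list-of-windows layout are ours, and the SQI values are assumed to be precomputed (e.g., by the toolbox referenced above):

```python
import numpy as np

def select_best_channel(snippets, sqis, threshold=0.75):
    """Pick the ECG snippet with the highest SQI for one time-window.

    snippets : list of 1-D arrays, one per body location (up to 4)
    sqis     : list of SQI values in [0, 1], one per snippet
    Returns the best snippet, or None when even the best SQI is not
    above the threshold (the window is then discarded, as in the paper).
    """
    best = int(np.argmax(sqis))
    if sqis[best] > threshold:
        return snippets[best]
    return None  # discard all ECG data for this window

def stitch_windows(windowed_snippets, windowed_sqis):
    """Form a single 1-D ECG signal by switching channels per window."""
    pieces = []
    for snips, sqis in zip(windowed_snippets, windowed_sqis):
        chosen = select_best_channel(snips, sqis)
        if chosen is not None:
            pieces.append(chosen)
    return np.concatenate(pieces) if pieces else np.array([])
```

In this sketch, windows whose best channel still falls below the 0.75 threshold are simply omitted from the stitched signal, mirroring the discard rule described above.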
Data imputation for activity counts. The activity count signal was extracted from the three-axis acceleration using Oakley's method [33] in order to further extract actigraphy metrics. When analyzing the activity count signal in isolation, per the scripts provided in the actigraphy toolbox [34], we combined the data to obtain one value per hour. When there was no data in a given hour, we imputed those samples using Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) interpolation, as recommended by Cakmak et al. [35].
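A minimal sketch of this hourly-count imputation, assuming SciPy's `PchipInterpolator` and NaN-marked missing hours (the function name is ours, not the toolbox's):

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def impute_hourly_counts(hours, counts):
    """Fill missing hourly activity-count values with PCHIP interpolation.

    hours  : 1-D array of hour indices (e.g., 0..47 for two days)
    counts : 1-D array of activity counts, with np.nan where the hour
             had no data
    """
    counts = np.asarray(counts, dtype=float)
    hours = np.asarray(hours, dtype=float)
    known = ~np.isnan(counts)
    # Shape-preserving cubic interpolant fitted on the observed hours
    interp = PchipInterpolator(hours[known], counts[known])
    filled = counts.copy()
    filled[~known] = interp(hours[~known])
    return filled
```

PCHIP is shape-preserving (it does not overshoot between monotone samples), which is why it is a reasonable choice for count data that must stay non-negative.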
Stochastic surrogate data imputation. The computation of transfer entropy and the multiscale network representation required us to impute the missing data in the 48-hour HRV feature and activity count signals. For this, we developed the surrogate data imputation method, a stochastic technique that imputes missing data in a timeseries using data in the vicinity of the missing sections (gaps). The data imputation algorithm works as follows. Given a time series S and its timestamps t, we find all N gaps g[i], i ∈ {1, 2, ..., N}, in S that are greater than a fixed threshold th_g. The gaps are then sorted and imputed with surrogate signal snippets in increasing order of gap length, as follows. We denote the length of a gap in seconds as g_l, and the timestamps where the gap begins and ends as t_b and t_e, respectively. For each gap g[i], we draw a sample x_r from the standard normal distribution, i.e., x_r ~ N(0, 1). This distribution is mapped onto the timeseries S in the following way: the left half of the distribution (a 0.5-weighted folded normal distribution) is mapped to S where t < t_b − g_l, and the right half (also a 0.5-weighted folded normal distribution) is mapped to S where t > t_e. We illustrate this mapping of the Gaussian distribution onto the timeseries in Fig 3. Accordingly, we copy the signal snippet of length g_l starting from the point in time that corresponds to x_r on the timestamp signal t. We insert this copied snippet into the gap g[i] and add a noise signal equal to 5% of a sample drawn from a Gaussian distribution with mean μ_S and variance σ²_S equal to the mean and variance of the signal S. Finally, we update both the timeseries S and the corresponding timestamps t. This procedure is repeated until all gaps g[i] in S greater than th_g are imputed.
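The gap-filling step above can be sketched as follows. This is a simplified illustration rather than the study's implementation: it handles a single NaN-marked gap, and the mapping from the two folded-normal halves to donor positions is reduced to a sign-and-offset rule (the sign of the standard-normal draw selects a donor before or after the gap, and its magnitude sets the offset from the gap edge):

```python
import numpy as np

rng = np.random.default_rng(0)

def impute_gap(signal, gap_start, gap_len):
    """Fill signal[gap_start:gap_start+gap_len] with a nearby snippet.

    Simplified sketch of the stochastic surrogate method: a draw
    x_r ~ N(0, 1) decides whether the donor snippet is taken from
    before the gap (x_r < 0) or after it (x_r > 0), and |x_r| (clipped
    at 3 sigma) sets how far from the gap edge the donor starts.
    Gaussian noise scaled to 5% of a draw from N(mean, std) of the
    observed signal is added to the copied snippet.
    """
    s = np.array(signal, dtype=float)          # work on a copy
    observed = s[~np.isnan(s)]
    x_r = rng.standard_normal()
    offset = int(min(abs(x_r), 3.0) / 3.0 * gap_len)
    if x_r < 0:   # donor snippet from before the gap
        start = max(gap_start - gap_len - offset, 0)
    else:         # donor snippet from after the gap
        start = min(gap_start + gap_len + offset, len(s) - gap_len)
    donor = s[start:start + gap_len].copy()
    noise = 0.05 * rng.normal(observed.mean(), observed.std(), size=gap_len)
    s[gap_start:gap_start + gap_len] = donor + noise
    return s
```

Iterating this over all gaps longer than th_g, in increasing order of length, reproduces the overall procedure described above.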

Feature extraction
We extracted HRV metrics from the ECG signals and actigraphy features from the three-axis accelerometer signals in the dataset. The HRV metrics were extracted using the open-source PhysioNet Cardiovascular Signal Toolbox provided by Vest et al. [31]. We extracted 24 distinct HRV metrics, including time-domain measures, frequency-domain measures, entropy measures, phase-rectified signal averaging (PRSA) measures, and other non-linear metrics. We used the default window-length settings provided in the toolbox: a 300-second-long feature extraction window with a 30-second shift. We used the SQI-based ECG data imputation to maximize the amount of good ECG data. For a given patient-visit, we computed the mean and variance of each HRV metric between 10 PM and 10 AM. We chose this period to include ECG data during sleep and discard the noisier signal recorded during the daytime and evenings. Hence, each HRV metric provided two features, resulting in 48 features from the 24 HRV metrics. We extracted the actigraphy features from the z-axis of the acceleration signal from the right hypochondrium using the open-source Actigraphy Toolbox [34], first converting the acceleration signal to activity counts using Oakley's method as described by Borazio et al. [33]. We then extracted the following eight features using the toolbox: (1) Interday stability, (2) Intraday variability, (3) Least active 5 hours, (4) Most active 10 hours, (5) Rest activity, (6) Mesor, (7) Amplitude, and (8) Acrophase. The last three features were based on Cosinor Rhythmometry. The actigraphy features required two consecutive 24-hour periods (midnight to midnight) of data for feature computation. Thus, we identified the two consecutive 24-hour periods with the least missing data for each patient-visit. If both days did not have at least 12 hours of acceleration data per day, the patient-visit was discarded.
To impute missing data, we used the PCHIP interpolation.
Finally, we computed the MSTE and MSNR features using two consecutive days of HR and activity count signals, utilizing the stochastic surrogate data imputation technique to impute missing data. We processed the HR signal, the deceleration capacity (DC) of the RR-interval signal [36], and the activity count (Act) signal for computing the features. Transfer entropy, denoted TE_{X→Y}, is a measure of directional coupling between two concurrently sampled timeseries X = {x_1, x_2, ..., x_N} and Y = {y_1, y_2, ..., y_N}. Formally, TE_{X→Y} is a reduction in uncertainty, given by the conditional entropy of y_i given its past values minus the conditional entropy of y_i given both its own past values and the past values of X:

TE_{X→Y} = H(y_i | y^{(l)}_{i−w}) − H(y_i | y^{(l)}_{i−w}, x^{(k)}_{i−τ}),

where i indicates a given point in time, τ and w are the time lags in X and Y, respectively, and k and l are the block lengths of past values in X and Y, respectively. Both k and l were set to 1 to improve computational speed, and both τ and w were set to 1 under the assumption that the maximum auto-transfer of information occurs from the data point in X immediately before the target value in Y, and vice versa. These choices of k = l = τ = w = 1 are appropriate in biomedical experiments, as the absolute values of auto-correlation functions tend to decrease monotonically as the time lag increases [25]. In our experiments, we computed the TE between the following signal pairs: (1) HR→DC, (2) HR→Act, (3) DC→Act, (4) DC→HR, (5) Act→HR, and (6) Act→DC. We computed these TE values at scales 1 to 10 using the coarse-graining algorithm [37] to obtain the Multiscale Transfer Entropy (MSTE) features. The probability densities for the estimation of the TE values were estimated using the Darbellay-Vajda (D-V) adaptive partitioning algorithm [25,26,38]. Further, we applied 3-D D-V partitioning to the HR-DC-Act signals and computed the multiscale network representation (MSNR) features.
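The coarse-graining used to move from scale 1 to scale 10 replaces each non-overlapping window of `scale` samples with its mean; a minimal sketch (function name ours):

```python
import numpy as np

def coarse_grain(x, scale):
    """Coarse-grain a time series for multiscale analysis [37]:
    average non-overlapping windows of length `scale`. Scale 1 returns
    the original series; scale s divides the effective sampling rate
    by s (e.g., 30-second samples become 5-minute samples at s = 10).
    """
    x = np.asarray(x, dtype=float)
    n = (len(x) // scale) * scale      # drop the ragged tail
    return x[:n].reshape(-1, scale).mean(axis=1)
```

Each MSTE or MSNR feature at scale s is then obtained by running the corresponding computation on `coarse_grain(signal, s)`.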
The network representation features included the following 11 metrics: (1) Number of nodes (the total number of nodes in the network), (2) Average degree (the average degree over all nodes in the network, where the degree of a node is the number of its neighboring edges), (3) Number of loops (the total number of independent loops in the network, also known as the "cyclomatic number", i.e., the number of edges that must be removed so that the network has no cycles), (4) LOOP3 (the total number of loops of size 3 in the network), (5) LOOP4 (the total number of loops of size 4 in the network), (6) Average shortest path length (the average of d(u, v) over all node pairs, where d(u, v) is the length of the shortest path from u to v), (7) Graph diameter (the maximum of eccentricities over all nodes, where the eccentricity of a node u is the maximum of d(u, v) over all nodes v in the node set V), (8) Graph radius (the minimum of eccentricities over all nodes in the network), (9) Spectral radius (the largest-magnitude eigenvalue of the adjacency matrix of the network), (10) Trace (the sum of the eigenvalues λ_i of the adjacency matrix, i.e., Σ_i λ_i), and (11) Energy (the squared sum of the eigenvalues of the adjacency matrix A; more formally, the energy of a network G is E(G) = Σ_i λ_i²). We computed the above-described MSNR features at scales 1 to 10 using the coarse-graining technique [37]. Using the surrogate data imputation technique, we obtained 100 imputations for each patient-visit, generating 3200 datapoints in total. We computed the MSTE and MSNR features for all 3200 datapoints and then computed the mean and variance over the 100 imputations for each patient-visit. In the end, we used these per-visit means and variances of the MSTE and MSNR metrics as features for model development.
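The three spectral metrics can be computed directly from the eigenvalues of the adjacency matrix. A small sketch following the definitions above (note that "energy" here is the squared sum of eigenvalues, as defined in this work, rather than the sum of absolute eigenvalues sometimes used in graph theory; the function name is ours):

```python
import numpy as np

def spectral_metrics(A):
    """Spectral radius, trace, and energy of a network, computed from
    its symmetric adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    eig = np.linalg.eigvalsh(A)  # real eigenvalues for a symmetric matrix
    return {
        "spectral_radius": float(np.max(np.abs(eig))),  # largest |lambda_i|
        "trace": float(np.sum(eig)),                    # sum of lambda_i
        "energy": float(np.sum(eig ** 2)),              # sum of lambda_i^2
    }
```

For a triangle graph (three mutually connected nodes), the eigenvalues of the adjacency matrix are 2, −1, and −1, giving a spectral radius of 2, a trace of 0, and an energy of 6.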

Rett syndrome severity classification experiments
We developed separate models for each feature sub-group and for combinations of feature sub-groups, obtaining 15 classifiers corresponding to the following feature combinations: (1) HRV, (2) Actigraphy, (3) MSTE, (4) MSNR, (5) HRV + Actigraphy, (6) HRV + MSTE, (7) HRV + MSNR, (8) Actigraphy + MSTE, (9) Actigraphy + MSNR, (10) MSTE + MSNR, (11) HRV + Actigraphy + MSTE, (12) HRV + Actigraphy + MSNR, (13) HRV + MSTE + MSNR, (14) Actigraphy + MSTE + MSNR, and (15) HRV + Actigraphy + MSTE + MSNR. We used the LASSO-based logistic regression classifier to assess the performance of the different feature combinations. The models were assessed via a leave-one-patient-out cross-validation (LOOCV) experiment. Given that the dataset was small (n = 20 patients), we chose LOOCV over k-fold validation because LOOCV gave each model a larger number of datapoints to learn from. The LOOCV method is approximately unbiased, because a training fold differs from the entire dataset by only one datapoint [39,40]. However, we note that LOOCV tends to have a higher variance; this issue can potentially be resolved with a larger dataset, and we expand on it in the Discussion section of the paper. Within-training-fold cross-validation (CV) is a standard approach for variable selection and hyperparameter tuning [41,42]; therefore, hyperparameter tuning was performed using a within-training-fold three-fold CV. We chose three-fold CV over a larger k-fold CV to keep the number of computations relatively small. We measured classification performance using the area under the receiver operating characteristic curve (AUC), and used this metric to compare the classification performance of the 15 classifiers.
For each feature combination, the entire classification experiment was repeated five times with different random seeds to account for variability in model coefficients arising during hyperparameter tuning (the within-training three-fold data split changed each time). The models' outputs over the five repetitions were combined by taking the median of the classification probability outputs, and the final AUC for each feature combination was determined by comparing this median output against the ground-truth labels.
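The median pooling of repeated runs before scoring can be sketched as follows. The function names are illustrative; `auc` uses the rank-sum (Mann-Whitney) identity rather than a library call, with the standard 0.5 correction for tied scores:

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum identity:
    the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def pooled_auc(labels, repeated_scores):
    """Median-pool per-visit probabilities over repeated LOOCV runs
    (rows = repetitions, columns = patient-visits), then score once
    against the ground-truth labels."""
    pooled = np.median(np.asarray(repeated_scores, dtype=float), axis=0)
    return auc(labels, pooled)
```

With five repetitions of the LOOCV experiment, `repeated_scores` would be a 5 × 32 array of per-visit probabilities for a given feature combination.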

Feature popularity score
Apart from measuring classification performance, we computed a novel feature popularity score for the features used in classification, which allowed us to measure feature importance and compare features. Since all classification experiments comprised 20 patients, the leave-one-patient-out cross-validation approach produced 20 models. As described previously, we repeated the classification experiment five times for a given feature combination, resulting in 5 × 20 = 100 classifier models per feature combination. In these 100 models, for each feature f, we counted the number of models n_f in which it had a non-zero LASSO coefficient. In each model, most model weights were zero and only a few features were picked; thus, the normalized count of the number of times a feature was picked was a natural choice for expressing feature popularity. The feature popularity score ρ for the feature f was then computed as:

ρ_f = n_f / 100.

This metric measured the popularity of the feature: the more times a feature was picked by the LASSO-based classifiers, the more popular it was. Note that we did not use the actual model weight value associated with a given feature in computing feature popularity. We refer the readers to additional reading [43-45] for a comprehensive understanding of other feature selection methods.
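Under this definition, the score reduces to the fraction of models in which a feature received a non-zero coefficient; a minimal sketch (illustrative names, with the coefficient matrix assumed to be collected from the fitted models):

```python
import numpy as np

def feature_popularity(coef_matrix):
    """Feature popularity scores from fitted LASSO coefficients.

    coef_matrix : (n_models, n_features) array of coefficients
                  (100 models per feature combination in this study).
    Returns rho : per-feature fraction of models with a non-zero
                  coefficient; the coefficient magnitudes themselves
                  are deliberately not used.
    """
    coefs = np.asarray(coef_matrix, dtype=float)
    return (coefs != 0).mean(axis=0)
```

A score of ρ = 1.00 therefore means the feature was selected by every one of the 100 models in which it was available.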

Results
HRV, breathing, and physical activity are thought to reflect Rett syndrome severity, but the underlying correlations had yet to be measured. Our experiments investigated the effect of HRV metrics, actigraphy, MSTE, and MSNR features on Rett syndrome severity classification by assessing 15 binary classification models (Fig 4A) described in the section Rett syndrome severity classification experiments. For this, as described in the section Data collection, we utilized a cohort of 20 patients with Rett syndrome, of which 10 patients had only low-severity patient-visits, 9 patients had only high-severity patient-visits, and one individual had both low-severity and high-severity patient-visits. We illustrate this in Fig 4B.

Performance comparison of different feature combinations
The best binary Rett severity classifier used the feature combination of HRV metrics, MSTE features, and MSNR features, and obtained a pooled AUC equal to 0.92. When we used the four feature sub-groups separately for classification, HRV metrics performed the best, with a pooled AUC equal to 0.76, followed by MSTE features (AUC = 0.68), actigraphy metrics (AUC = 0.63), and MSNR features (AUC = 0.55). Among the 2-combinations of feature sets, the combination of HRV and MSTE performed the best, with an AUC = 0.80.

Feature popularity
The mean deceleration capacity (μ_PRSA-DC) was the most popular feature for Rett syndrome severity classification. Using the formula described in the section Feature popularity score, we extracted the top-10 most popular features utilized by the best classifier (HRV, MSTE, and MSNR) and the distribution of their corresponding weights (Fig 4C). The feature μ_PRSA-DC came out on top, with a feature popularity score ρ = 1.00.

Discussion
As of 2022, no clinically meaningful disease-modifying treatments exist for patients with Rett syndrome. We instead rely on multiple therapeutics and symptomatic treatment strategies geared towards managing respiratory ailments, treating seizures, improving gastrointestinal function, and improving motor skills [46]. As new drugs and therapeutics are discovered, the need for objective measures that can be used in clinical trials increases. Neurological disorders, including Rett syndrome, present symptoms that are difficult to measure. Most efficacy assessments are based on parent, clinician (and, in some cases, patient) interpretation of symptom severity, which by nature introduces bias and often results in a high noise-to-signal ratio, as exemplified by high placebo response rates. Establishing objective measurements in patients directly, especially for symptoms that are difficult or impossible to observe, would open the door to evaluating therapeutics in novel ways and has the potential to expedite therapeutic development in multiple ways. Namely, (1) it would reduce bias; (2) it would help reduce clinical trial sample sizes by reducing the noise-to-signal ratio; and (3) it would facilitate shorter trial durations by capturing continuous data at home. The measurement of autonomic function could be an early biomarker of therapeutic efficacy and may be particularly relevant for curative therapeutics, such as gene therapies, that theoretically should improve or restore baseline function. If trials focus on efficacy measures that require learning and implementation (like mobility, communication, and hand use), improvement may take significantly longer to detect than a correction of autonomic function, which should not require learning. Thus, in the current study, we attempted to address these unmet needs by capturing physiological (ECG) and body activity (three-axis accelerometer) data from a 20-patient cohort.
We chose to regress features extracted from the ECG signal and body activity measurements against the binarized CGI-S to correlate objective measures attained from a wearable biosensor with an overall symptom severity scale. We showed that the inclusion of the multiscale features (MSTE and MSNR) along with the HRV metrics provided a performance improvement of 21% in terms of the AUC (AUC = 0.92) when compared to using HRV metrics alone (AUC = 0.76). This improvement is attributable to the information embedded in the higher-order interactions between the HR, DC, and activity count signals. While the transfer entropy-based features enabled us to study the 2-dimensional variations between all combinations of the HR, DC, and activity count signals in both directions, the network representation-based features enabled us to study the 3-dimensional interactions between these signals. Further, computing these features at multiple coarse-graining scales provided a means to analyze the signal modulations occurring due to different physiological phenomena (at different timescales). The coarse-graining scales used in this study ranged from 1 to 10, corresponding to a variation of the sampling rate from 1/30 Hz to 1/300 Hz (i.e., from a sample every 30 seconds to a sample every 5 minutes). This allowed us to study the effect of different physiological processes, including respiration, vagal activity, and circadian rhythm, on the interactions between the HR, DC, and activity count signals. Our analysis suggested that the mean value of the DC of the HR signal captured on the BioStamp nPoint between 10 PM and 10 AM was the most popular feature for classifying low-severity patients versus high-severity patients. It was the most popular feature in all eight classifier models in which it was used, with a consistent and highest-possible feature popularity score of 1.
The computation of DC involved synchronizing the phases of all periodic components of the signal irrespective of their timescales [47]. The DC captured the overall deceleration of the sinus rhythm due to physiological processes occurring at different timescales, including vagal (parasympathetic) activity, the baroreflex, and the circadian rhythm. In Fig 5B, we depicted the variation of the 5-minute averages of DC between 10 PM and 10 AM for both low-severity and high-severity patient-visits. Apart from the mean value of the DC, we observed among the most popular features the variance of the transfer entropy at the 8th coarse-graining scale from (1) the activity count signal to the HR signal and (2) the activity count signal to the DC signal. Further, the variances of the number of loops of size 4 in the network representations at coarse-graining scales 3 and 5 were among the top-5 most popular features in the best classifier model (AUC = 0.92). Thus, in Fig 5C, we illustrated an example of the network representations for a low-severity and a high-severity Rett patient at the 3rd, 5th, and 8th coarse-graining scales. In addition to demonstrating the viability of a wearable biosensor for estimating disease severity based on objective measurements taken directly from patients, another novelty of our work was the handling of missing data in signals captured using wearables for patient-state analysis. We proposed three different techniques for this purpose. The first method, combining data from multiple channels, boosted the amount of available data by 13.4% when compared to using a single channel (medial chest) and improved the average SQI by 10%. In Fig 5A, we illustrated this improvement in SQI for each of the 32 patient-visits due to the use of the novel SQI-based ECG imputation. The second method helped impute the activity count data and reduce missingness.
The third and final technique of generating stochastic surrogates enabled us to compute the MSTE and MSNR features. It was specifically developed for this study, without which no multiscale analysis could have been performed. We computed the sample mean and sample variance of all MSTE and MSNR metrics across all surrogate representations and used them as features for our classification models. The sample mean is a measure of the samples' central tendency, and the sample variance is a measure of the spread of the samples. Since our method used a normal distribution to sample the random variables, and a normal distribution can be completely defined by a mean value and variance, it was rational to use the sample mean and variance to characterize the underlying MSTE and MSNR values.
The small sample size (n = 20) presented us with the challenge of choosing between binary and multi-class (seven-class) classification. The number of patients was not evenly distributed across the seven levels of the scale, which made regressing on these levels very hard (and likely to skew toward the better-represented classes). Moreover, with so few individuals, a binary task provided more datapoints per class (n > 10 datapoints per class). Thus, separating patients by general severity into two categories, with the milder patients in one group and the more severe patients in the other, was the best way to organize patients in a clinically meaningful way that would provide sufficient numbers. Identifying a binary signal that can distinguish more severe from less severe patients is meaningful for clinical development. We intend the binary outcome to inform a binary intervention (e.g., increase medication or not); a continuous (integer) scale makes it hard to know whether or not to treat someone. Ideally, with enough patients, it would be valuable to tease apart the differences between each gradation on the severity scale, but this was not feasible in such a small study and is unlikely to be accomplished in a rare disorder. Further, all the models validated in our experiments were trained via leave-one-patient-out cross-validation. Because of the small number of participants in the study, the validated models may not extend with similarly high performance to an external dataset. This is a limitation of our work, and further validation of the trained models and the features learned in our study on an external and larger dataset is part of our future work. However, Rett syndrome being a rare disorder (affecting 1 in 10,000 females) introduces the practical constraint that recruiting large cohorts (n > 100) for a single study is difficult.
A potential solution to address this issue could be to create simultaneous ECG and three-axis acceleration datasets from Rett syndrome patients and to build a corpus of these datasets on open-source repositories such as PhysioNet [48]. We did not compare our L1-regularized logistic regression method with other classifiers, including random forests, k-nearest neighbors, and XGBoost, for binary classification or feature importance measurement. Comparing various classifiers on a given classification task can be useful for determining the best model as well as for improving our understanding of the underlying features. This is a limitation of our current study and is something we will explore in the future. Additionally, due to the large number of potential features, the feature importance measurement procedure could be made statistically robust by performing repeated-test correction [49]. However, as the purpose of the study was exploratory in nature [50] and the cohort size (n = 20) was rather small, we did not perform the repeated-test correction.

Conclusion
We developed a machine learning model that utilized features characterizing HRV, body movement, and the interactions between the two to estimate Rett syndrome severity in a group of 20 female Rett patients. For this, we developed a novel stochastic method for biosignal data imputation. We obtained the highest pooled AUC, equal to 0.92, utilizing the feature combination of HRV metrics, MSTE features, and MSNR features. Further, the proposed approach provided us with physio-motor biomarkers that could be used in clinical trials as objective metrics to quantify improvement in overall patient state. Specifically, the mean DC of the HR signal captured between 10 PM and 10 AM using the BioStamp nPoint biosensor was the most popular feature, with a feature popularity score equal to 1. In conclusion, our study (1) implemented a novel data imputation technique for physiological signals, (2) developed a machine learning model to estimate Rett disease severity, and (3) developed objective measures that characterize autonomic function in Rett syndrome.