Abstract
Successfully exploiting digital phenotyping data with machine learning to forecast mental states in psychiatric populations could greatly improve clinical practice. Previous research focused on binary classification and continuous regression, disregarding the often ordinal nature of prediction targets derived from clinical rating scales. In addition, mental health ratings typically show substantial class imbalance or skewness that needs to be accounted for when evaluating predictive performance. Moreover, it remains unclear which machine learning algorithm is best suited for forecast tasks; the eXtreme Gradient Boosting (XGBoost) and long short-term memory (LSTM) algorithms are 2 popular choices in digital phenotyping studies. The CrossCheck dataset includes 6,364 mental state surveys using 4-point ordinal rating scales and 23,551 days of smartphone sensor data contributed by patients with schizophrenia. We trained 120 machine learning models to forecast 10 mental states (e.g., Calm, Depressed, Seeing things) from passive sensor data on 2 predictive tasks (ordinal regression, binary classification) with 2 learning algorithms (XGBoost, LSTM) over 3 forecast horizons (same day, next day, next week). A majority of ordinal regression and binary classification models performed significantly above baseline, with macro-averaged mean absolute error values between 1.19 and 0.77, and balanced accuracy between 58% and 73%, which corresponds to similar levels of performance when these metrics are scaled. Results also showed that metrics that do not account for imbalance (mean absolute error, accuracy) systematically overestimated performance, XGBoost models performed on par with or better than LSTM models, and a significant yet very small decrease in performance was observed as the forecast horizon expanded. In conclusion, when using performance metrics that properly account for class imbalance, ordinal forecast models demonstrated comparable performance to the prevalent binary classification approach without losing valuable clinical information from self-reports, thus providing richer and easier-to-interpret predictions.
Author summary
Symptoms associated with mental health disorders vary greatly over time. Periods of partial remission unfortunately alternate with relapses defined by a marked worsening of symptoms. Hence, assessing future risk and adopting preventive measures is a key challenge for clinical psychiatry. With their many sensors, smartphones can provide novel insights into human behavior outside the medical office. With machine learning, a branch of artificial intelligence, such smartphone sensor data can be used to predict future mental states and symptoms in psychiatric patients. The present work highlights the importance of predicting fine-grained levels of symptom severity, as commonly reported by patients using so-called ordinal rating scales. Such ordinal predictions were no less accurate than the simplified binary predictions (on/off, high/low) often reported in previous efforts. We also underscore that severe mental states are rare compared to healthy ones, and that this imbalance brings methodological challenges that need to be taken into account to develop valid predictive models.
Citation: Jean T, Guay Hottin R, Orban P (2025) Forecasting mental states in schizophrenia using digital phenotyping data. PLOS Digit Health 4(2): e0000734. https://doi.org/10.1371/journal.pdig.0000734
Editor: Dhiya Al-Jumeily OBE, Liverpool John Moores University, UNITED KINGDOM OF GREAT BRITAIN AND NORTHERN IRELAND
Received: December 1, 2023; Accepted: December 22, 2024; Published: February 7, 2025
Copyright: © 2025 Jean et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: De-identified digital phenotyping data from the CrossCheck study are openly available from: https://pbh.tech.cornell.edu/data.html. The Python code used to generate all the results reported in this paper can be obtained from a dedicated GitHub repository: https://github.com/zilto/ordinal-forecasting-digital-phenotyping.
Funding: TJ received student fellowships from the Fonds de recherche du Québec - Santé (#303584) and the Canadian Institute for Health Research. PO was supported by a salary award “chercheur boursier junior 1” of the Fonds de recherche du Québec - Santé (#266630, #280391) and the Courtois foundation through the Courtois NeuroMod project (https://www.cneuromod.ca). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Although severe psychiatric disorders such as schizophrenia are often chronic, they are also notoriously temporally dynamic, with the severity of symptoms varying over time [1,2]. In a uniquely individual way, periods of partial remission alternate with recurrent relapses defined by a marked worsening of symptoms [3–5]. Monitoring a patient’s symptom trajectory and predicting future risks are therefore key clinical tasks in order to implement required preventive measures [6]. Unfortunately, routine medical appointments provide too few and too distant observations to adequately monitor complex individual temporal dynamics [7,8]. Additionally, clinical information is primarily collected through interviews in the medical office, which have limited generalizability to the patient’s day-to-day life [9] and depend heavily on the patient’s partly flawed memory [10]. Digital phenotyping holds promise in this regard, as it allows human behavior and mental health to be continuously characterized outside the medical environment using smartphones [11,12]. First, patients can use their device to periodically rate their symptoms on clinical scales as their daily life unfolds, with the ability to remotely track symptoms over time leading to improved clinical outcomes [13]. Second, digital phenotyping leverages passive data from the device’s sensors (e.g., Wi-Fi, Bluetooth, GPS, accelerometer) to render rich facets of behavior. For instance, sedentariness may be associated with low GPS activity, and sleep disruption may be reflected in nightly phone unlocks. Machine learning plays a key role in transforming this unprecedented volume and granularity of data into insights into mental health [14]. Critically, machine learning may be further exploited to develop models that accurately predict future fluctuating symptoms (e.g., frequency of hallucinations) and acute events (e.g., hospitalisations), which could improve clinical practice in the future [15–17].
Garcia-Ceja et al. [18] distinguish three types of studies that demonstrate the relevance of digital phenotyping to the characterization of mental illness: association studies merely explore the statistical relationships between inputs (e.g., sensor data) and a target (symptom level); detection studies use machine learning to predict the target at the current time from the inputs, akin to diagnosis; and forecasting studies use machine learning to predict a future target from the inputs, similar to prognosis. First, research has validated the presence of statistical associations between passive sensing data and mental states, both in healthy [19] and clinical populations [20,21]. For instance, relevant digital markers can be established to differentiate healthy from clinical populations for accelerometry [22] and mobility [23,24] features, as highlighted in a review of 46 studies on this topic [25]. Second, several studies have demonstrated that mental health-related outcomes can be successfully predicted from smartphone sensor data using machine learning. In major depression, participants with an established diagnosis can be distinguished from non-depressed participants [26] and the absence or presence of specific symptoms can be predicted in this clinical population [27]. Similar work led to symptom-level detection in bipolar disorder [28] and schizophrenia samples [29]. The binary accuracy of such predictive models ranged from 65% to 98% across 40 studies [30]. The large majority of these studies used supervised machine learning, either classification or regression, with gradient-boosted decision trees, support vector machines, linear models, and neural networks being the most commonly used algorithms, in that order. Third, forecast studies providing predictions about future health outcomes have also been published, although they are scarcer than association and detection studies. The feasibility of predicting future mood and stress in a healthy population has been replicated a few times [31,32]. Similar approaches were successful in predicting clinical scale scores and specific psychiatric symptoms for depression [33], bipolar disorder [34], and anxiety [35]. The predictive task was either binary classification (i.e., low/high categories) or continuous regression (i.e., an outcome score), with the forecast horizon extending up to a week in the future. Most of the aforementioned forecast studies investigated the predictive performance of recurrent neural networks amongst other machine learning algorithms, in line with the idea that these types of algorithms are best suited for a forecasting task given their ability to model long-term dependencies and latent variables [36,37]. Despite their success in detection studies, gradient-boosted decision tree models have not been thoroughly investigated for forecasting.
To date, detection and forecasting studies have focused on solving binary classification or continuous regression tasks even though the target of the prediction often comes from ordinal rating scales [18,25]. Consequently, the resulting binary or real-numbered predictions do not match the ordinal scale’s interpretation guidelines nor refer to well-defined constructs, leaving key clinical information behind. Previous work evaluated XGBoost models on the same dataset for the tasks of binary classification, continuous regression, and multiclass classification [38]. While multiclass classification preserves the original response items, it loses their ordering and faces the rank inconsistency problem [39]. Ordinal regression (or ordinal classification) models preserve both the classes and their ordering, resulting in rank-consistent discrete predictions that are easy to interpret with existing validated guidelines. Implicitly, binary classification and continuous regression are often used to mitigate the effect of the small number of examples per class (i.e., class imbalance) on performance. Alas, the data processing inequality from information theory states that variable transformations such as binarization cannot increase the variable’s information content [40,41]. Possible gains in predictive performance come at the cost of solving a problem that ignores nuances of the collected data. Still, transforming ordinal scale ratings into a binary target may be a well-motivated modelling decision if done based on a scale’s interpretation guidelines and not merely to simplify the predictive task or reduce class imbalance [26,42]. For all learning tasks, dedicated evaluation metrics are required to properly evaluate model performance when dealing with class imbalance [43].
Our first objective was to assess the potential performance cost of using ordinal regression compared to binary classification to forecast future mental states, using passive sensing data exclusively. We investigated the potential mediating effect of binarization on the relationship between class imbalance and performance. Our second objective was to provide a comprehensive benchmark of recurrent neural networks and gradient boosted decision trees models for digital phenotyping forecasting to question the implicitly assumed superiority of the former, while systematically exploring the effect of the forecast horizon on predictive performance. To this end, we used the publicly available digital phenotyping dataset CrossCheck [44,45]. Previous studies using these data have explored the relationship between passive sensing data and self-reported mental states [29,44–51], but none have specifically addressed the importance of preserving the ordinal nature of self-reports and the impact of class imbalance on performance.
Methods
Dataset
We obtained the publicly available de-identified data from the CrossCheck study released in 2020. This digital phenotyping dataset was collected as part of a randomized controlled trial (clinical trial registration: ClinicalTrials.gov, #NCT01952041) conducted at the Zucker Hillside Hospital in New York City, New York between 2015 and 2017. Ethics approval was obtained from the institutional review boards of Dartmouth College (#24356) and North Shore-Long Island Jewish Health System (#14-100B), and all psychiatric outpatients provided informed consent to participate. Inclusion criteria were a diagnosis of schizophrenia, schizoaffective disorder, or psychosis not otherwise specified; 18 years of age; a significant psychiatric event such as inpatient psychiatric hospitalization or psychiatric hospital emergency room visit within the last 12 months. Specific diagnoses were not included in the shared dataset. Data used in the present project comes from 62 patients assigned to the smartphone arm of the clinical trial. They were provided with a Samsung Galaxy S5 Android smartphone on which a mobile app continuously collected passively sensed data for up to one year.
A series of high-level passive sensing features were made available in the dataset (Table 1). Features were computed daily and separately for 6-hour periods: morning (6am–12pm), afternoon (12pm–6pm), evening (6pm–12am) and night (12am–6am). In total, 23,551 days of passive sensing data, without any missing value across all features, were available for analysis. Participants were prompted to provide self-reports about their mental states (Table 2) every Monday, Wednesday, and Friday, with only a minority (3%) of self-reports being obtained on other days of the week. In each self-report survey, 10 distinct items asked the participant about a particular mental state over the recent past, as rated on a 4-point ordinal scale (“Not at all”, “A little”, “Moderately”, “Extremely”). There were 5 positive items for which a high score describes a positive outcome, and 5 negative items for which a high score describes a negative outcome. A total of 6,364 surveys were completed, corresponding to 63,640 mental state items being rated.
Data processing
The dataset was partitioned into time-based training, validation and test sets that accounted for temporal dependencies in the data [43,52]. The test set contained the latest 7 surveys from each participant, the validation set contained the previous 7 surveys, and the training set included all (>7) earlier surveys (Fig 1). Participants with fewer than 21 surveys in total were excluded. Consequently, the number of participants decreased from 62 to 61 for models predicting the next week horizon. Depending on the forecast horizon, the training set included from 5,163 to 5,307 surveys while the validation and test sets each included 427 to 434 surveys. By representing each participant equally in the test set, model evaluation was not biased by participants contributing more data. As a trade-off, the number of training examples per participant varied considerably, from 9 to 181 (median = 90.5, interquartile range = 74.5). The training and validation sets were used for model development (preliminary experiments, hyperparameter tuning, etc.) while the test set served for final model performance evaluation. Since our splitting strategy does not control for distribution drift [53,54], we assessed distribution variation across time splits, especially for rarer classes.
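The per-participant, time-based split can be sketched as follows (a minimal illustration; the column names participant_id and survey_date are hypothetical placeholders, not the identifiers of the released dataset):

```python
import pandas as pd

def time_based_split(surveys: pd.DataFrame, n_val: int = 7, n_test: int = 7, min_total: int = 21):
    """Per-participant, time-ordered split: latest surveys go to the test set,
    the previous ones to validation, and everything earlier to training.

    Assumes one row per completed survey with hypothetical columns
    'participant_id' and 'survey_date'.
    """
    train, val, test = [], [], []
    for _, group in surveys.sort_values("survey_date").groupby("participant_id"):
        if len(group) < min_total:                          # exclude participants with too few surveys
            continue
        test.append(group.iloc[-n_test:])                   # latest 7 surveys
        val.append(group.iloc[-(n_test + n_val):-n_test])   # previous 7 surveys
        train.append(group.iloc[:-(n_test + n_val)])        # all earlier surveys
    return pd.concat(train), pd.concat(val), pd.concat(test)
```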
a. The test set contained the latest 7 surveys from each participant, the validation set contained the previous 7 surveys, and the training set included all (>7) earlier surveys. b. Each label to predict was paired with 3 days of input data (12 6-hour periods), separately for 3 forecast horizons (same day, next day, next week).
Since the CrossCheck dataset includes high-level features extracted from raw sensor data, our preprocessing pipeline primarily served to ensure the XGBoost [55] and LSTM [56] models received equivalent input information while meeting their respective requirements. After dataset splitting, features were standardized at the group level to Gaussian-like distributions using the Yeo-Johnson method [57]. While tree-based methods such as the XGBoost algorithm are insensitive to scaling transformations [58], this preprocessing step helps LSTM models converge [59]. For LSTM models, each self-report was paired with a sequence of 3 consecutive days of passive sensing data divided into 6-hour periods. The 3-day input sequence creation method leads to some days being present twice in the input data, and some days of the week being over-represented. The duplicated days depend on the forecast horizon used to create sequences. For XGBoost models, input sequences were reshaped into a tabular format.
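A minimal sketch of this preprocessing, using scikit-learn's Yeo-Johnson implementation on synthetic stand-in arrays (the shapes mirror the 3-day, 12-period input windows; the feature count is illustrative, not the CrossCheck feature count):

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
n_surveys, n_steps, n_features = 100, 12, 18          # 12 = 3 days x 4 six-hour periods
sequences = rng.exponential(size=(n_surveys, n_steps, n_features))  # stand-in for skewed sensor features

# Yeo-Johnson transform toward Gaussian-like distributions, fit at the group level.
# In practice, fit on training periods only and apply to validation/test periods.
yeo = PowerTransformer(method="yeo-johnson", standardize=True)
flat = sequences.reshape(-1, n_features)
sequences_t = yeo.fit_transform(flat).reshape(n_surveys, n_steps, n_features)

# LSTM input keeps the temporal axis: (n_surveys, 12, n_features).
X_lstm = sequences_t
# XGBoost input flattens it into a tabular layout: one row per survey, 12 * n_features columns.
X_xgb = sequences_t.reshape(n_surveys, n_steps * n_features)
```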
Forecasting
We aimed to predict future self-reported mental states using passive smartphone data exclusively. Past self-reports were not included in the models’ input, contrary to predictive models described in some previous work [31–33]. In total, 120 distinct machine learning models were trained. We forecasted 10 mental states (Table 2) over 3 forecast horizons (same day, next day, next week) with 2 machine learning algorithms (XGBoost, LSTM) on 2 predictive tasks (ordinal regression, binary classification). The forecast horizon was the time gap between the input data and the predicted label, which increased from 0 days to 1 day to 7 days (Fig 1). XGBoost and other gradient-boosted decision tree algorithms have been successful for regressing same-day and future self-report aggregates [45,48] as well as future clinical scale ratings [51] on CrossCheck data. LSTM models and other recurrent neural networks have also provided accurate forecasts of mood and stress in healthy subjects [31,32] and of depressive states in self-identified depressed individuals [33].
Learning task
Ordinal regression.
Like multiclass classification, ordinal regression involves multiple discrete classes, and like continuous regression, it considers an ordering of values. To predict values from (0, 1, 2, 3) corresponding to “Not at all”, “A little”, “Moderately” and “Extremely”, our XGBoost implementation simultaneously learned a continuous regression task and tuned the default thresholds (0.5, 1.5, 2.5) used to discretize the continuous predicted values into (0, 1, 2, 3). For LSTM models, we used the Conditional Ordinal Regression for Neural networks approach and its open source implementation in coral-pytorch [39]. This neural network architecture allows a single model to decompose the ordinal regression task of predicting values (0, 1, 2, 3) into 3 independent binary tasks of predicting >0, >1, and >2. XGBoost and LSTM models were respectively trained using the regression squared loss and the conditional ordinal regression loss for neural networks. The performance of both model types was optimized for the macro-averaged mean absolute error (MAMAE), which is robust to the class imbalance [60] observed in the CrossCheck dataset. This metric computes the mean absolute error (MAE) per class and then averages the results, giving equal weight to each class (S1 Text). For the sake of comparison with previous works, models were also evaluated using the regular MAE, which does not appropriately handle class imbalance [18].
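The two evaluation ideas of this section, MAMAE and threshold-based discretization of continuous predictions, can be sketched as follows (a simplified illustration with made-up labels, not the exact code from the study's repository):

```python
import numpy as np

def mamae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Macro-averaged mean absolute error: MAE computed per true class, then averaged,
    giving each class equal weight regardless of its frequency."""
    classes = np.unique(y_true)
    per_class = [np.mean(np.abs(y_pred[y_true == c] - c)) for c in classes]
    return float(np.mean(per_class))

def discretize(y_continuous: np.ndarray, thresholds=(0.5, 1.5, 2.5)) -> np.ndarray:
    """Map continuous regression outputs onto the ordinal labels 0-3.
    The thresholds start at the defaults and can be tuned on validation data."""
    return np.digitize(y_continuous, bins=thresholds)

# Example with skewed labels: a model always predicting the majority class
y_true = np.array([0] * 8 + [3] * 2)
y_pred = np.zeros(10)
print(np.mean(np.abs(y_pred - y_true)))   # plain MAE = 0.6, looks deceptively good
print(mamae(y_true, y_pred))              # MAMAE = 1.5, penalizes ignoring the minority class
```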
Binary classification.
The original 4 classes (“Not at all”, “A little”, “Moderately” and “Extremely”) were binarized using the cutoff resulting in the two best-balanced classes. Due to the skewed nature of the original label distributions, this consisted of contrasting one class against the other 3, except for one variable. Specifically, “Extremely” was converted to “Higher” and other labels to “Lower” for positive items, while “Not at all” was converted to “Lower” and all other labels to “Higher” for negative items. Due to a flatter distribution, the variable Social had “Extremely” and “Moderately” converted to “Higher” and the other two values to “Lower”. Binary classes remained imbalanced to some degree, very much so in some instances (Fig 2). Both XGBoost and LSTM models were trained using the binary cross entropy loss and evaluated with balanced accuracy (BAcc) to deal with class imbalance [61–63]. BAcc is the arithmetic mean of sensitivity (true positive rate) and specificity (true negative rate). For the sake of comparison with a large part of the relevant literature [30], models were also evaluated using regular accuracy (Acc).
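The contrast between Acc and BAcc under imbalance is easy to illustrate (the label proportions below are hypothetical, not taken from CrossCheck):

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Illustrative labels with an 80/20 imbalance
y_true = np.array([0] * 80 + [1] * 20)
y_majority = np.zeros(100, dtype=int)               # a degenerate model predicting only "Lower"

print(accuracy_score(y_true, y_majority))           # 0.80, looks deceptively good
print(balanced_accuracy_score(y_true, y_majority))  # 0.50, chance level, as it should be
```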
The left and right panels show the distributions of the original ordinal labels and recoded binary labels, respectively. Each row is associated with a mental state and each inner column with a forecast horizon. Bar plots display the proportion (%) of examples for the 4 or 2 classes (labels) in each dataset split.
Model training and validation
For each XGBoost model, the development phase included 10 automated rounds of hyperparameter optimization using the Optuna library with the TPESampler over the train and validation sets [64]. Once the best hyperparameters were determined, the final model was trained on both the train and validation sets and evaluated on the test set. For each LSTM model, the PyTorch Lightning Tune functionality was used at the beginning of training to find the optimal learning rate and batch size [65]. Given the substantial computational cost of searching the hyperparameter space, a partial grid search was conducted along with a manual inspection of training and validation curves. The following hyperparameters were selected for achieving good generalization across mental states, forecast horizons, and tasks: 150 epochs and 1 LSTM layer with 128 nodes and 10% dropout. While this architecture may be considered to have low capacity, it is well in line with previous relevant works [31–33].
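A minimal sketch of the XGBoost tuning loop with Optuna and the TPE sampler, on synthetic stand-in data and with an illustrative search space (the actual hyperparameter ranges used in the study may differ):

```python
import numpy as np
import optuna
import xgboost as xgb
from optuna.samplers import TPESampler

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(400, 24)), rng.integers(0, 4, 400)  # stand-in data
X_val, y_val = rng.normal(size=(100, 24)), rng.integers(0, 4, 100)

def objective(trial: optuna.Trial) -> float:
    """Train one configuration on the training split and score it on the validation split."""
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = xgb.XGBRegressor(**params)
    model.fit(X_train, y_train)
    y_pred = np.clip(np.rint(model.predict(X_val)), 0, 3)   # discretize to ordinal labels
    per_class = [np.mean(np.abs(y_pred[y_val == c] - c)) for c in np.unique(y_val)]
    return float(np.mean(per_class))                         # MAMAE, to be minimized

study = optuna.create_study(direction="minimize", sampler=TPESampler(seed=0))
study.optimize(objective, n_trials=10)   # 10 optimization rounds, as in the study
best_params = study.best_params
```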
Statistical testing
Statistical analyses were conducted to assess whether models performed significantly above baseline and to compare them across algorithms, forecast horizons, mental states, and learning tasks. For each of the 60 conditions (2 tasks, 3 forecast horizons, 10 mental states), a baseline distribution of performance scores (MAMAE or BAcc) was generated using a Monte Carlo method. To build a performance distribution, 1,000 prediction samples were drawn with replacement from the training set and evaluated against the test set (S2 Text). This baseline represents “best guess” predictions based on previous mental state self-reports and constitutes a more challenging baseline than random/chance-level predictions. A model significantly outperformed the baseline if its test performance was better than the baseline quantile corresponding to a p value <.05 with a Bonferroni correction for 120 models (p < 4.2 × 10−4). Considering ordinal regression and binary classification tasks separately, model performances on the test set were compared between the XGBoost and LSTM algorithms using the non-parametric Wilcoxon signed-rank test, and across forecast horizons (3 horizons) and mental states (10 variables) with non-parametric Friedman tests [52,66]. To compare performance between the ordinal regression and binary classification tasks, the MAMAE and BAcc values were transformed to a scale-normalized balanced error which ranges from 0 to 1 (S1 Text). The relationship between the scale-normalized balanced error of ordinal and binary models was compared to a perfect correlation and the residuals were inspected for discrepancies between the two learning tasks. Finally, the effect of class imbalance was assessed using the Spearman rank correlation between the predictive performance and the class imbalance of each of the 10 mental states. Class imbalance was quantified as the difference between the number of examples of the majority and the minority class, normalized by the total number of examples (0 to 1 range). The same definition was applied for ordinal regression and binary classification.
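The Monte Carlo baseline and the class imbalance measure can be sketched as follows (a simplified illustration of the procedure described above and detailed in S2 Text; the label proportions are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def monte_carlo_baseline(y_train, y_test, metric, n_samples=1000):
    """Draw label predictions with replacement from the training distribution and
    score each draw against the test labels, yielding a baseline score distribution."""
    return np.array([
        metric(y_test, rng.choice(y_train, size=len(y_test), replace=True))
        for _ in range(n_samples)
    ])

def class_imbalance(y):
    """(majority count - minority count) / total number of examples, in the 0-1 range."""
    counts = np.bincount(y)
    counts = counts[counts > 0]
    return (counts.max() - counts.min()) / counts.sum()

def mamae(y_true, y_pred):
    classes = np.unique(y_true)
    return float(np.mean([np.mean(np.abs(y_pred[y_true == c] - c)) for c in classes]))

y_train = rng.choice([0, 1, 2, 3], size=500, p=[0.78, 0.12, 0.07, 0.03])
y_test = rng.choice([0, 1, 2, 3], size=100, p=[0.78, 0.12, 0.07, 0.03])
baseline = monte_carlo_baseline(y_train, y_test, mamae)

alpha = 0.05 / 120                          # Bonferroni correction for 120 models
threshold = np.quantile(baseline, alpha)    # lower MAMAE is better, so a model must fall below this
# For BAcc (higher is better), the threshold would be the (1 - alpha) quantile instead.
print(class_imbalance(y_train), threshold)
```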
Results
Descriptive statistics
Participants tended to give high ratings (“Extremely”) more frequently on positive items (Calm, Hopeful, Sleeping, Social, Think) and low ratings (“Not at all”) on negative items (Depressed, Harm, Seeing Things, Stressed, Voices) (Fig 2). With 76% to 81% of labels belonging to “Not at all” and only 3% to “Extremely”, the class imbalance for the negative items Harm, Voices and Seeing Things was major. A lesser but still considerable imbalance between majority (33% to 48%) and minority (7% to 11%) classes was observed for positive items. Recoding ordinal classes into binary classes reduced imbalance, yet it remained very large in some cases (Harm, Voices and Seeing Things) with 76–81% for the majority class (“Lower”) and only 19–24% for the minority class (“Higher”). Upon visual inspection, no notable distribution shifts were observed between the training, validation and test sets created via time-based splitting (Fig 2). The negligible variation between splits suggests that they are representative of the full dataset and that the validation split properly estimates test performance.
Forecasting performance
For ordinal regression, 45 out of 60 models across algorithms, forecast horizons and mental states performed significantly above baseline (Bonferroni corrected P < .05) (Fig 3, S1 Fig). Non-significant models were associated with negative mental states (Depressed, Harm, Seeing things, Stressed, Voices) and the LSTM algorithm. MAMAE values ranged from 1.36 to 0.77 (median = 1.04) across all models, with the 45 significant models not exceeding MAMAE = 1.19 (median = 0.96). There was a significant difference in performance between the 2 algorithms (Z = 1, P < .001) with XGBoost models (median = 0.94) performing better than LSTM models (median = 1.08). Similarly, a significant effect of forecast horizon was observed (Q = 19.9, P < .001) although the decrease in performance as the horizon increased from same day (median = 1.02) to next day (median = 1.03) then next week (median = 1.04) was very small.
Each row is associated with a mental state and each outer column with a forecast horizon. For each of the 30 conditions, bar plots show XGBoost and LSTM test predictions. The bar height corresponds to the number of examples with each bar indicating the true class and colors reflecting the predicted classes. The MAMAE for these XGBoost and LSTM model predictions are shown against the MAMAE significance threshold (Bonferroni corrected P < .05) and the corresponding Monte Carlo baseline distribution. n.s., non-significant.
On the binary classification task, 58 out of 60 models were significantly superior to baseline (Bonferroni corrected P < .05) (Fig 4, S2 Fig). BAcc values ranged from 54% to 73% (median = 66%) with significant models all performing above 58%. Contrary to ordinal regression, no significant difference in performance was found (Z = 150, P = .1) between the XGBoost (median = 66%) and LSTM (median = 66%) algorithms. A significant effect of the forecast horizon was detected (Q = 7.9, P = .02). The decrease in performance over same day (median = 66%), next day (median = 66%), and next week (median = 65%) was consistent with the effect observed for ordinal regression. When comparing the scale-normalized balanced error of ordinal regression and binary classification to a perfect correlation, the very low average of residuals (= 0.003) suggests equivalent performance on the two tasks on average, with no predictive task clearly outperforming the other (Fig 5).
Each row is associated with a mental state and each outer column with a forecast horizon. For each of the 30 conditions, bar plots show XGBoost and LSTM test predictions. The bar height corresponds to the number of examples with each bar indicating the true class and colors reflecting the predicted classes. The BAcc for these XGBoost and LSTM model predictions are shown against the BAcc significance threshold (Bonferroni corrected P < .05) and the corresponding Monte Carlo baseline distribution. n.s., non-significant.
For each of the 60 conditions (10 mental states x 2 algorithms x 3 forecast horizons), the performance of the ordinal regression model (y-axis) is displayed against the binary classification model (x-axis) under the same condition. Balanced performance metrics (MAMAE, BAcc) were normalized to a common scale ranging from 0 to 1 (see S1 Text).
Class imbalance
Very strong effects of mental states on performance were found, both for ordinal regression (Q = 49.4, P < .001) and binary classification (Q = 47, P < .001) (Figs 3 and 4). Further inspection revealed that the effect of mental states could be explained by their class imbalance, as it correlated with the MAMAE values of ordinal regression (Spearman r = .8) and the BAcc of binary classification (r = −.56) (Fig 6). The large class imbalance observed for the variables Harm, Voices and Seeing Things was associated with poor performance (high MAMAE, low BAcc). In stark contrast, opposite effects were observed when using metrics that do not account for class imbalance (MAE, Acc). Indeed, large class imbalance was associated with higher performance, both for the ordinal regression MAE (r = −.8) and the binary classification Acc (r = .72). As a corollary, performance metrics that do and do not account for class imbalance should be negatively correlated, which held for MAMAE and MAE (r = −.66) as well as BAcc and Acc (r = −.18). The latter association was weaker given that 7 out of 10 mental states were fairly balanced and their BAcc and Acc were strongly positively correlated (r = .95).
Each scatter plot displays the test performance of 60 models under different conditions (10 mental states x 2 algorithms x 3 forecast horizons). Rows show the effect of class imbalance on performance for ordinal regression (top) and binary classification (bottom). The left and middle columns respectively highlight the relationship between class imbalance and predictive performance for balanced and unbalanced metrics, and the right column reveals the relationship between the two metrics. Each correlation is quantified by Spearman’s ρ (rho).
Discussion
We show that ordinal regression, which best preserves key clinical information, can forecast self-reported mental states from passive smartphone data with predictive performance levels comparable to those of binary classification. Class imbalance, which may be particularly pronounced for ordinal data, strongly affects model training, and performance is inadequately rendered when unsuitable evaluation metrics are used. XGBoost performs as well as or better than the LSTM algorithm for forecasting. Increasing the forecast horizon incurred a negligible decrease in performance.
While mental states are often collected on ordinal rating scales in digital phenotyping research, the majority of past studies have formulated predictions using binary classification [24,26,32] or continuous regression [29,48,51]. Ordinal models adequately preserve the order of discrete classes without assuming continuity between them. Importantly, ordinal predictions exist on the data collection scale and can be interpreted by clinicians using the scale’s interpretation guidelines. Critically, our findings demonstrate that adopting an ordinal regression formulation that best meets this clinical motivation does not lead to any systematic cost in predictive performance compared to using binary classification.
The rarity of clinically relevant mental-health events (e.g., psychotic episodes) leads to datasets composed mainly of healthy examples, both in terms of inputs and labels. In this study, mental states with larger class imbalance (e.g., Harm, Voices, Seeing things) were associated with lower performance, a typical challenge in machine learning [67]. Since imbalance is a core property of mental health data, it is best to leave it unadjusted and use adapted methods [63]. The majority of past digital phenotyping studies only report performance metrics that do not account for class imbalance (e.g., Acc, root mean squared error, MAE) [30], thereby allowing models that predict only the majority class to appear deceptively good. To overcome these issues, we used metrics that account for class imbalance (BAcc and MAMAE) to train and evaluate models. We show that models appear to perform best on variables describing rare events (e.g., Harm, Voices, Seeing things) when using unbalanced metrics, but that this relationship is reversed when using the appropriate balanced metrics. Consequently, inappropriate metrics can lead to systematically incorrect conclusions when evaluating model performance or the impact of procedures such as feature engineering, feature selection, or model selection. In addition, resampling techniques have been used in past digital phenotyping studies in an attempt to improve performance by mitigating class imbalance [68]. However, these methods provide little to no benefit for modern algorithms like XGBoost [69], and resampling has been shown to yield no performance improvement for binary models of medical diagnosis while deteriorating model calibration, which should be a key performance criterion when a probabilistic interpretation is necessary (e.g., risk scores) [70]. Instead, tuning the model’s decision threshold is a more sensible approach. Cost-sensitive learning, which incorporates the cost of a prediction’s outcome into the learning process, has also been shown to be an effective solution to overcome class imbalance [71].
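As an illustration of the threshold-tuning alternative to resampling, one could select a decision cutoff on validation data (a hedged sketch on synthetic labels; scale_pos_weight in the comment is XGBoost's built-in class-weighting parameter for the cost-sensitive route):

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score

def tune_threshold(y_val, proba_val, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the probability cutoff that maximizes balanced accuracy on validation data,
    leaving the training distribution untouched (no resampling)."""
    scores = [balanced_accuracy_score(y_val, (proba_val >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])

rng = np.random.default_rng(0)
y_val = (rng.random(200) < 0.2).astype(int)                       # ~20% positives (hypothetical)
proba_val = np.clip(0.15 + 0.5 * y_val + rng.normal(0, 0.2, 200), 0, 1)
print(tune_threshold(y_val, proba_val))                           # typically well below 0.5

# Cost-sensitive alternative for XGBoost: weight the positive (minority) class, e.g.
# xgboost.XGBClassifier(scale_pos_weight=n_negative / n_positive)
```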
Recurrent neural networks, in particular the LSTM approach, are specialized for sequence data such as sensor data [36,37] and are popular forecasting algorithms in digital phenotyping [32,33,35,72]. On the other hand, gradient-boosted decision trees have been consistently successful in diagnostic studies [45,51,73,74], but few studies have investigated them for forecasting [48,75]. Our results show that XGBoost models are as capable as LSTM models for forecasting, and even superior under certain conditions. Gradient-boosted decision trees were previously found to be superior to neural networks on tabular datasets with skewed features, uninformative features, or rare classes [76,77], all typical characteristics of digital phenotyping data. Given similar levels of performance, algorithms should be selected according to other practical considerations such as explainability, the amount of required training data, or computational costs.
Increasing the forecast horizon from “same day” to “next week” was associated with a negligible performance degradation, in line with previous findings [31,34]. Given the strong weekly seasonality in behavior and mental health self-reports [31,33], one should proceed with care when extrapolating performance to different days of the week. For instance, the CrossCheck study collected self-reports primarily on Monday, Wednesday, and Friday, limiting our ability to train more robust models. Besides, increasing the input history length up to 3 weeks could improve model performance [31–33,48,68,78], given early warning signs have been observed up to 30 days prior to symptom worsening in mental illness [47,79,80].
Although the levels of forecast performance we achieved are encouraging, predictions are not sufficiently accurate to consider their implementation in the clinical realm. Furthermore, calibrating and evaluating model performance on data from various contemporary smartphone devices with differing sensor quality is a crucial step before any applied use of the predictive models we describe. Future work is thus required to significantly improve the performance of forecast models using digital phenotyping data, whether by exploiting much larger datasets, improving feature engineering, or redefining more optimal clinical targets, among other things. For instance, different mental states can be correlated [68] and modelling techniques able to leverage this information can improve forecasting performance [29]. Multiple-output methods are possible for both tree-based and deep-learning models, but the intersection with ordinal regression methods has yet to be investigated.
Furthermore, it will be key to explain why a given model makes a specific prediction, a critical subject the present study did not explore. Explainable machine learning systems will indeed be necessary for clinicians to be able to decide whether or not to trust a prediction and to comply with regulations in some jurisdictions [81]. A popular technique is to attribute SHapley Additive exPlanations (SHAP) values [82] to determine the contribution of each feature towards a prediction. However, Shapley additive explanations can be misleading since they are detached from prediction certainty, and the poorer model calibration associated with imbalanced data decreases their reliability. The calibrated explanations method [83] produces probabilistic prediction intervals and scores each feature based on its contribution towards the prediction and its uncertainty. Empirical studies showed that predictions and explanations need to be grounded in the users’ task and fit their mental model to be useful [84]. For digital phenotyping, this means explanations would benefit from higher-level features that relate to mental health constructs instead of features closer to raw smartphone data, which may be at odds with improving predictive performance.
Our uniform methodology was applied to train and evaluate 120 models and provide a fair and comprehensive benchmark. As a trade-off, better performance might be achievable on each individual task. For instance, the public version of the CrossCheck dataset was used without further feature engineering or selection. Adding well-crafted features could have benefited XGBoost and LSTM models unevenly since neural networks are more sensitive to uninformative features. On another note, we only investigated group models. Each participant was represented equally in the test set to prevent the number of self-reports from biasing the evaluation. However, participants with more training examples may indirectly bias results since per-person models typically perform better [18,45,51].
The unprecedented volume and richness of data about the individual generated by smartphones opens a unique window onto mental health. Beyond the precise monitoring of psychiatric conditions in everyday life, digital phenotyping data paired with machine learning models makes it possible to forecast future mental states. Accurate prediction of the likely progression of the illness would be key to implementing personalized prevention measures. While research in this field is burgeoning, issues remain to be addressed before such forecast models can be implemented as clinical decision support tools. In this study, we explored modelling approaches that preserve as much clinical information as possible from the collected ordinal data by training ordinal regression models instead of binary classification models to forecast mental states. Importantly, we show that these clinically motivated ordinal approaches do not incur a trade-off in predictive performance. Given that class imbalance is challenging for learning algorithms and is a core property of mental health data, we argue that using binary classification is an unsatisfactory mitigation method and that using evaluation metrics that account for its impact is essential. Finally, we question recurrent neural networks as the de facto superior forecast algorithm and thus encourage a more systematic benchmarking of machine learning algorithms, especially gradient-boosted decision trees, to predict future mental states using digital phenotyping data.
Supporting information
S1 Text. Metric definitions.
Mathematical formulas for the binary classification and ordinal regression metrics. This text includes the procedure for computing the scale-normalized balanced error and MAMAE, along with a proof, and defines the class imbalance measure.
https://doi.org/10.1371/journal.pdig.0000734.s001
(PDF)
S2 Text. Monte Carlo baseline distribution.
Pseudocode to produce baseline distributions used in statistical tests.
https://doi.org/10.1371/journal.pdig.0000734.s002
(PDF)
S1 Fig. Ordinal regression predictions.
Each row is associated with a mental state and each outer column with a forecast horizon. For each of the 30 conditions, heatmaps show XGBoost and LSTM test predictions. Each cell indicates the number of predictions, with the x-axis showing the true class and the y-axis the predicted class. The color indicates the per-class recall value on the diagonal. This figure adds context to Fig 3 and provides raw values so that any metric can be computed.
https://doi.org/10.1371/journal.pdig.0000734.s003
(TIF)
S2 Fig. Binary classification predictions.
Each row is associated with a mental state and each outer column with a forecast horizon. For each of the 30 conditions, heatmaps show XGBoost and LSTM test predictions. Each cell indicates the number of predictions, with the x-axis showing the true class and the y-axis the predicted class. The color scale is normalized per column, indicating the per-class recall value on the diagonal. This figure adds context to Fig 4 and provides raw values so that any metric can be computed.
https://doi.org/10.1371/journal.pdig.0000734.s004
(TIF)
Acknowledgments
The authors acknowledge the work of the CrossCheck research team in completing the original data collection study and creating the public dataset. We are grateful to Hien Nguyen for his comments on a previous draft of our manuscript.
References
- 1. Wright AGC, Woods WC. Personalized models of psychopathology. Annu Rev Clin Psychol. 2020;16:49–74. pmid:32070120
- 2. Nelson B, McGorry PD, Wichers M, Wigman JTW, Hartmann JA. Moving from static to dynamic models of the onset of mental disorder: a review. JAMA Psychiatry. 2017 May 1;74(5):528–34. pmid:28355471
- 3. Burcusa SL, Iacono WG. Risk for recurrence in depression. Clin Psychol Rev. 2007 Dec;27(8):959–85. pmid:17448579
- 4. Koopmans PC, Bültmann U, Roelen CAM, Hoedeman R, van der Klink JJL, Groothoff JW. Recurrence of sickness absence due to common mental disorders. Int Arch Occup Environ Health. 2011 Feb;84(2):193–201. pmid:20449605
- 5. Emsley R, Chiliza B, Asmal L, Harvey BH. The nature of relapse in schizophrenia. BMC Psychiatry. 2013 Dec;13:50. pmid:23394123
- 6. Cohen AS, Fedechko T, Schwartz EK, Le TP, Foltz PW, Bernstein J, et al. Psychiatric risk assessment from the clinician’s perspective: lessons for the future. Community Ment Health J. 2019 Oct;55(7):1165–72. pmid:31154587
- 7. Insel TR. Digital phenotyping: technology for a new science of behavior. JAMA. 2017 Oct 3;318(13):1215–6. pmid:28973224
- 8. Chiauzzi E, Wicks P. Beyond the therapist’s office: merging measurement-based care and digital medicine in the real world. Digit Biomark. 2021 Jul 29;5(2):176–82. pmid:34723070
- 9. Mouchabac S, Conejero I, Lakhlifi C, Msellek I, Malandain L, Adrien V, et al. Improving clinical decision-making in psychiatry: implementation of digital phenotyping could mitigate the influence of patient’s and practitioner’s individual cognitive biases. Dialogues Clin Neurosci. 2022 Jan 1;23(1):52–61. pmid:35860175
- 10. Rogler LH, Mroczek DK, Fellows M, Loftus ST. The neglect of response bias in mental health research. J Nerv Ment Dis. 2001 Mar;189(3):182–7. pmid:11277355
- 11. Onnela J-P, Rauch SL. Harnessing smartphone-based digital phenotyping to enhance behavioral and mental health. Neuropsychopharmacology. 2016 Jun;41(7):1691–6. pmid:26818126
- 12. Torous J, Gershon A, Hays R, Onnela J-P, Baker JT. Digital phenotyping for the busy psychiatrist: clinical implications and relevance. Psychiatric Annals. 2019 May 1;49(5):196–201.
- 13. Goldberg SB, Buck B, Raphaely S, Fortney JC. Measuring psychiatric symptoms remotely: a systematic review of remote measurement-based care. Curr Psychiatry Rep. 2018 Oct;20(10):81. pmid:30155749
- 14. Mohr DC, Zhang M, Schueller SM. Personal sensing: understanding mental health using ubiquitous sensors and machine learning. Annu Rev Clin Psychol. 2017 May 8;13:23–47. pmid:28375728
- 15. Chen ZS, Kulkarni PP, Galatzer-Levy IR, Bigio B, Nasca C, Zhang Y. Modern views of machine learning for precision psychiatry. Patterns (N Y). 2022 Nov;3(11):100602. pmid:36419447
- 16. Hauser TU, Skvortsova V, De Choudhury M, Koutsouleris N. The promise of a model-based psychiatry: building computational models of mental ill health. Lancet Digit Health. 2022 Nov;4(11):e816–28. pmid:36229345
- 17. Soyiri IN, Reidpath DD. An overview of health forecasting. Environ Health Prev Med. 2013 Jan;18(1):1–9. pmid:22949173
- 18. Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, Tørresen J. Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput. 2018 Dec;51:1–26.
- 19. DaSilva AW, Huckins JF, Wang R, Wang W, Wagner DD, Campbell AT. Correlates of stress in the college environment uncovered by the application of penalized generalized estimating equations to mobile sensing data. JMIR Mhealth Uhealth. 2019 Mar 19;7(3):e12084. pmid:30888327
- 20. Henson P, D’Mello R, Vaidyam A, Keshavan M, Torous J. Anomaly detection to predict relapse risk in schizophrenia. Transl Psychiatry. 2021 Jun;11(1):28. pmid:33431818
- 21. Ranjan T, Melcher J, Keshavan M, Smith M, Torous J. Longitudinal symptom changes and association with home time in people with schizophrenia: an observational digital phenotyping study. Schizophr Res. 2022 May;243:64–9. pmid:35245703
- 22. Strauss GP, Raugh IM, Zhang L, Luther L, Chapman HC, Allen DN, et al. Validation of accelerometry as a digital phenotyping measure of negative symptoms in schizophrenia. Schizophrenia (Heidelb). 2022 Apr 15;8(1):37. pmid:35853890
- 23. Depp CA, Bashem J, Moore RC, Holden JL, Mikhael T, Swendsen J, et al. GPS mobility as a digital biomarker of negative symptoms in schizophrenia: a case control study. NPJ Digit Med. 2019 Nov 8;2:108. pmid:31728415
- 24. Faurholt-Jepsen M, Busk J, Rohani DA, Frost M, Tønning ML, Bardram JE, et al. Differences in mobility patterns according to machine learning models in patients with bipolar disorder and patients with unipolar disorder. J Affect Disord. 2022 Jun;306:246–53. pmid:35339568
- 25. Rohani DA, Faurholt-Jepsen M, Kessing LV, Bardram JE. Correlations between objective behavioral features collected from mobile and wearable devices and depressive mood symptoms in patients with affective disorders: systematic review. JMIR Mhealth Uhealth. 2018 Aug 13;6(8):e165. pmid:30104184
- 26. Opoku Asare K, Terhorst Y, Vega J, Peltonen E, Lagerspetz E, Ferreira D. Predicting depression from smartphone behavioral markers using machine learning methods, hyperparameter optimization, and feature importance analysis: exploratory study. JMIR Mhealth Uhealth. 2021 Jul 12;9(7):e26540. pmid:34255713
- 27. Ware S, Yue C, Morillo R, Lu J, Shang C, Bi J, et al. Predicting depressive symptoms using smartphone data. Smart Health. 2020 Mar;15:100093.
- 28. Zulueta J, Piscitello A, Rasic M, Easter R, Babu P, Langenecker SA, et al. Predicting mood disturbance severity with mobile phone keystroke metadata: a biaffect digital phenotyping study. J Med Internet Res. 2018 Jul 20;20(7):e241. pmid:30030209
- 29. Tseng VW-S, Sano A, Ben-Zeev D, Brian R, Campbell AT, Hauser M, et al. Using behavioral rhythms and multi-task learning to predict fine-grained symptoms of schizophrenia. Sci Rep. 2020 Sep 15;10(1):15100. pmid:32934246
- 30. Lee K, Lee TC, Yefimova M, Kumar S, Puga F, Azuero A, et al. Using digital phenotyping to understand health-related outcomes: a scoping review. Int J Med Inform. 2023 Jun;174:105061. pmid:37030145
- 31. Spathis D, Servia-Rodriguez S, Farrahi K, Mascolo C, Rentfrow J. Sequence multi-task learning to forecast mental wellbeing from sparse self-reported data. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage AK USA: ACM; 2019. p. 2886–94. https://doi.org/10.1145/3292500.3330730
- 32. Umematsu T, Sano A, Picard RW. Daytime data and LSTM can forecast tomorrow’s stress, health, and happiness. 2019 41st Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Berlin, Germany: IEEE; 2019. p. 2186–90. https://doi.org/10.1109/EMBC.2019.8856862
- 33. Suhara Y, Xu Y, Pentland AS. Deepmood: forecasting depressed mood based on self-reported histories via recurrent neural networks. Proceedings of the 26th International Conference on World Wide Web. Perth Australia: International World Wide Web Conferences Steering Committee; 2017. p. 715–24. https://doi.org/10.1145/3038912.3052676
- 34. Busk J, Faurholt-Jepsen M, Frost M, Bardram JE, Vedel Kessing L, Winther O. Forecasting mood in bipolar disorder from smartphone self-assessments: hierarchical bayesian approach. JMIR Mhealth Uhealth. 2020 Apr 1;8(4):e15028. pmid:32234702
- 35. Jacobson NC, Bhattacharya S. Digital biomarkers of anxiety disorder symptom changes: personalized deep learning models using smartphone sensors accurately predict anxiety symptoms from ecological momentary assessments. Behav Res Ther. 2022 Feb;149:104013. pmid:35030442
- 36. Durstewitz D, Koppe G, Meyer-Lindenberg A. Deep neural networks in psychiatry. Mol Psychiatry. 2019 Nov;24(11):1583–98. pmid:30770893
- 37. Koppe G, Guloksuz S, Reininghaus U, Durstewitz D. Recurrent neural networks in mobile sampling and intervention. Schizophr Bull. 2019 Mar 7;45(2):272–6. pmid:30496527
- 38. Choudhary S, Thomas N, Alshamrani S, Srinivasan G, Ellenberger J, Nawaz U, et al. A machine learning approach for continuous mining of nonidentifiable smartphone data to create a novel digital biomarker detecting generalized anxiety disorder: prospective cohort study. JMIR Med Inform. 2022 Aug 30;10(8):e38943. pmid:36040777
- 39. Shi X, Cao W, Raschka S. Deep neural networks for rank-consistent ordinal regression based on conditional probabilities. Pattern Anal Applic. 2023 June;26(3):941–55.
- 40. Beaudry NJ, Renner R. An intuitive proof of the data processing inequality. QIC. 2012 May;12(5 & 6):432–41.
- 41. Cover TM, Thomas JA. Elements of information theory. Wiley Interscience. 2006.
- 42. Palmius N, Saunders KEA, Carr O, Geddes JR, Goodwin GM, De Vos M. Group-personalized regression models for predicting mental health scores from objective mobile phone data streams: observational study. J Med Internet Res. 2018 Oct 22;20(10):e10194. pmid:30348626
- 43. Varoquaux G, Colliot O. Evaluating machine learning models and their diagnostic value. In: Colliot O, editor. Machine learning for brain disorders. New York, NY: Springer US; 2023. p. 601–30. (Neuromethods; vol. 197). https://doi.org/10.1007/978-1-0716-3195-9_20
- 44. Ben-Zeev D, Brian R, Wang R, Wang W, Campbell AT, Aung MSH, et al. CrossCheck: integrating self-report, behavioral sensing, and smartphone use to identify digital indicators of psychotic relapse. Psychiatr Rehabil J. 2017 Sep;40(3):266–75. pmid:28368138
- 45. Wang R, Aung MSH, Abdullah S, Brian R, Campbell AT, Choudhury T, et al. CrossCheck: toward passive sensing and detection of mental health changes in people with schizophrenia. Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing. Heidelberg Germany: ACM; 2016. p. 886–97. https://doi.org/10.1145/2971648.2971740
- 46. Ben-Zeev D, Brian R, Campbell A, Scherer E, Hauser M, Kane J. CrossCheck [Internet]. 2020. Available from: https://pbh.tech.cornell.edu/data.html
- 47. Adler DA, Ben-Zeev D, Tseng VW-S, Kane JM, Brian R, Campbell AT, et al. Predicting early warning signs of psychotic relapse from passive sensing data: an approach using encoder-decoder neural networks. JMIR Mhealth Uhealth. 2020 Aug 31;8(8):e19962. pmid:32865506
- 48. Buck B, Scherer E, Brian R, Wang R, Wang W, Campbell A, et al. Relationships between smartphone social behavior and relapse in schizophrenia: a preliminary report. Schizophr Res. 2019 Jun;208:167–72. pmid:30940400
- 49. He-Yueya J, Buck B, Campbell A, Choudhury T, Kane JM, Ben-Zeev D, et al. Assessing the relationship between routine and schizophrenia symptoms with passively sensed measures of behavioral stability. NPJ Schizophr. 2020 Nov 23;6(1):35. pmid:33230099
- 50. Zhou J, Lamichhane B, Ben-Zeev D, Campbell A, Sano A. Predicting psychotic relapse in schizophrenia with mobile sensor data: routine cluster analysis. JMIR Mhealth Uhealth. 2022 Apr 11;10(4):e31006. pmid:35404256
- 51. Wang R, Wang W, Aung MSH, Ben-Zeev D, Brian R, Campbell AT, et al. Predicting symptom trajectories of schizophrenia using mobile sensing. Proc ACM Interact Mob Wearable Ubiquitous Technol. 2017 Sep 11;1(3):1–24.
- 52. Hewamalage H, Ackermann K, Bergmeir C. Forecast evaluation for data scientists: common pitfalls and best practices. Data Min Knowl Discov. 2023 Mar;37(2):788–832. pmid:36504672
- 53. Thiagarajan JJ, Rajan D, Sattigeri P. Understanding behavior of clinical models under domain shifts [Internet]. arXiv; 2019 [cited 2023 Nov 8]. Available from: http://arxiv.org/abs/1809.07806
- 54. Webb GI, Hyde R, Cao H, Nguyen HL, Petitjean F. Characterizing concept drift. Data Min Knowl Disc. 2016 Jul;30(4):964–94.
- 55. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016. p. 785–94.
- 56. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 57. Raymaekers J, Rousseeuw PJ. Transforming variables to central normality. Mach Learn. 2021 Mar;113(8):4953–75.
- 58. Muller A, Guido S. Introduction to machine learning with Python. 1st ed. O’Reilly Media; 2016.
- 59. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016. Available from: https://www.deeplearningbook.org/
- 60. Baccianella S, Esuli A, Sebastiani F. Evaluation measures for ordinal regression. 2009 Ninth International Conference on Intelligent Systems Design and Applications. Pisa, Italy: IEEE; 2009. p. 283–7. https://doi.org/10.1109/ISDA.2009.230
- 61. Brodersen KH, Ong CS, Stephan KE, Buhmann JM. The balanced accuracy and its posterior distribution. 2010 20th International Conference on Pattern Recognition. Istanbul, Turkey: IEEE; 2010. p. 3121–4. https://doi.org/10.1109/ICPR.2010.764
- 62. Rashidi HH, Albahra S, Robertson S, Tran NK, Hu B. Common statistical concepts in the supervised machine learning arena. Front Oncol. 2023 Feb 14;13:1130229. pmid:36845729
- 63. Thölke P, Mantilla-Ramos Y-J, Abdelhedi H, Maschke C, Dehgan A, Harel Y, et al. Class imbalance should not throw you off balance: choosing the right classifiers and performance metrics for brain decoding with imbalanced data. Neuroimage. 2023 Aug;277:120253. pmid:37385392
- 64. Akiba T, Sano S, Yanase T, Ohta T, Koyama M. Optuna: a next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. Anchorage AK USA: ACM; 2019. p. 2623–31. https://doi.org/10.1145/3292500.3330701
- 65. Lightning AI. PyTorch lightning [Internet]. 2019. Available from: https://www.pytorchlightning.ai
- 66. Demsar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30. Available from: http://jmlr.org/papers/v7/demsar06a.html
- 67. Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016 Nov;5(4):221–32.
- 68. Wang W, Mirjafari S, Harari G, Ben-Zeev D, Brian R, Choudhury T, et al. Social sensing: assessing social functioning of patients living with schizophrenia using mobile phone sensing. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Honolulu HI USA: ACM; 2020. p. 1–15. https://doi.org/10.1145/3313831.3376855
- 69. Elor Y, Averbuch-Elor H. To SMOTE, or not to SMOTE? [Internet]. arXiv; 2022 [cited 2023 Sep 26]. Available from: http://arxiv.org/abs/2201.08528
- 70. van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. J Am Med Inform Assoc. 2022 Aug 16;29(9):1525–34. pmid:35686364
- 71. Mienye ID, Sun Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform Med Unlocked. 2021;25:100690.
- 72. Sükei E, Norbury A, Perez-Rodriguez MM, Olmos PM, Artés A. Predicting emotional states using behavioral markers derived from passively sensed data: data-driven machine learning approach. JMIR Mhealth Uhealth. 2021 Mar 22;9(3):e24465. pmid:33749612
- 73. Saeb S, Lattie EG, Kording KP, Mohr DC. Mobile phone detection of semantic location and its relationship to depression and anxiety. JMIR Mhealth Uhealth. 2017 Aug 10;5(8):e112. pmid:28798010
- 74. Sarda A, Munuswamy S, Sarda S, Subramanian V. Using passive smartphone sensing for improved risk stratification of patients with depression and diabetes: cross-sectional observational study. JMIR Mhealth Uhealth. 2019 Jan 29;7(1):e11041. pmid:30694197
- 75. Jacobson NC, Feng B. Digital phenotyping of generalized anxiety disorder: using artificial intelligence to accurately predict symptom severity using wearable sensors in daily life. Transl Psychiatry. 2022 Aug 17;12(1):336. pmid:35977932
- 76. Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on tabular data? Proceedings of the 35th International Conference on Neural Information Processing Systems [Internet]. Curran Associates Inc; 2022. p. 507–20. [cited 2023 Oct 17]. Available from: https://proceedings.neurips.cc/paper_files/paper/2022/hash/0378c7692da36807bdec87ab043cdadc-Abstract-Datasets_and_Benchmarks.html
- 77. McElfresh D, Khandagale S, Valverde J, Prasad C V, Ramakrishnan G, Goldblum M, et al. When do neural nets outperform boosted trees on tabular data? Proceedings of the 36th International Conference on Neural Information Processing Systems [Internet]. Curran Associates Inc; 2023. p. 76336–69. Available from: https://proceedings.neurips.cc/paper_files/paper/2023/hash/f06d5ebd4ff40b40dd97e30cee632123-Abstract-Datasets_and_Benchmarks.html
- 78. Pratap A, Atkins DC, Renn BN, Tanana MJ, Mooney SD, Anguera JA, et al. The accuracy of passive phone sensors in predicting daily mood. Depress Anxiety. 2019 Jan;36(1):72–81. pmid:30129691
- 79. Barnett I, Torous J, Staples P, Sandoval L, Keshavan M, Onnela J-P. Relapse prediction in schizophrenia through digital phenotyping: a pilot study. Neuropsychopharmacology. 2018 Jul;43(8):1660–6. pmid:29511333
- 80. Cohen A, Naslund JA, Chang S, Nagendra S, Bhan A, Rozatkar A, et al. Relapse prediction in schizophrenia with smartphone digital phenotyping during COVID-19: a prospective, three-site, two-country, longitudinal study. Schizophrenia (Heidelb). 2023 Jan 27;9(1):6. pmid:36707524
- 81. Zhao W, Sun X, Wu S, Wu S, Hu C, Huo H, et al. MaGA20ox2f, an OsSD1 homolog, regulates flowering time and fruit yield in banana. Mol Breed. 2025;45(1):12. pmid:39803631
- 82. Lundberg S, Lee SI. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems [Internet]. Curran Associates Inc; 2017. p. 4768–77. [cited 2021 Dec 8]. Available from: https://proceedings.neurips.cc/paper/2017/hash/8a20a8621978632d76c43dfd28b67767-Abstract.html
- 83. Löfström H, Löfström T, Johansson U, Sönströd C. Calibrated explanations: with uncertainty information and counterfactuals. Expert Systems with Applications. 2024 Jan;246:123154.
- 84. Amarasinghe K, Rodolfa KT, Jesus S, Chen V, Balayan V, Saleiro P, et al. On the importance of application-grounded experimental design for evaluating explainable ML methods. Proceedings of the AAAI Conference on Artificial Intelligence [Internet]. AAAI; 2024. p. 20921–9. https://doi.org/10.1609/aaai.v38i19.30082