Questionnaire-free machine-learning method to predict depressive symptoms among community-dwelling older adults

The 15-item Geriatric Depression Scale (GDS-15) is widely used to screen for depressive symptoms among older populations. This study aimed to develop and validate a questionnaire-free, machine-learning model as an alternative triage test for the GDS-15 among community-dwelling older adults. By internal validation, the best models were the random forest (RF) and the deep-insight visible neural network (DI-VNN), but their performances were undifferentiated by external validation. The AUROC of the RF model was 0.619 (95% CI 0.610 to 0.627) for the external validation set with a non-local ethnic group. Our triage test allows healthcare professionals to preliminarily screen for depressive symptoms in older adults without a questionnaire; if the model shows a positive result, the GDS-15 can then be used for follow-up. This preliminary screening can save considerable time and effort for healthcare providers and older adults, especially those who are illiterate.


Introduction
Depressive symptoms in older adults commonly go unidentified and are complicated by concurrent cognitive impairment [1]. To screen for depressive symptoms in older adults, the Geriatric Depression Scale (GDS) is one of the most commonly used questionnaires, and a recent systematic review and meta-analysis found that the 15-item version (GDS-15) is more accurate than the shorter or longer versions [2]. As questionnaire-free variables, demographic and physical health data from routine visits can be used as electronic health record (EHR) indicators to triage patients for mental health follow-up with the GDS-15. This is possible because older adults with depressive symptoms may present with more physical complaints, implying a psychological change that caregivers might overlook [3]. However, the accuracy of such data for a triage test is still unclear.

Intuition
Later-life (aged 60+) depression is associated with several factors whose assessment can draw on routine databases from a subject's first visit to a healthcare facility. Some of these factors are essentially fixed, i.e., age [25], gender [26][27][28], and past employment status (i.e., before 60 years old) [29][30][31]. A few rarely change, i.e., current employment status [31,32], education [33,34], religion [35,36], marital status [37], living status [38][39][40], and lifestyle [41]. However, many can change on a monthly to yearly basis, i.e., health status [41][42][43], morbidities [28], hearing loss [44], and oral health and missing teeth [45]. A prediction model may utilize these factors to provide a triage test for the GDS-15 at any time, while reducing the screening frequency of the GDS-15 by restricting respondents to those who test positive according to the model. Such a test should be part of an EHR system that runs automatically on pre-existing, required information in the EHR.
However, developing this model with a traditional approach alone, i.e., a logistic regression (LR) algorithm, may be insufficient. In addition to LR, we also need other machine-learning algorithms. Machine learning, a field of science concerned with how machines learn from data [46], is not limited to methods based on statistical probability theory and is a part of artificial intelligence, which emulates human intellectual actions [47]. Its use is already pervasive in recognizing objects in images, transcribing speech to text, aligning internet content to user preferences, and selecting relevant search results [48]. Many fields in medicine have used this approach to predict medical outcomes, e.g., oncology [49], cardiology and critical care [50,51], and obstetrics [52]. Machine learning provides a more extensive search space for finding the most accurate model using simple predictors, e.g., routine data in electronic medical records [49,53]. This study aimed to develop and validate, by machine learning, a questionnaire-free model to predict the GDS-15 among community-dwelling older adults.

Study design
This study followed the guidelines for developing and reporting machine-learning predictive models in biomedical research [54] (see S1 Table in S1 File) and the prediction model risk of bias assessment tool (PROBAST) [55] (see S2 Table in S1 File). The PROBAST was developed according to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guidelines [56], but it incorporates recent findings on developing and validating multivariable prediction models, including those built with any machine-learning algorithm [57,58]. For clinicians, we also provided a checklist to assess the suitability of our model for clinical settings [59] (see S3 Table in S1 File). A web application (https://predme.app/pre_gds15) is available as a prototype, but future implementation should incorporate the application into an EHR system for automatic prediction based on pre-existing information. We utilized a dataset from our previous project investigating loneliness and depression in older adults. From June to September 2019, that project collected this dataset using a cross-sectional design from 15 community health centers (CHCs) in Kendari, Indonesia (n = 1381). All patients aged 60 years or older with clear consciousness who visited the CHCs were enrolled. We applied a random sampling technique stratified by the CHCs. Trained assessors who collected the data were blind to the study outcome.

Data source
The dataset consisted of 19 attributes: 17 candidate-predictor variables, one grouping variable, and one outcome variable. The candidate-predictor variables were: 1) age (years); 2) gender (male/female); 3) religious beliefs (Christian/Hindu/Moslem); 4) educational attainment (illiterate/primary/secondary/high school/university/other); 5) marital status (single/married/separated or divorced/widowed); 6) children (number of persons); 7) living status (alone/with a family member but no spouse/with a spouse only/with family member and spouse/other); 8) currently employed (no/yes); 9) previously employed (no/yes); 10) income (in Indonesian rupiah (IDR)); 11) duration of visiting the CHC (number of years of routine visits); 12) comorbidities (number of conditions); 13) health condition (very good/good/fair/poor/very poor); 14) hearing problems (no/yes); 15) visual problems (no/yes); 16) oral status (very good/good/fair/poor/very poor); and 17) medication (number of prescribed drugs). We used ethnicity (Bugis-Makassar/Buton/Muna/Tolaki/non-local ethnicity) as a grouping variable for data partitioning in order to develop and validate our predictive models (see "Model validation"). The outcome variable was depressive symptoms (no/yes), as defined in the next section.

Outcome definition
As the predicted outcome, depressive symptoms were assessed with the GDS-15. Fifteen questions yield a score ranging from 0 to 15; some items give a point if answered positively, while others give a point if answered negatively. If the score exceeds 5, the scale suggests a person has depressive symptoms [60]. Each participant answered the questions of the GDS questionnaire (described in S4 Table in S1 File) with the assistance of a trained assessor, who was blind to the predictor information. Predictor data were demographic data and routine physical health check results collected at the same time as the GDS; other healthcare givers collected these data without knowing the assessment results of depressive symptoms. This blinding avoided outcome leakage, which was also carefully avoided throughout all analytical procedures, as described after each description of the relevant procedures (see S5 Table in S1 File). The event definition for this prediction task was depressive symptoms based on the GDS-15. However, to comply with the sample-size requirement of the model development (see "Predictors"), we treated whichever outcome class (positive or negative) had the smaller sample size as the event. Under-diagnosis causes missed cases of depressive symptoms when screened using the GDS-15, which leads to failure to prevent major depressive disorders. Meanwhile, over-diagnosis increases the frequency of GDS-15 use for each older adult, which may lead to further misclassification because repetitive screening may cause response fatigue and rushing, leading to higher measurement error [61]. Nonetheless, the risk of under-diagnosis outweighs that of over-diagnosis.
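To make the scoring rule above concrete, the following minimal sketch computes a GDS-15 score and its positive/negative call. The set of reverse-keyed item indices is a placeholder for illustration only; the actual keying follows S4 Table in S1 File.

```python
# Hypothetical indices (0-based) of items scored when answered "no";
# the true keying is given in S4 Table in S1 File.
REVERSE_KEYED = {0, 4, 6, 10, 12}

def gds15_score(answers):
    """answers: list of 15 booleans, True = 'yes'."""
    assert len(answers) == 15
    score = 0
    for i, yes in enumerate(answers):
        if i in REVERSE_KEYED:
            score += 0 if yes else 1   # a point for a negative answer
        else:
            score += 1 if yes else 0   # a point for a positive answer
    return score

def depressive_symptoms(answers, cutoff=5):
    # A score exceeding 5 suggests depressive symptoms [60].
    return gds15_score(answers) > cutoff
```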

Data pre-processing
We binarized all categorical predictors into 0 or 1 for "no" or "yes" as to whether a category applied to a participant (Fig 1). All numerical predictors were standardized using the mean and standard deviation (SD) but capped at the 2.5% and 97.5% quantiles as the respective minimum and maximum values; this standardization resulted in a value range of approximately -1.96 to 1.96. We then normalized by shifting the central value (i.e., zero) to 0.5 and scaling the range by half, so the numerical predictors fell within a range of 0 to 1. We used only the mean and SD calculated from the data partitioned for model development; standardization used these values for the numerical predictors in every data partition. This pre-processing procedure is therefore applicable to future data. We checked for missing values in the dataset. The only missing value was in visual problems for one participant (n = 1/1381, 0.072%). This value was missing completely at random, since we obtained this information from routine physical health check data. We imputed the missing value using multiple imputation by chained equations after data transformation, using only data in the same partition. By chance, the record with the missing value fell into the partition used for model development.
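A minimal sketch of this scaling procedure, as we read it, is shown below; the function names are ours, and the constants must be fitted on the model-development partition only and reused for all other partitions and for future data.

```python
import numpy as np

def fit_scaler(train_col):
    """Fit scaling constants on the model-development partition only."""
    mu, sd = float(np.mean(train_col)), float(np.std(train_col))
    z = (np.asarray(train_col) - mu) / sd
    lo, hi = np.quantile(z, [0.025, 0.975])   # approx. -1.96 and 1.96
    return mu, sd, lo, hi

def transform_numeric(col, mu, sd, lo, hi):
    """Standardize, cap at the fitted 2.5%/97.5% quantiles, then shift
    the center (0) to 0.5 and halve the range so values lie in [0, 1]."""
    z = np.clip((np.asarray(col) - mu) / sd, lo, hi)
    half_range = max(abs(lo), abs(hi))        # ~1.96 under normality
    return 0.5 + z / (2 * half_range)
```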

Predictors
We used only data partitioned for model development to conduct predictor extraction, representation, and selection (Fig 1). Among the candidate predictors, binarized predictors were extracted only if they had no perfect-separation problem, in which a predictor exists in only one of the outcome classes. Perfect separation may occur because of sampling error [54]; although it may also occur in populations, including such a predictor may mislead predictive modeling into choosing it as a strong predictor of the outcome. Of 40 predictors after binarization, only 37 were extracted. The excluded predictors were "other" living status, "very poor" oral status, and "Hindu" religion.
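This perfect-separation check can be expressed compactly; the sketch below (with names of our own choosing) keeps a binarized predictor only if it occurs in both outcome classes.

```python
import pandas as pd

def drop_perfectly_separated(X_bin: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Keep a binarized predictor only if it appears (value 1) in both
    outcome classes; a predictor seen in only one class perfectly
    separates the outcome, likely due to sampling error [54]."""
    keep = [c for c in X_bin.columns if y[X_bin[c] == 1].nunique() == 2]
    return X_bin[keep]
```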
We assessed redundant predictors using Pearson's correlation coefficients. Two binarized predictors were highly correlated (r = 0.72): "living with family members without a spouse" and a "widowed" marital status. We decided to retain both because the correlation was near the borderline and apparently due to sampling bias: a widowed older adult does not necessarily live with family members and might live alone. The two predictors were therefore not interchangeable.
To optimize predictive performance, we applied a dimension-reduction technique, principal component (PC) analysis (PCA) (Fig 1). We used only the top 19 PCs by percent variance explained, to comply with the sample-size requirement for predictive modeling in the PROBAST guidelines [58] of 20 events per variable (candidate predictor) (see "Model validation"). A ten-fold cross-validation procedure was applied only to data partitioned for model development. We used average values computed from the ten rotated PC matrices to represent the 37 binarized and numerical predictors as 19 PCs, and we used the same averages from the model-development partition to derive the PCs for model validation. This study's resampled dimension-reduction method was already described elsewhere [62].
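A sketch of this resampled PCA, under our reading of [62], follows. Note that PCA component signs are arbitrary across folds, so this sketch aligns each fold's loadings to the first fold before averaging; that alignment step is our assumption, not a detail stated here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import KFold

def cv_averaged_rotation(X_train, n_pcs=19, n_folds=10, seed=0):
    """Average PCA loading (rotation) matrices across ten folds of the
    model-development data; the result projects any partition onto PCs."""
    rotations = []
    for tr_idx, _ in KFold(n_folds, shuffle=True, random_state=seed).split(X_train):
        R = PCA(n_components=n_pcs).fit(X_train[tr_idx]).components_.T
        if rotations:                                  # fix sign indeterminacy
            R = R * np.sign(np.sum(R * rotations[0], axis=0))
        rotations.append(R)
    return np.mean(rotations, axis=0)                  # (n_features, n_pcs)

# Usage (center with the training mean before projecting):
# X_pcs = (X - X_train.mean(axis=0)) @ cv_averaged_rotation(X_train)
```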
We also used machine-learning algorithms other than logistic regression to develop prediction models (see "Model development"). However, these models require larger sample sizes of >50 events per variable [58]. We therefore used a wrapper method in which we selected PCs with a logistic regression before they became candidate predictors for the machine-learning models (Fig 1), applying the same hyperparameter-tuning strategy as for the LR (see "Model development").
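A simplified sketch of such wrapper-style selection is given below; it uses a cross-validated L1-penalized LR to retain PCs with nonzero coefficients, whereas the actual selection LR was tuned with the random-search strategy described under "Model development".

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

def select_pcs(X_pcs, y):
    """Keep the PCs whose coefficients remain nonzero in a
    cross-validated, L1-penalized logistic regression."""
    lr = LogisticRegressionCV(penalty="l1", solver="saga",
                              cv=10, max_iter=5000).fit(X_pcs, y)
    return np.flatnonzero(lr.coef_.ravel() != 0)   # indices of selected PCs
```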

Model development
Although abundant machine-learning algorithms exist for model development, we compared only a subset of the available algorithms (Fig 1), because comparing more models is more vulnerable to a multiple-testing effect relative to the number of datasets, i.e., the best model may be found simply by chance [63]. To avoid such a comparison, we considered three criteria for choosing algorithms: (1) an algorithm commonly used in clinical prediction studies, i.e., logistic regression [58], which expects a linear predictor-outcome relationship; (2) algorithms that commonly outperformed others (177 algorithms) across 121 datasets [64], which allow a non-linear predictor-outcome relationship; and (3) our proposed neural-network algorithm [65], which pursues moderate predictive performance and deeper interpretability. A sufficient sample size was also considered according to the PROBAST guidelines, since a small sample size is vulnerable to overfitting [58]. The three types of algorithms also covered the lowest and highest sample-size requirements, 20 (i.e., logistic regression) and >200 (i.e., random forest [RF] and neural network) events per variable (EPVs), according to a previous study [66]. That study also identified 50 and >200 EPVs for the decision tree and support vector machine, respectively; we used neither, as they neither commonly outperformed other algorithms nor required a sample size small enough for this study. Although we used algorithms that require >200 EPVs, we evaluated the models using rigorous data splitting, which would identify overfitting by comparing evaluation results between internal and external validation sets; these respectively had the same and different characteristics for a particular circumstance (see "Model validation"), as recommended by the PROBAST guidelines [58].
In addition, we used a random-search method to tune values for the hyperparameters, which were pre-defined before this study was conducted in a pre-registered protocol [65]. The randomness and pre-registration were deliberate, to avoid the research bias of "hypothesizing after the results are known" (HARKing) [67]. In this study, HARKing would be a situation in which a set of hyperparameters for an algorithm, as a hypothesis, is deliberately defined to achieve the only acceptable predictive performance in an external validation set.
We developed four models with different approaches. First, we applied the simplest model, a logistic regression (LR) with a shrinkage method, as recommended by the PROBAST guidelines (Fig 1). Instead of the PCs, this model used the 37 candidate predictors with an elastic-net regression algorithm combining L1- and L2-norm regularization. We chose this regularization method over others to minimize the number of excluded predictors while preventing overfitting [58]. Hyperparameter tuning of this model used a random search of up to 10 configurations of the alpha and lambda values, the L1- and L2-norm regularization factors, respectively. We set the factors to trade off removing against maintaining the number of predictors used for predicting the outcome; thus, we could infer which variables have predictive value under a simple predictive-modeling framework.
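The sketch below shows what such a random search over an elastic-net LR might look like in scikit-learn, which parameterizes the penalty as a mixing ratio (l1_ratio) and an inverse strength (C) rather than alpha and lambda; the sampling distributions are illustrative assumptions, not the pre-registered values.

```python
from scipy.stats import loguniform, uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ParameterSampler

# Up to 10 random (alpha, lambda)-like configurations.
param_space = {"l1_ratio": uniform(0, 1),      # L1 vs. L2 mixing
               "C": loguniform(1e-3, 1e2)}     # inverse overall penalty
configs = list(ParameterSampler(param_space, n_iter=10, random_state=42))

def fit_elastic_net(X, y, cfg):
    return LogisticRegression(penalty="elasticnet", solver="saga",
                              l1_ratio=cfg["l1_ratio"], C=cfg["C"],
                              max_iter=5000).fit(X, y)
```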
The second and third prediction models used RF and gradient boosting machine (GBM) algorithms (Fig 1). Both are state-of-the-art algorithms that have consistently outperformed other algorithms across different outcomes [68]. The RF algorithm randomly selects some predictors to build multiple classification trees in parallel using subsets of samples, whereas the GBM applies a similar algorithm sequentially: a later tree is used to predict the misclassifications of earlier ones. Both are the most frequent competition-winning algorithms for predictions with tabular data among 177 algorithms evaluated on 121 datasets [64]. While this is not outcome-specific, predictive modeling in a competition is independently validated; thus, the predictive performances of RF and GBM are considered reliable and reasonably evaluated. Hyperparameter tuning of these models also used a random search over six configurations of the number of predictors sampled at a time for the RF, and the number of trees, maximum depth of a tree, and shrinkage factor for the GBM. Both models were also configured for a minimum number of samples per node. We defined these hyperparameters in aggregate across the tree-based ensemble learners to pursue a wide range of configurations: for example, we varied the number of predictors sampled at a time for the RF while maintaining the same tree structure, and conversely varied the tree structure for the GBM while maintaining the same number of predictors sampled at a time. The best hyperparameters were selected for each algorithm under a variety of samples per node to account for the effect of sampling error. We therefore expected the hyperparameter search to be well covered while avoiding the pitfall of HARKing.
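A minimal scikit-learn sketch of this tuning setup follows; the specific value grids are our illustrative assumptions, with only the tuned hyperparameter names taken from the description above.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf_search = RandomizedSearchCV(
    RandomForestClassifier(),
    {"max_features": [1, 2, 3, 4, 5, 6],     # predictors sampled at a time
     "min_samples_leaf": [5, 10, 20]},       # minimum samples per node
    n_iter=6, cv=10, scoring="roc_auc")

gbm_search = RandomizedSearchCV(
    GradientBoostingClassifier(),
    {"n_estimators": [100, 300, 500],        # number of trees
     "max_depth": [1, 2, 3],                 # maximum depth of a tree
     "learning_rate": [0.01, 0.05, 0.1],     # shrinkage factor
     "min_samples_leaf": [5, 10, 20]},
    n_iter=6, cv=10, scoring="roc_auc")

# rf_search.fit(X_pcs, y); gbm_search.fit(X_pcs, y)
```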
The last prediction model used the deep-insight visible neural network (DI-VNN) algorithm (Fig 1), a deep-learning model based on a convolutional neural network (CNN). CNNs emerged in recent years because they improve predictive performance for imaging data. The DeepInsight algorithm converts non-image data into image-like data, a multidimensional array arranged meaningfully using a dimension-reduction algorithm over the predictors. The "visible" in VNN means that the network architecture is data-driven, determined by a hierarchical clustering algorithm over the predictors. This approach addresses a criticism of the CNN as a black-box model, i.e., a model that can predict an outcome very well but does not explain which features, and how, produce a particular prediction. Details of the DI-VNN pipeline were previously described elsewhere [65]. We modified this pipeline by applying the procedure over the 37 predictors and 37 PCs, resulting in 18 candidate features for the DI-VNN. These were centered using their average values after quantile-to-quantile normalization over all features among samples. To avoid HARKing, we followed the same hyperparameter-tuning approach, which was pre-registered and thoroughly described elsewhere [65].
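For intuition about the DeepInsight-style conversion, here is a deliberately toy sketch: it embeds the features in two dimensions with t-SNE, snaps each feature to a pixel, and writes each sample's feature values into that grid. The actual DI-VNN pipeline [65] is substantially more involved (e.g., its network architecture follows a feature hierarchy), so this is an illustration, not the implementation.

```python
import numpy as np
from sklearn.manifold import TSNE

def deepinsight_image(X, size=32, seed=0):
    """X: (n_samples, n_features) array -> (n_samples, size, size) images."""
    coords = TSNE(n_components=2, random_state=seed,
                  perplexity=min(30, X.shape[1] - 1)).fit_transform(X.T)
    coords -= coords.min(axis=0)
    coords /= np.maximum(coords.max(axis=0), 1e-12)
    px = (coords * (size - 1)).astype(int)       # pixel position per feature
    imgs = np.zeros((X.shape[0], size, size))
    for j, (r, c) in enumerate(px):
        imgs[:, r, c] = X[:, j]                  # pixel collisions overwrite
    return imgs
```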

Model validation
Data partitioning was conducted to obtain both internal and external validation sets; this gave us a training set and two test sets. We used participants with ethnicities not from Sulawesi Island for the external validation set, since the model is intended for settings not limited to the local ethnicities. Hence, we tested whether a model developed using data with local ethnicities would retain acceptable predictive performance when applied to non-local ethnicities. This validation procedure may demonstrate the model's robustness in predicting outcomes in the general population [58].
We also randomly split the remaining set after excluding the external validation set; this provided another external validation set comprising ~20% of the remaining set. The first to third models applied 10-fold cross-validation for hyperparameter tuning and 30-time bootstrapping for model training with the best hyperparameters; we also applied 10-fold cross-validation to compute the rotated matrix of PCs. Meanwhile, the fourth model applied hold-out validation with an 80:20 ratio for the training and validation sets. To compare this model against the others, we applied 30-time bootstrapping to compute the predictive performance, and we likewise applied 30-time bootstrapping to re-calibrate all models using logistic regression.
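The partitioning scheme might look like the sketch below; the column name "ethnicity" and the local-ethnicity labels follow the grouping variable described under "Data source", while the function and variable names are ours.

```python
import numpy as np
import pandas as pd

LOCAL = ["Bugis-Makassar", "Buton", "Muna", "Tolaki"]

def partition(df: pd.DataFrame, seed: int = 42):
    """Non-local ethnicity -> first external validation set; a random
    ~20% of the remainder -> second external validation set; the rest
    is the training set."""
    ext_nonlocal = df[~df["ethnicity"].isin(LOCAL)]
    rest = df[df["ethnicity"].isin(LOCAL)]
    idx = np.random.default_rng(seed).permutation(len(rest))
    n_ext = int(round(0.2 * len(rest)))
    ext_random = rest.iloc[idx[:n_ext]]
    train = rest.iloc[idx[n_ext:]]
    return train, ext_random, ext_nonlocal
```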

Evaluation metrics
We used the area under the receiver operating characteristic (ROC) curve (AUROC) as the primary evaluation metric, because the AUROC is threshold-agnostic. Before evaluating it, however, we reported the calibration of each model using an LR in which the predicted probability, the model output, was the only covariate. A model was well-calibrated if the 95% CIs of the intercept and slope covered 0 and 1, respectively, with the calibration plot visually aligned with the reference line. Models were evaluated with and without re-calibration (see "Model validation"). We retained all models that met the calibration criteria; the best models were well-calibrated models that significantly outperformed the others according to the AUROC. All metrics are reported with 95% CIs. A model outperformed the others if its interval estimate was greater than the central values of the other models; otherwise, more than one model might be selected. The best model was determined using the internal validation set. It also had to be robust across all external validation sets, with the central value of the AUROC above 0.5, the baseline for determining whether a model's predictive performance was better than random (coin-flip) guessing. Against the same baseline, we also computed the specificity, accuracy, positive predictive value (PPV, or precision), and negative predictive value (NPV) using a threshold set at a sensitivity (recall) of ~90%, i.e., a false-negative rate of ~10%, because the risk of under-diagnosis outweighs that of over-diagnosis. In addition, we explored the best model to identify important features post hoc (see Results).
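A sketch of these evaluation steps is shown below (penalty=None requires scikit-learn >= 1.2); the bootstrap CIs around the intercept and slope, which the calibration criterion actually requires, are omitted for brevity.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

def calibration_intercept_slope(y_true, p_pred):
    """Regress the outcome on the logit of the predicted probability;
    ideal calibration gives an intercept of ~0 and a slope of ~1."""
    p = np.clip(p_pred, 1e-6, 1 - 1e-6)
    logit = np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression(penalty=None).fit(logit, y_true)
    return lr.intercept_[0], lr.coef_[0, 0]

def metrics_at_90_sensitivity(y_true, p_pred):
    fpr, tpr, thr = roc_curve(y_true, p_pred)
    i = int(np.argmin(np.abs(tpr - 0.90)))   # threshold nearest ~90% recall
    return {"auroc": roc_auc_score(y_true, p_pred),
            "threshold": thr[i],
            "sensitivity": tpr[i],
            "specificity": 1 - fpr[i]}
```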

Ontology analysis
Our DI-VNN model explores ontological relationships among the predictors in predicting the outcome. A detailed technical explanation of the DI-VNN algorithm was previously given elsewhere [65]. Briefly, there were three steps: (1) differential analysis for feature pre-selection; (2) structural representation of the features; and (3) CNN model training.
We applied a differential analysis to choose the 18 candidate features for the DI-VNN from among the 37 predictors and 37 PCs (Fig 1). The differential analysis applied quantile-to-quantile normalization, which removed technical inter-variability (i.e., in measuring the predictors) across subjects. Using moderated t-statistics, the differential analysis selected candidate features (a filter method for feature selection), with the null hypothesis that a feature's value does not significantly differ between positives and negatives. Since a predictor could be selected by chance, exposing the analysis to multiple-testing bias, we adjusted the p-values using the Benjamini-Hochberg method and selected a feature if its adjusted p-value, or false discovery rate (FDR), was less than 0.05.
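A simplified sketch of this filter step follows; the analysis above used moderated t-statistics (limma-style), for which ordinary Welch t-tests are substituted here.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def differential_preselect(X, y, fdr=0.05):
    """Test each feature (column of X) between GDS-15 positives and
    negatives, then keep features with a BH-adjusted p-value < fdr."""
    pvals = np.array([ttest_ind(X[y == 1, j], X[y == 0, j],
                                equal_var=False).pvalue
                      for j in range(X.shape[1])])
    keep, p_adj, _, _ = multipletests(pvals, alpha=fdr, method="fdr_bh")
    return np.flatnonzero(keep), p_adj
```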
After pre-selection, the candidate features, without the outcome, were used to construct a structural representation of feature variabilities and inter-relationships. There were two types of structural representation (Fig 1): (1) spatial and (2) hierarchical. We applied the t-distributed stochastic neighbor embedding (t-SNE) algorithm to cluster the selected features spatially in three-dimensional positions, where a closer position means a higher correlation between a pair of features. Meanwhile, we applied a clique-extracted ontology algorithm to cluster the selected features into a hierarchy, in which features are more similar to those within the same ontology than to those in a different ontology. Since these ontologies were hierarchical, after model training we could evaluate which ontology was more predictive: one with fewer features (i.e., a child ontology) or one with more features (i.e., a parent ontology).
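Both representations can be prototyped as below; standard agglomerative clustering on a correlation distance stands in for the clique-extracted ontology algorithm of [65], so this is a rough analogue rather than the actual method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform
from sklearn.manifold import TSNE

def structural_representation(X_feat):
    """Return 3-D t-SNE positions of the features (columns of X_feat)
    and a hierarchical clustering of them by correlation distance."""
    coords3d = TSNE(n_components=3, random_state=0,
                    perplexity=min(30, X_feat.shape[1] - 1)
                    ).fit_transform(X_feat.T)       # spatial representation
    corr_dist = 1 - np.abs(np.corrcoef(X_feat.T))   # closer = more correlated
    np.fill_diagonal(corr_dist, 0.0)
    tree = linkage(squareform(corr_dist, checks=False), method="average")
    return coords3d, tree                           # hierarchical representation
```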
Eventually, we used this representation as the CNN architecture and trained it using a backpropagation algorithm to predict the outcome. In CNN modeling, a maximum value represents closer values in a multidimensional array; in this way, inter-relationships among features, in addition to their values, were taken into account when predicting the outcome. The backpropagation algorithm in CNN modeling also allowed us to identify which features and inter-relationships were weighted more heavily in predicting the outcome; a more extreme weight, either positive or negative, is represented with a higher color intensity when visualizing the internal properties of our DI-VNN model. Therefore, using this ontology analysis, we could evaluate: (1) which sets of features (i.e., ontologies) were more predictive; (2) how these ontologies were connected; (3) which features were important within an ontology; and (4) how these features were related within an ontology.

Most subjects had not obtained a university education, were not separated/divorced, and were religious believers
We developed four diagnostic prediction models using a cross-sectional dataset (n = 1252). These models were externally validated (n = 129) with non-local ethnic groups unobserved in the development sets (Table 1). Model validation may be challenging since estimates of the prevalence of depressive symptoms in the validation set differed from those of the development sets. However, ethnicity may affect the distributions of the predictors and the outcome to some extent, and a prediction model should be robust against such shifts in data distribution (i.e., well-generalized). Therefore, our validation sets, including data with non-local ethnicities, allowed a generalization test that would extend our model's application to new data with ethnicities different from ours.
The prevalence of depressive symptoms in older adults differed among ethnic groups (Table 1). The Tolaki ethnic group had the highest prevalence, while prevalences were similar between the Bugis-Makassar and Buton ethnic groups. Only one local ethnic group, the Muna, was similar to those not from Sulawesi Island in terms of the prevalence estimate of depressive symptoms. The Bugis-Makassar and Tolaki ethnic groups constituted the majority of community-dwelling older adults in our dataset.
We only used the training set to develop the models. This procedure is similar to a prediction model being developed and validated in different studies. Nevertheless, we needed to identify the characteristics of the dataset we used for training the prediction models (Table 1). Future use of our models will likely benefit those with similar characteristics, particularly in the predictors used in the final model. As intended, we developed our models for older adults aged ≥60 years. This population can reasonably be expected to have comorbidities and poorer hearing, oral status, and visual function compared to younger adults. However, for all these categorical variables (excluding comorbidities), most subjects were in a fair health condition, probably because they had routinely visited a CHC for 9 or 10 years on average. Most of the subjects had not obtained a university education. They had two to six children, were mostly not separated/divorced, mostly lived with either a spouse or other family members, and were unemployed. Their incomes were considerably low for this country. Most of the subjects, if not all, were religious believers. We saw similar characteristics between GDS-15 positives and negatives, except for the Tolaki (p = 0.012) and Bugis-Makassar ethnic groups (p = 0.48), the number of comorbidities (p = 0.005), the employment status before 60 years of age (p = 0.002), male gender (p = 0.035), a poor health condition (p = 0.016), living alone (p = 0.011), and a separated/divorced status (p = 0.042). In addition, to deploy our models, we provided a web application (https://predme.app/pre_gds15) using the best models as a prototype before incorporating the application into an EHR system. Religion options were provided for many religions to keep the application inclusive and avoid inequality. We also used the Big Mac index, which is commonly used to convert income to a comparable notion across countries [69].

Table 1. Most subjects had not obtained a university education, were not separated/divorced, and were religious believers. (Table columns: Description; Prevalence (95% CI).)

The well-calibrated models were SPC-GBM with re-calibration and DI-VNN without re-calibration
We binarized the categorical variables of the 17 predictors, resulting in 37 predictors without a perfect-separation problem in the training set. We used these predictors to develop an LR model with regularization. Because we needed to pursue 20 events per variable, only the top 19 PCs were retained for feature selection by the multivariable LR; these PCs accounted for 81.7% of the variance explained (95% CI 81.68% to 81.72%). Furthermore, we used only the seven selected PCs for model development by the RF and GBM algorithms, which allowed us to pursue >50 events per variable; these were the selected-PC (SPC)-RF and SPC-GBM models. Meanwhile, of the 37 predictors and 37 PCs for the DI-VNN, only 18 had an FDR of <0.05 in the differential analysis using the Benjamini-Hochberg correction, which pre-selected the candidate features of the DI-VNN. In the differential analysis, only one variable was used per analysis, ensuring more than 20 events per variable for each analysis; the Benjamini-Hochberg method then corrected for the multiple-testing effect. We compared calibration metrics and plots of these models with and without re-calibration (Fig 2). Only two models were well-calibrated: the SPC-GBM with re-calibration (Fig 2B) and the DI-VNN without re-calibration (Fig 2A). The LR model was visually aligned after re-calibration (Fig 2B), but the 95% CI of its calibration intercept did not cover 0. For the SPC-RF without re-calibration, the 95% CI of the calibration slope did not cover 1, while re-calibrating this model resulted in a dichotomous probability that reduced its clinical utility (Fig 2B). Neither the calibration intercept nor the slope of the SPC-GBM without re-calibration (Fig 2A) covered 0 or 1, respectively. Unlike this model, the DI-VNN with re-calibration (Fig 2B) fulfilled the intercept and slope criteria but not the calibration plot.

The best model was the SPC-GBM but undifferentiated from the DI-VNN in external validation
We only used the training set to determine the best of the two well-calibrated models: the SPC-GBM with re-calibration and the DI-VNN without re-calibration (Table 2). As observed in this study, the RF and GBM algorithms achieved suitable predictive performances by overfitting the training set; for example, the predictive performances of the SPC-GBM were reduced by 42.08% and 37.98%, respectively, for the internal and external validation sets. In addition, a previous study also applied a questionnaire-free method to predict the GDS-15 in older adults living alone using a wearable device, but that model was considerably overfitted because of a small sample size (AUROC 0.96, 95% CI 0.91 to 0.99; n = 47) [24]. Nevertheless, according to all metrics evaluated in this study, the predictive performance of the SPC-GBM was better than random (coin-flip) guessing (e.g., the AUROC point estimate of the SPC-GBM was >0.5).
A low education but literate and living alone was predictive in the SPC-GBM, while living alone with significant life events, religion, and family support were predictive in the DI-VNN

Using both models, we could identify how important the predictors were in predicting the GDS-15. There were seven PCs in the SPC-GBM: latent variables that represented the 37 predictors with different weights. How the weights were inferred is described elsewhere [62]. We visualized the absolute values of these weights for each selected PC (Fig 3); absolute values were used because positive/negative values cannot be straightforwardly interpreted as tending toward events or non-events. By observing the visualization, we could infer the meanings of the latent variables, which were named after the predictors with the higher absolute weights. The most important PC in the SPC-GBM was PC11 (education and living status); in this PC, older adults with a low education but literate and living alone tended to be predictive. The other most important PCs were PC4, PC8, and PC10, which implied religious perceptions, educational perceptions, and current employment status in relation to health. We deliberately did not describe the religions explicitly, to maintain our prediction models' inclusiveness. Education also contributed to PC10, and both PC8 and PC10 had larger weights on oral status. Less important predictors were PC16 (very poor hearing), PC14 (very poor health and others), and PC18 (unknown); the last PC had sporadic, slightly weighted predictors.
The DI-VNN also selected PC4 and PC18 as features (Fig 3). Original predictors were also selected in this model: religion A (F1) or Z (F3), poor (F2) or good (F10) health conditions, living alone (F6) or with family members but without a spouse (F8), a separated/divorced marital status (F7), a previously employed status (F9), medications (F4), and comorbidities (F5). Beyond PC4 and PC18, there was PC5 (health problems), which was related to comorbidities (F5) and medications (F4). PC4 was also reinforced by PC37 (religion), with less involvement of the health aspect. Poor-health medication (PC28) was also selected, with larger weights on the selected predictors, a poor health condition (F2) and medications (F4), and on the deselected ones, education and income. PC27 and PC26 represented the previous employment status (F9), but PC27 also had larger weights on age, hearing problems, and the number of children. The last PC, PC21, had larger weights on several predictors related to family support of health.

Living alone with significant life events was positively predictive in the DI-VNN but the opposite if believing in a religion that attracts family activities
While the PCs in the SPC-GBM were independently interpreted, those in the DI-VNN could be interconnected (Fig 4), including with the predictors of origin. Each ontology in the DI-VNN predicted the outcome, contributing to the optimization of the predictive performance; using the model architecture only up to a given ontology to predict the outcome yielded different AUROCs (Fig 4A). The top three highest AUROCs were those predicted up to the root, ONT:20, and ONT:22. Each ontology was visualized as the array difference between GDS-15 positives and negatives (Fig 4B): the weighted features for the negatives were subtracted from those for the positives, so positive and negative values from this subtraction referred to GDS-15-positive and -negative predictions, respectively. Details of how each ontology prediction was taken into the final prediction and which layers were used for feature visualization are described elsewhere [65].
In the root ontology array, subjects living alone (F6) with comorbidities (F5) and multiple factors (PC18, unknown) tended to be predicted as GDS-15 positives. Tracing through ONT:25 and ONT:22, the PC18 factors were closer to the separated/divorced marital status (F7). Health problems (PC5) in ONT:20 were also closer to the religious perception of health (PC4) in the parent ontology, ONT:24; subjects with PC5 and PC4 tended to be predicted as GDS-15 negatives. Similar predictions were assigned to subjects with family support (PC21) for poor health conditions (F2), as shown by ONT:23. This was related to religion A (F1) in ONT:19, which was connected to ONT:24 through PC21, F2, and F5 (comorbidities). The last feature in ONT:24 had the opposite tendency on the GDS-15 outcome compared with the same feature in the root ontology.

Discussion
In this study, we developed four machine-learning models to predict GDS-15 results among community-dwelling older adults. Experimental results demonstrated the feasibility of a questionnaire-free triage test for the GDS-15 based on routine data from CHCs. The predictive performances were validated using random and non-random data partitioning, while only the training set was used to develop the models. The validation demonstrated model generalization to a non-local ethnic group for the SPC-GBM and DI-VNN models.
From the 37 PCs, we found seven with top absolute weights that contributed to the prediction using the SPC-GBM with re-calibration. In comparison, 10 original predictors and eight PCs contributed to the prediction using the DI-VNN without re-calibration. A web application is provided using both the SPC-GBM and DI-VNN, but the latter model is used for individual exploration of protective and risk factors, because the DI-VNN has a deeper exploration capability; in this study, the AUROC of the DI-VNN was very similar to that of the SPC-GBM. This exploration may reveal insights for precise behavioral interventions: (1) to prevent depressive symptoms from turning into major depressive disorders; or (2) to mitigate further progression of this disorder. Note that our system should also recommend a GDS-15 evaluation automatically; manual input by clinicians would considerably undermine the objective of this predictive system.
The SPC-GBM demonstrated moderate sensitivity and specificity in external validation with either local or non-local ethnicity. Among individuals experiencing depressive symptoms (i.e., positives), an incorrect prediction (i.e., a negative) may leave an individual undiagnosed; hence, a predicted negative should be confirmed by the DI-VNN, which demonstrated high sensitivity in external validation. Conversely, among individuals without depressive symptoms, an incorrect prediction (i.e., a positive) may cause over-diagnosis; however, false positives will be screened with the GDS-15 rather than receiving a definitive diagnosis. Nonetheless, against the baseline value, the predictive performance of the SPC-GBM was better than random (coin-flip) guessing for all metrics evaluated in this study.
The second most important predictor was the religious perception of health in PC4, with large weights on a fair oral health status, a fair health condition, and religious beliefs. Depressive symptoms were associated with religiosity based on these PCs. This finding is in line with previous studies, which showed that the severity of depression increased with higher numbers of missing and decayed teeth and with oral dryness [45]. In addition, religious beliefs were among the important variables in our prediction models; faithful religious believers have lower levels of depression than non-believers [36].
The educational perception of health (PC8), consisting of a very good oral health status, having a primary education and being illiterate, having poor/very good health, and the duration of routine visits to the CHC, was also an important predictor of depressive symptoms. The frequency of regular visits to the CHC in this study might have promoted good health in these older adults; CHCs are the first-line promoters of community health, and this seems protective against depression. However, depression was also associated with the length of stay, outpatient and inpatient costs, and increasing use of any healthcare facility, including outpatient visits [43].
Another important variable was PC10, which consisted of a poor oral health status, very good health conditions, current employment, and higher education. Unemployed individuals and individuals who moved from permanent to precarious employment had an increased risk of clinically relevant depression [32,71]. Nevertheless, among older people who work, depression can also cause job loss [30]; therefore, the predictive value of this latent variable may be either a cause or an effect of depressive symptoms. Very poor hearing in PC16 was also important for predicting depressive symptoms. It is reasonable that hearing problems and very poor health conditions would increase the risk of depressive symptoms: hearing loss is the third most frequent chronic health problem among older adults and can affect health conditions [72,73]. The low health conditions in PC14 overlapped with those in PC5 and with the health problems of comorbidities and medications (PC28). Lastly, we found other important variables in the DI-VNN model: 1) family support of health (PC21), with predictors of oral status and visual problems, health conditions and income, education, employment, and gender; 2) living (F6, F8) and marital status (F7); and 3) comorbidities (F5) and medications (F4). Income was also a determinant of depression in outpatient hospital care in Indonesia [34].
In conclusion, the best prediction models were the SPC-GBM and DI-VNN. One can use these models in our web application to screen for depressive symptoms alongside the GDS-15 at any time: only if deemed positive according to our models is an older adult then asked to answer the GDS-15 questions. This workflow allows more frequent screening and may help detect depressive symptoms at any time. Since later-life depression often causes multiple physical symptoms, we would expect reduced unnecessary costs for related diagnostic procedures and interventions. However, future studies are needed to confirm the impacts of our models in improving both the detection of and early intervention for older adults with depression.

Limitations of the study
This study has several limitations. An older adult who is an atheist or who believes in a religion beyond those in our dataset might not be well predicted. The Big Mac index frames income in terms of a primary need, food, whereas income-related problems that contribute to depression may manifest in other ways. Use of our prediction models is warranted in populations with characteristics similar to those of our training set; the predictive performance may differ for older adults who are highly educated, single, previously employed, currently employed, or without religious beliefs. Characteristics more similar to our target population would lead to more optimal predictive performance.
Although the SPC-GBM with re-calibration had the best performance in the internal validation set among the well-calibrated models, its performance was undifferentiated from that of the DI-VNN without re-calibration in the external validation set with local ethnicity. Nonetheless, we used only the internal validation set to choose the best model, because choosing the best model by the external validation set might introduce optimistic bias or overfitting; instead, the external validation sets were used as a robustness test of the prediction models' performance [58]. Eventually, despite the models' reliability demonstrated in this paper using external validation, one should still not assume generalizability to other populations with different characteristics; external validation is still required for such populations. Yet, this is a general issue in prediction studies, not limited to ours.

Inclusion and diversity
We worked to ensure gender balance in the recruitment of human subjects. We worked to ensure ethnic or other types of diversity in the recruitment of human subjects. We worked to ensure that the study questionnaires were prepared in an inclusive way. One or more of the authors of this paper self-identifies as an underrepresented ethnic minority in science. The author list of this paper includes contributors from the location where the research was conducted who participated in the data collection, design, analysis, and/or interpretation of the work.
Supporting information

S1 File. Checklists and questionnaire. This file consists of: (1) S1 Table. Guidelines for developing and reporting machine-learning predictive models in biomedical research; (2) S2 Table. Prediction model risk of bias assessment tool (PROBAST); (3) S3 Table. Clinical checklist for assessing the suitability of machine-learning applications in healthcare; and (4) S4 Table. The GDS questionnaire.