Dementia risks identified by vocal features via telephone conversations: A novel machine learning prediction model

Due to the difficulty of early diagnosis of Alzheimer's disease (AD) related to cost and differentiating capability, it is necessary to identify low-cost, accessible, and reliable tools for identifying AD risk in the preclinical stage. We hypothesized that cognitive ability, as expressed in the vocal features of daily conversation, is associated with AD progression. Thus, we developed a novel machine learning prediction model to identify AD risk using the rich voice data collected from daily conversations, and evaluated its predictive performance in comparison with a classification method based on the Japanese version of the Telephone Interview for Cognitive Status (TICS-J). We used 1,465 audio data files from 99 healthy controls (HC) and 151 audio data files from 24 AD patients, derived from a dementia prevention program conducted by Hachioji City, Tokyo, between March and May 2020. After extracting vocal features from each audio file, we developed machine-learning models based on extreme gradient boosting (XGBoost), random forest (RF), and logistic regression (LR), using each audio file as one observation. We evaluated the predictive performance of the developed models by describing the receiver operating characteristic (ROC) curve and calculating the areas under the curve (AUCs), sensitivity, and specificity. Further, we conducted classifications by considering each participant as one observation, computing the average of their audio files' predictive values, and making comparisons with the predictive performance of the TICS-J based questionnaire. Of 1,616 audio files in total, 1,308 (80.9%) were randomly allocated to the training data and 308 (19.1%) to the validation data. For audio file-based prediction, the AUCs for XGBoost, RF, and LR were 0.863 (95% confidence interval [CI]: 0.794–0.931), 0.882 (95% CI: 0.840–0.924), and 0.893 (95% CI: 0.832–0.954), respectively.
For participant-based prediction, the AUCs for XGBoost, RF, LR, and TICS-J were 1.000 (95% CI: 1.000–1.000), 1.000 (95% CI: 1.000–1.000), 0.972 (95% CI: 0.918–1.000), and 0.917 (95% CI: 0.918–1.000), respectively. The difference in predictive accuracy between XGBoost and TICS-J approached significance (p = 0.065). Our novel prediction model using the vocal features of daily conversations demonstrated the potential to be useful for AD risk assessment.


Introduction
Identifying individuals at risk for Alzheimer's disease (AD) in the prodromal phase might lead to early detection and alleviation of the burden of AD among patients and caregivers [1][2][3][4][5]. Due to the difficulty of early diagnosis of AD related to cost and differentiating capability [6][7][8], it is necessary to identify low-cost, accessible, and reliable tools for identifying AD risk in the preclinical stage. Lately, an increasing amount of research has accumulated evidence of the greater accuracy and efficiency of prediction models using machine-learning algorithms such as random forest (RF) and extreme gradient boosting (XGBoost) compared to conventional schemes in medical classification problems [9,10]. Indeed, recent studies have shown a number of successful applications of machine learning approaches to large-scale data for predicting disease, including AD, diabetes, metabolic syndrome, suicide, opioid overdose, and drug-resistant epilepsy, among others [11][12][13][14][15][16]. However, for AD risk prediction, a machine learning model developed in a previous study used large administrative health data (e.g., sociodemographic information, health profiles, and history of personal and family illness) and showed an area under the curve (AUC) of 0.775, indicating much room for improvement.
The neurophysiology of AD provides a perspective for further improving AD risk prediction. AD patients exhibit deficits in specific cognitive constructs: neurophysiological changes following the progression of AD (e.g., presence of amyloid plaques, neurofibrillary tangles, and diffuse degeneration and atrophy of various parts of the cortex) can lead to changes in sensory perception and motor symptoms, resulting in impairment of spontaneous speech [17][18][19]. A stream of evidence has shown that AD patients are more likely to speak more slowly, with longer pauses, and to spend more time finding the correct word, resulting in broken messages and a lack of speech fluency [20][21][22]. These findings indicate the possibility of developing more accurate prediction models that use vocal features to identify AD risk [23,24]. However, evidence on AD prediction using vocal features remains scarce.
The purpose of the present study is to 1) develop a novel machine learning prediction model to identify AD risk using only vocal features collected from daily conversations via telephone, and 2) evaluate the predictive performance of the model by comparing results of multiple machine learning algorithms with a conventional cognitive test. We believe that if the developed model using daily conversation voice data can accurately predict AD risk, it will have a significant impact on early detection and diagnosis among the general older adult population, as it can guide those who are in the earliest stages of AD toward care-seeking behavior.

Study design
The present study is a retrospective analysis of voice data and conventional cognitive test data among individuals ages 65 and older who participated in a program aimed at dementia prevention in a Japanese local city. Using these data, we developed prediction models and compared predictive accuracy with that of a conventional cognitive test. The present study was approved by the Institutional Review Board of Kyoto University (examination number: R2721). This paper adhered to the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis statement (TRIPOD), which was proposed for the reporting of predictive models [25].

The contents of the dementia prevention program
The dementia prevention program conducted by the city consisted of a telephone conversation with AI that covered the following contents: 1) assessment of cognitive function based on the Japanese version of the Telephone Interview for Cognitive Status (TICS-J) [26] (day 1 only); 2) asking participants to talk about daily life for one minute using questions such as "What did you do yesterday?"; 3) recommendations on healthy behavior including healthy diet, physical activity, and social participation based on topic recognition by AI analysis of participants' voice patterns. The AI built into the program for voice recognition was developed by Softfront Japan (Tokyo, Japan) and McCann Health Japan (Tokyo, Japan). The program included 1-2 months of weekday telephone conversations. This telephone conversation program was adopted by Hachioji City because a service via telephone is highly accessible for every resident and does not require preparation of any additional devices. All that participants needed in advance was a registered telephone number and their name.

Study protocol
We used data obtained from the telephone conversation program conducted between March and May 2020. The data we received came from HC participants and AD patients who had at least one valid audio file, numbering 99 and 24, respectively. For all participants we extracted 1) the results of the assessment of cognitive function with a questionnaire based on TICS-J, 2) voice data, especially the 1-minute talk portion, and 3) a binary variable indicating whether they were HC (0) or AD (1). Whereas we obtained one TICS-J based questionnaire result for each participant, we obtained multiple recordings of the 1-minute talk for each participant, because the program consisted of 1-2 months of daily weekday telephone conversations, with an average of 13.1 recording files per participant (standard deviation: 7.6). The data-processing steps, as well as other processes in the present study, are shown in Fig 1.

Dataset creation and definitions
AD data (outcome). Our study consisted of 99 HC participants and 24 AD patients. As stated above, HC participants and AD patients were recruited in different ways: whereas HC participants were recruited via mail, AD patients were recruited in person given the difficulty of explaining the program. AD patients had previously been diagnosed using the National Institute on Aging-Alzheimer's Association (NIA-AA) criteria [27] and/or the Diagnostic and Statistical Manual of Mental Disorders, 5th ed. (DSM-5) [28] before the program. We had to exclude patients with severe AD from recruitment, as they could not participate in the telephone conversation program due to limitations in cognitive capacity. Thus, those included in the telephone conversation program may represent patients with mild/moderate AD or mild cognitive impairment (MCI). We coded 1 if a participant was an AD patient and 0 otherwise, and used this binary variable as the outcome for prediction.
Vocal feature extraction. The voice data used for model prediction were collected through the telephone conversation program: each participant was asked to have a structured conversation with an AI computer program. The conversation consisted of a greeting; a task that asked the participant to describe what he or she did yesterday, with as much detail as possible, in one minute; and a closing with a health behavior recommendation and scheduling of the next call. The participant's response to the task was the only part of the conversation recorded and used for analysis. We used this one-minute task because, in many validated dementia screening questionnaires such as the MMSE, memory and the ability to express one's thoughts are crucial elements with high discriminating ability [29].
After recording, vocal features were extracted using the open-source software tool PRAAT [30]. PRAAT is widely used for phonetic analysis worldwide, and it enabled us to extract various vocal features from the recorded speech. For each recorded voice we extracted 1) the start and end times of all sounding and silent intervals, 2) intensity every 0.01 seconds, 3) pitch every 0.02 seconds, and 4) the center of gravity, skewness, kurtosis, and standard deviation of the spectrum. All four feature sets were written into four separate txt files by running a PRAAT script. Then, Python scripts were developed to read all txt files and generate the variables used for model prediction. Based on previous studies, we made some modifications and ultimately generated 60 vocal variables [24,31]. In this process, intensity and pitch were further used to generate "derivatives", i.e. the change in intensity or pitch per time interval, computed by subtracting the intensity or pitch at the previous time point from that at the present time point. For intensity and pitch, as well as their "derivatives", we generated the following variables: mean, median, minimum, maximum, 15th percentile, 85th percentile, standard deviation, skewness, and kurtosis. For example, the median value for the "derivatives" of pitch is the median of the person's changes in pitch across the recording (altogether 4 × 9 = 36 variables). For sounding and silent intervals, in addition to the above variables, we added the total length of each type of interval (altogether 2 × 10 = 20 variables). For the spectrum, we computed the center of gravity, skewness, kurtosis, and standard deviation as another four variables. All the vocal features created by this process are shown later (Tables 2 and S2). We finally obtained 1,465 and 151 audio files for HC and AD patients, with averages of 15.8±5.9 and 5.0±6.2 files per participant, respectively.
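As an illustration of this process, the nine summary statistics and the "derivative" series described above can be sketched in plain Python (the pitch contour below is hypothetical; the real pipeline reads the txt files written by PRAAT):

```python
import statistics as st

def percentile(xs, p):
    """Linearly interpolated percentile (0 <= p <= 1) of a sorted copy of xs."""
    s = sorted(xs)
    k = (len(s) - 1) * p
    f = int(k)
    c = min(f + 1, len(s) - 1)
    return s[f] + (s[c] - s[f]) * (k - f)

def moments(xs):
    """Population skewness and kurtosis, used as shape descriptors."""
    m, sd, n = st.mean(xs), st.pstdev(xs), len(xs)
    skew = sum((x - m) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (n * sd ** 4)
    return skew, kurt

def summarize(series):
    """The nine summary statistics generated per series in the paper."""
    skew, kurt = moments(series)
    return {"mean": st.mean(series), "median": st.median(series),
            "min": min(series), "max": max(series),
            "p15": percentile(series, 0.15), "p85": percentile(series, 0.85),
            "sd": st.stdev(series), "skewness": skew, "kurtosis": kurt}

def derivative(series):
    """First difference: change in pitch/intensity between consecutive frames."""
    return [b - a for a, b in zip(series, series[1:])]

# A pitch contour sampled every 0.02 s (hypothetical values, in Hz)
pitch = [180.0, 182.5, 181.0, 179.5, 185.0, 190.0, 188.0]
features = summarize(pitch) | {"d_" + k: v
                               for k, v in summarize(derivative(pitch)).items()}
```

Applying the same nine statistics to the raw series and its derivative yields 18 variables per signal, matching the 4 × 9 = 36 count for intensity and pitch.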

Model generation
We developed three machine-learning prediction models, applying extreme gradient boosting (XGBoost) [32], random forest (RF) [33], and logistic regression (LR) [34]. We computed these models using the "caret" package in R (Version 4.0.2) [35]. These algorithms may be basic, but they are well accepted for predictive tasks regardless of the field. All models were trained and tested on a randomly partitioned 80/20 percentage split of the dataset. We conducted a cluster-randomized partition so that audio files of the same participant were not included in both the training data and the validation data. The models for 'audio-based prediction' were developed using each audio file as one observation (HC: n = 1,465, AD: n = 151). For 'participant-based prediction', we averaged each audio file's predictive value for every participant (HC: 99, AD: 24). Further, we also developed a TICS-J based questionnaire model, a validated classification method using a cognitive test, for 'participant-based prediction'. The difference between 'audio-based prediction' and 'participant-based prediction' is illustrated in Fig 2.

Extreme gradient boosting model. In short, XGBoost is an ensemble of classification and regression trees (CART) [36]. Each classification/regression tree is trained based on an ensemble of previously trained trees in order to improve predictive accuracy through minimization of a loss function: in other words, the algorithm's boosting computation is built on a number of weak classifiers. As each CART assigns a real score to each leaf (outcome), the predictive scores across CARTs are summed to calculate the final score, assessed through additive functions. XGBoost has been widely accepted as one of the models with the most impressive predictive accuracy [37,38].
Random forest model. RF is an ensemble-based method that uses multiple decision trees like XGBoost, but it is different in that RF computes predictive scores by averaging the vote for each tree, iterating over all trees in the ensemble [33]. Each tree is developed from a random subset of the dataset through a bagging method. As each tree tends to overfit in a different way, random decision forests can correct for this overfitting by voting. RF is frequently used in research and business settings as it requires few configurations and generates reasonable predictions for a wide range of data.
Logistic regression model. LR is a commonly used statistical method for a variety of classification tasks [34]. LR employs a logistic function to model a binary outcome represented by '0' or '1'. The model assumes that the log-odds of the outcome coded '1' is a linear combination of the independent variables. Thus, LR is an extension of the linear regression model for classification. LR has advantages in that it is easy to compute and interpret and efficient to train, whereas its disadvantages include the assumption of linearity between the outcome and the independent variables.
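The cluster-randomized 80/20 partition described under Model generation, which keeps all audio files of one participant in the same split, can be sketched as follows (participant IDs and file names are hypothetical; the actual analysis was performed in R with the "caret" package):

```python
import random

def group_split(files, test_frac=0.2, seed=42):
    """Partition audio files so that no participant appears in both splits.

    `files` is a list of (participant_id, filename) tuples; the randomization
    is over participant IDs (the 'clusters'), not over individual files.
    """
    ids = sorted({pid for pid, _ in files})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_frac))
    test_ids = set(ids[:n_test])
    train = [f for f in files if f[0] not in test_ids]
    test = [f for f in files if f[0] in test_ids]
    return train, test

# Hypothetical participants, each with one or more audio files
files = [("p1", "a.wav"), ("p1", "b.wav"), ("p2", "c.wav"),
         ("p3", "d.wav"), ("p3", "e.wav")]
train, test = group_split(files)
```

Splitting on participants rather than files is what prevents recordings of the same voice from leaking between training and validation data.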
TICS-J based questionnaire model. In addition to the machine learning models, we developed a scoring model using the TICS-J based questionnaire, assessing participants' cognitive function through telephone interviews [26]. TICS-J is the Japanese version of TICS, an 11-item screening test developed for assessing cognitive function in AD patients who are unwilling or unable to be examined in person [39]. TICS has been widely accepted for measuring cognitive function and performance, and was significantly correlated with the Mini-Mental State Examination (MMSE) score (r = 0.86, p<0.001) [39]. TICS-J also showed high performance in differentiating AD patients from HC, with a sensitivity of 98.0% and specificity of 90.7%, and was also significantly correlated with the MMSE score (r = 0.86, p<0.001) [26]. We adopted the cognitive function test via telephone interview based on TICS-J (S1 Table). As the original version of TICS-J is supposed to be conducted by a human operator, the setting of the cognitive test differed in our program: the telephone conversation was conducted between the AI computer program and participants, leading to some changes in the questionnaire that took into account the AI's limitations in voice recognition and communication. For instance, question 6, "One hundred minus 7 equals what?", should be stopped after 5 serial subtractions in the original version, whereas we stopped after 2 serial subtractions. Also, question 7, "What do people usually use to cut paper?" and "How many things are in a dozen?", should be followed by the subsequent two questions "What do you call the prickly green plant that lives in the desert?" and "What is tofu made from?" in the original version, yielding 4 points in total.
We needed to cut parts of questions 6 and 7 in order to shorten the entire interview without impairing the whole questionnaire, because we found that many participants could not last through long interviews with the AI and hung up before completion. The total score for the questionnaire was calculated by a human data administrator using the recording for each participant. The score ranged from 0 to 36 and was used as a continuous measure of cognitive ability. Those who scored below 25 were classified as AD, according to the TICS-J threshold.
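The TICS-J based classification rule reduces to a simple threshold on the questionnaire score, which a minimal sketch makes explicit:

```python
def classify_tics_j(score, threshold=25):
    """Classify a TICS-J based questionnaire score: below the cut-off -> AD (1).

    The modified questionnaire used here scores 0-36; the threshold of 25
    follows the TICS-J cut-off described in the text.
    """
    if not 0 <= score <= 36:
        raise ValueError("score must be between 0 and 36")
    return 1 if score < threshold else 0
```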

Tuning of parameters
We needed to consider the fine-tuning of several parameters when adopting XGBoost, RF, and LR. The parameters for our prediction models were set through a grid search, a method that optimizes parameters by evaluating combinations of candidate values. For each grid search process, we trained 10 different models with 90% of the training data and tested them with the remaining 10%. The results of the grid search for the prediction models are shown in Table 1.
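A minimal sketch of this grid search procedure, with ten repeated 90/10 holdouts per parameter combination, looks as follows (the parameter names and the scoring stand-in are hypothetical; the actual tuning used R's "caret"):

```python
import itertools
import random

def grid_search(train_data, param_grid, fit_and_score, n_repeats=10, seed=0):
    """Exhaustive grid search over parameter combinations.

    For each combination, train on a random 90% of the training data and
    score on the held-out 10%, repeated n_repeats times; return the
    combination with the best mean score. `fit_and_score(train, test, params)`
    is a caller-supplied stand-in for the real model fit and validation metric.
    """
    rng = random.Random(seed)
    keys = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        scores = []
        for _ in range(n_repeats):
            data = train_data[:]
            rng.shuffle(data)
            cut = int(len(data) * 0.9)
            scores.append(fit_and_score(data[:cut], data[cut:], params))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Toy example: the score simply prefers a hypothetical learning rate of 0.1
grid = {"eta": [0.01, 0.1, 0.3], "max_depth": [3, 6]}
best, score = grid_search(list(range(50)), grid,
                          lambda tr, te, p: -abs(p["eta"] - 0.1))
```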
In the end, we developed the models for prediction using these parameters.

Model comparison
We carried out two types of model comparison: one based on 'audio-based prediction', and the other on 'participant-based prediction'.
Audio-based prediction. First, we made a comparison between the machine-learning models (XGBoost, RF, and LR), using each audio file as one observation. We evaluated the predictive performance of the developed models by describing the receiver operating characteristic (ROC) curve and calculating the areas under the curve (AUCs), sensitivity, and specificity. Subsequently, we compared the predictive performance of the developed models using the chi-squared test proposed by DeLong [40]. We determined the threshold for each model using the Youden index, which maximizes sensitivity + specificity − 1 [41].
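The Youden index threshold selection can be sketched as follows (the predicted probabilities and labels below are hypothetical):

```python
def youden_threshold(scores, labels):
    """Pick the cut-off that maximizes sensitivity + specificity - 1.

    `scores` are predicted probabilities, `labels` the true classes (1 = AD).
    Each observed score is tried as a candidate threshold (predict 1 if >= t).
    """
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = None, float("-inf")
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        j = tp / pos + tn / neg - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j
```

In practice this selection would be made on the ROC curve computed from the validation data, as described in the text.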
Participant-based prediction. Subsequently, we made a comparison between the machine-learning models (XGBoost, RF, and LR) and the TICS-J based questionnaire, using each participant as one observation. As stated above, we obtained 1,465 and 151 audio files for HC and AD patients, with averages of 15.8±5.9 and 5.0±6.2 files per participant, respectively. By regarding each audio file as one observation, our development of the prediction models made the tacit assumption that each audio file is independent in terms of vocal characteristics, which is not actually the case. Although we made sure that audio files of the same participant would not be included in both the training data and the validation data, this raised a potential problem. Thus, as a further evaluation of our prediction models, we conducted an additional analysis to measure predictive accuracy for each participant, not for each audio file. Because the limited sample size meant we could not develop a model using each participant as an observation, we instead conducted a participant-based prediction by computing the average of the predictive values among each participant's multiple audio files in the validation data. This concept is illustrated in Fig 2. The metrics used for comparison were the same as for the audio-based prediction. All of the analyses were conducted in R (Version 4.0.2) [42].
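A minimal sketch of this participant-level averaging (participant IDs and predicted probabilities are hypothetical):

```python
from collections import defaultdict

def participant_scores(audio_predictions):
    """Average audio-level predicted probabilities for every participant.

    `audio_predictions` is a list of (participant_id, probability) pairs,
    one per audio file; the result is a single score per participant, which
    is then thresholded exactly like an audio-level prediction.
    """
    by_pid = defaultdict(list)
    for pid, p in audio_predictions:
        by_pid[pid].append(p)
    return {pid: sum(ps) / len(ps) for pid, ps in by_pid.items()}

scores = participant_scores([("p1", 0.2), ("p1", 0.4), ("p2", 0.9)])
```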

Descriptive characteristics of participants
Our final participants consisted of 99 HC and 24 AD patients, yielding 1,465 and 151 audio files for each group, respectively. Of 1,616 audio files in total, 1,308 (80.9%) were randomly allocated to the training data and 308 (19.1%) to the validation data. Among those, 123 (9.4%) of the training data and 28 (9.1%) of the validation data were audio files of AD patients (S2 Table). On a participant basis, 99 (80.5%) were allocated to the training data and 24 (19.5%) to the validation data. The mean age±SD for the training data and validation data was 74.6±6.6 and 76.7±7.5, respectively. The proportion of females was 57.0% in the training data and 54.1% in the validation data. The AD patients in the training data and validation data numbered 24 (24.2%) and 6 (25.0%), respectively (Table 2).

The comparison results of audio-based prediction. The predictive performance of the machine-learning models built for each audio file is presented in Table 3, and the ROC curves for each model are shown in Fig 3. The AUCs for XGBoost, RF, and LR were 0.863 (95% confidence interval [CI]: 0.794–0.931), 0.882 (95% CI: 0.840–0.924), and 0.893 (95% CI: 0.832–0.954), respectively. The LR model achieved the best AUC, but there were no significant differences between the performances of the models.

Discussion
The machine learning models we developed, which were built for each audio file, did well at classifying the audio files of AD patients and HC. Further, when the average of the predicted values of each audio file was summarized for each participant, the XGBoost model demonstrated performance comparable to the cognitive test, with the difference approaching significance.
Our findings are in line with previous studies. There is a growing consensus that language deficits can be part of the clinical manifestation of AD and MCI, and it has been suggested that assessment of language production might represent a unique opportunity for early detection of AD [43]. Several preceding works have demonstrated the performance of prediction models that differentiate AD from HC using acoustic and language features [44,45]. Our results further support this line of evidence. Moreover, our novel prediction model is significant in the sense that it showed strong performance even though it was developed solely from vocal features: previous studies tended to use other features, such as demographic information, in addition to vocal features to achieve high predictive accuracy [45]. Another strength of our study is that our vocal features were drawn from daily conversations, not from neuropsychological tests in a clinical setting. Our success in predicting AD using only vocal features from daily conversation indicates the possibility of developing a pre-screening tool for AD among the general population that is more accessible and lower-cost. Our prediction models averaging the predictive value of each audio file for each participant showed even stronger performance than those built for each audio file. Although we need to interpret this result with caution, it suggests the potential for more robust prediction of AD by obtaining multiple audio files of daily conversations for each participant. Nevertheless, we are currently not sure whether this method of compressing predictive values by the arithmetic mean is appropriate for predicting AD risk in datasets other than those we have already obtained.
Although this idea of averaging the multiple predictive values of weak learners is widely accepted as part of machine-learning algorithms such as random forest [33], further study is required to validate our models and determine whether they predict AD risk well for completely new subjects. The findings of our study can create the opportunity to build new tools that identify AD risk using only vocal features obtained from daily conversations via telephone, as a pre-screening method among the general population. This might enable and drive early detection and diagnosis of dementia, including AD, in the sense that the tool can be used not only by healthcare professionals in a clinical setting but also by the general population at home. As internet and mobile technology further improves, our prediction model could also be easily installed on a variety of user interfaces, such as websites, mobile apps, or the Internet of Things (IoT). Indeed, several recent studies have assessed cognitive health with remarkable accuracy based on machine learning algorithms using data from smart homes or smartphones [46,47]. Given that many individuals who meet the criteria for dementia are estimated to be undiagnosed [4], providing the opportunity to assess AD risk would encourage care-seeking behavior and subsequent early detection among those unaware of their condition.

Limitations
There are several limitations to our study. First, the outcome variable we used was binary (AD or HC), ignoring heterogeneity among AD patients. For example, speech characteristics may differ between advanced AD patients and MCI patients. Future research is expected to build prediction models for both advanced AD and MCI based on more detailed diagnostic information. Second, our small sample size to some extent limited our predictive power. Third, the quality of audio differed depending on the participant and time, raising the possibility that this affected the performance of the prediction models. Fourth, the questionnaire based on TICS-J that was used to assess cognitive function was conducted between the AI computer program and participants; the limited speech recognition ability of the AI computer program may affect the validity of the obtained results. Fifth, we relied only on superficial vocal features such as pitch and intensity in the analysis, raising the possibility of information loss and insufficient audio feature extraction. Further research could include natural language processing of speech content and sentence structure analysis in order to reduce information loss and increase model prediction performance. In practice, this tool could be helpful as a gatekeeper for early diagnosis of AD in potential patients' daily lives. For the final diagnosis, it is necessary to also consider other symptoms, along with medical doctors' judgements.

Conclusions
Prediction models based on machine learning algorithms that use only vocal features from daily conversations showed strong predictive performance for AD risk, comparable with an existing cognitive test. This opens the possibility of developing new accessible, low-cost pre-screening tools for AD risk among the general population, outside a clinical setting.
Supporting information
S1 Table. The questionnaire for assessing cognitive function based on TICS-J. (DOCX)