Non-verbal speech cues as objective measures for negative symptoms in patients with schizophrenia

Negative symptoms in schizophrenia are associated with significant burden, and few if any robust treatments are available in clinical practice today. One key obstacle impeding the development of better treatments is the lack of an objective measure. Since negative symptoms almost always adversely affect speech production in patients, speech dysfunction has been considered a viable objective measure. However, researchers have mostly focused on the verbal aspects of speech, with scant attention to the non-verbal cues in speech. In this paper, we explore non-verbal speech cues as objective measures of negative symptoms of schizophrenia. We collected an interview corpus of 54 subjects with schizophrenia and 26 healthy controls. To validate the non-verbal speech cues, we computed the correlation between these cues and the NSA-16 ratings assigned by expert clinicians. Significant correlations were obtained between these non-verbal speech cues and certain NSA indicators. For instance, the correlation between Turn Duration and Restricted Speech is -0.5, and the correlation between Response Time and NSA Communication is 0.4, indicating that poor communication is reflected in the objective measures and supporting our claims. Moreover, certain NSA indices can be classified into observable and non-observable classes from the non-verbal speech cues by means of supervised classification methods. In particular, the accuracies for Restricted speech quantity and Prolonged response time are 80% and 70% respectively. We were also able to distinguish healthy controls from patients using non-verbal speech features with 81.3% accuracy.


Introduction
Schizophrenia is a chronic and disabling mental disorder with heterogeneous presentations. and formant frequencies to the negative symptoms was studied for 25 first episode psychosis patients. In [31] interviews of 20 subjects with flat affect, 26 with non-flat affect, and 20 healthy controls were analysed to determine the motor expressive deficiency in schizophrenic patients.
In [18] clinical ratings of flat affect and alogia were compared to the patient's speech prosody and productivity. The results suggest that acoustic analysis can provide objective measures that may help in clinical assessment. In [32] a semi-automatic method was employed to quantify the degree of expressive prosody deficits in schizophrenia for 45 patients and 35 healthy controls. Using non-verbal speech analysis, the researchers were able to classify patients and controls with 93.8% accuracy. Non-verbal speech cues such as voice tone, volume, and interjections play a crucial role in human interaction and communication [33], and the display of such signals in patients can be used both for distinguishing them from healthy controls and for developing specific and objective treatments. In existing work, speech analysis has mostly been used to determine the presence and/or the severity of symptoms.
In this study we build upon our earlier work [34] and explore the utility of non-verbal speech cues for determining the severity of negative symptoms in schizophrenia. Specifically, we study the correlations between subjective ratings of negative symptoms on a clinical scale during interviews and the objective non-verbal speech features extracted from audio recordings of the same interviews.

Subjects
Fifty-four outpatients diagnosed with schizophrenia from the Institute of Mental Health, Singapore and twenty-six healthy individuals from the general population were recruited for this study. The inclusion criteria were a diagnosis of schizophrenia for the patient group or no mental illness for the control group, age 16-65, English speaking, and fitness to provide informed consent. The capacity to consent was assessed by asking participants to describe the purpose and procedures of the study to the interviewers. All the participants finally selected for the study were consenting adults above 18 years of age. The exclusion criteria included a history of strokes, traumatic brain injuries and neurological disorders such as epilepsy. The Structured Clinical Interview for DSM-IV (SCID) was conducted for all participants by trained research psychologists to ascertain the diagnosis of schizophrenia for patients and the absence of mental illness for healthy individuals. Ethics approval for the study was provided by the National Healthcare Group Domain Specific Review Board. Written informed consent was obtained from all participants. The sample characteristics are presented in Table 1.

Clinically-rated symptom measures
The 16-item Negative Symptom Assessment (NSA-16) [35] is a reliable and validated scale used to measure the severity of negative symptoms through a semi-structured interview. It contains 16 items, each rated on a 6-point Likert scale where higher ratings indicate more severe impairments. In addition to the individual item scores, the scale also provides a global negative symptom rating (based on the overall clinical impression of negative symptoms in the patient), a total score (sum of the scores on the 16 items), and five negative symptom domain scores: Communication (sum of the scores of items 1-4), Affect/Emotion (sum of the scores of items 5-7), Social Involvement (sum of the scores of items 8-10), Motivation (sum of the scores of items 11-14), and Retardation (sum of the scores of items 15 and 16). The NSA-16 has demonstrated high internal consistency (Cronbach's alpha = 0.92) and inter-rater reliability (Kappa value = 0.89) [36]. In this study, the NSA was rated by trained research psychologists and the inter-rater reliability was 0.92. The meanings of the NSA items and domains are shown in Table 2.
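The scoring scheme described above can be illustrated with a short Python sketch. It is not the scoring software used in the study: the function name `nsa16_scores`, the dictionary layout and the example ratings are assumptions made for illustration, while the item-to-domain mapping follows the text.

```python
# Item numbers (1-based) belonging to each NSA-16 domain, as listed in the text.
NSA16_DOMAINS = {
    "Communication": range(1, 5),        # items 1-4
    "Affect/Emotion": range(5, 8),       # items 5-7
    "Social Involvement": range(8, 11),  # items 8-10
    "Motivation": range(11, 15),         # items 11-14
    "Retardation": range(15, 17),        # items 15-16
}


def nsa16_scores(item_ratings):
    """Return the five domain scores and the total score for one participant.

    item_ratings: list of 16 integer ratings (each between 1 and 6), item 1 first.
    The global rating is a separate clinical judgement and is not derived here.
    """
    if len(item_ratings) != 16:
        raise ValueError("expected 16 item ratings")
    if any(not 1 <= r <= 6 for r in item_ratings):
        raise ValueError("each rating must lie between 1 and 6")
    scores = {
        domain: sum(item_ratings[i - 1] for i in items)
        for domain, items in NSA16_DOMAINS.items()
    }
    scores["Total"] = sum(item_ratings)
    return scores


# Hypothetical ratings for illustration only (not study data).
example_ratings = [3, 2, 4, 1, 2, 2, 3, 1, 1, 2, 4, 3, 2, 2, 1, 1]
print(nsa16_scores(example_ratings))
```

Because the five domains partition the 16 items, the domain scores always add up to the total score.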

Acquisition of non-verbal speech features
Equipment and procedure. The system deployed in this paper is based on our earlier work [37,38], where we developed a machine learning based system that is able to infer the levels of interest, dominance, and agreement with 85%, 86% and 82% accuracy respectively from dyadic conversations. We employed easy-to-use portable equipment for recording conversations; it consisted of a lapel microphone for each of the two speakers and an H4N audio recorder that allowed multiple microphones to be interfaced with a laptop. The audio data was recorded as a 2-channel .wav file (one channel for each speaker). This file allows us to detect which speaker is speaking at any given time.
The patient and the psychologist wore their respective microphones during the whole NSA interview, and the whole conversation was recorded. There was no pre-determined duration for the interview; instead it depended on the participant's responses to the questions asked by the psychologist. On average, the interviews lasted about 30 minutes. Before the actual recording, we ensured that all the devices were connected. During the interview the software applications were monitored from another room via remote desktop to ensure that the recording device functioned properly while maintaining the confidentiality of the participants' speech.

Extraction of non-verbal cues.
We considered two types of low-level speech metrics: conversational and prosody-related cues. The conversational cues accounted for who was speaking, when and by how much, while the prosodic cues quantified how people talked during their conversations. A detailed list of conversational cues is shown in Table 3. We used Matlab to compute the following conversational cues: the number of natural turns, speaking percentage, mutual silence percentage, turn duration, natural interjections, speaking interjections, interruptions, failed interruptions, speaking rate and response time [37]. We considered the following prosodic cues: amplitude, larynx frequency (F0), formants (F1, F2, F3), and mel-frequency cepstral coefficients (MFCCs). These cues were extracted from 30 ms segments at a fixed interval of 10 ms; they tended to fluctuate rapidly in time. Therefore, we computed various statistics of those cues over a time period of several seconds, including minimum, maximum, mean and entropy. The prosodic features are standard audio features used in research, whereas the conversational features have been designed specifically for dyadic conversations. Table 4 provides the definitions of these conversational features.
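To make the conversational cues more concrete, the following Python sketch computes a few of them from two speech-activity tracks. It is a simplified illustration and not the Matlab implementation used in the study. It assumes that a voice activity detection step has already produced one boolean value per 10 ms frame for each channel, and it treats every uninterrupted run of speech frames as a turn, which is coarser than the turn definitions in [37]. All function and key names are hypothetical.

```python
FRAME_SEC = 0.01  # assumed frame step of 10 ms, matching the interval in the text


def runs_of_speech(active):
    """Lengths (in frames) of the maximal runs of True values in a boolean list."""
    runs, current = [], 0
    for is_speaking in active:
        if is_speaking:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def conversational_cues(active_a, active_b):
    """Simplified cues for speaker A in a dyadic conversation with speaker B."""
    if len(active_a) != len(active_b) or not active_a:
        raise ValueError("need two non-empty activity tracks of equal length")
    n = len(active_a)
    speaking_a = 100.0 * sum(active_a) / n
    speaking_b = 100.0 * sum(active_b) / n
    # Frames in which neither speaker is active.
    silent = sum(1 for a, b in zip(active_a, active_b) if not a and not b)
    turns = runs_of_speech(active_a)
    mean_turn = FRAME_SEC * sum(turns) / len(turns) if turns else 0.0
    return {
        "speaking_pct": speaking_a,
        "speaking_pct_difference": speaking_a - speaking_b,
        "mutual_silence_pct": 100.0 * silent / n,
        "turn_duration_sec": mean_turn,
    }


# Toy activity tracks of ten frames each (illustrative values, not study data).
track_a = [True] * 3 + [False] * 5 + [True] * 2
track_b = [False] * 4 + [True] * 3 + [False] * 3
print(conversational_cues(track_a, track_b))
```

Cues such as interruptions or interjections additionally require comparing the timing of the two tracks and are left out of this sketch.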

Statistical analyses
First, Matlab was used to compute Pearson's correlation between the objective audio features and the subjective negative symptom ratings. In the second step, we analyzed the automated prediction of negative symptoms from audio features. We determined the prediction accuracy for some NSA-16 criteria. The rating scale ranges from 1 to 6; ratings of 1 and 2 were coded as the non-observable group, and ratings between 3 and 6 were coded as the observable group.
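The two steps above, correlating a speech feature with a rating and coding ratings into two classes, can be sketched in Python as follows. This is an illustrative sketch rather than the Matlab code used in the study; the function names and the toy values are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def binarize_rating(rating):
    """Code an NSA-16 item rating (1-6) as described in the text."""
    if not 1 <= rating <= 6:
        raise ValueError("rating must lie between 1 and 6")
    return "non-observable" if rating <= 2 else "observable"


# Toy values for illustration only (not study data): one feature value and
# one item rating per participant.
turn_duration = [12.0, 9.5, 7.0, 4.0, 3.5]
item_rating = [1, 2, 3, 5, 6]
print(pearson_r(turn_duration, item_rating))
print([binarize_rating(r) for r in item_rating])
```

In the toy data, shorter turns go with higher ratings, so the coefficient is negative, which mirrors the direction reported for Turn Duration in the abstract.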
We then used classification algorithms to determine the accuracy with which observable and non-observable classes can be differentiated. We performed leave-one-patient-out cross-validation to calculate the accuracy and AUC for these criteria. For feature selection we utilized CFSsubset attribute selection [39], and Correlation attribute selection [40]. CFSsubset attribute selection evaluates the worth of a subset of features by considering the individual predictive ability of each feature along with the degree of intercorrelation between the features. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred. Correlation attribute selection [40] evaluates the worth of an attribute by measuring the Pearson correlation between it and the class label. At the end we present the results for the classification of the healthy controls and individuals with schizophrenia. The classification was computed by leave-one-person-out cross-validation, i.e., for each participant the classifier was tested on the instances of that participant and trained on all the remaining instances.
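The leave-one-person-out scheme can be sketched in Python as shown below. The splitting logic follows the description above; the nearest-centroid classifier is only a stand-in chosen to keep the example self-contained and is not one of the classifiers evaluated in the study. All names and the toy data are hypothetical.

```python
def leave_one_person_out(person_ids):
    """Yield (held_out_person, train_indices, test_indices) for each person."""
    for person in sorted(set(person_ids)):
        test = [i for i, p in enumerate(person_ids) if p == person]
        train = [i for i, p in enumerate(person_ids) if p != person]
        yield person, train, test


def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier: return the label whose mean feature vector is closest."""
    centroids = {}
    for label in set(train_y):
        rows = [f for f, y in zip(train_x, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))

    # Sorting the labels makes tie-breaking deterministic.
    return min(sorted(centroids), key=lambda label: sq_dist(centroids[label]))


def lopo_accuracy(features, labels, person_ids):
    """Fraction of instances classified correctly under leave-one-person-out."""
    correct = 0
    for _, train, test in leave_one_person_out(person_ids):
        train_x = [features[i] for i in train]
        train_y = [labels[i] for i in train]
        for i in test:
            correct += nearest_centroid_predict(train_x, train_y, features[i]) == labels[i]
    return correct / len(labels)


# Toy data: two instances per person, one feature per instance (not study data).
features = [[1.0], [1.2], [0.9], [1.1], [3.0], [3.2], [2.9], [3.1]]
labels = ["patient"] * 4 + ["control"] * 4
person_ids = ["p1", "p1", "p2", "p2", "c1", "c1", "c2", "c2"]
print(lopo_accuracy(features, labels, person_ids))
```

Holding out all instances of a participant at once prevents the classifier from being tested on data from a person it has already seen during training.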

The correlations between non-verbal speech features and NSA scores
The colormaps in Fig 2 showed

Non-Verbal Feature | Description
Natural Turn-Taking | The number of times person 'A' speaks in the conversation without interrupting person 'B' (see Fig 1). Normalized to per minute.
Turn Duration | The average duration of a speaker's turn.
Speaking % | The percentage of time a person speaks in the conversation.
Speaking % Difference | The difference between the speaking percentages of both speakers.
Mutual Silence % | The percentage of time when both participants are silent.
Interruption | Person 'A' interrupts person 'B' while speaking, and takes over. Person 'B' stops speaking before person 'A' does (see Fig 1).
Speaking Interjection | Short utterances such as 'okay', 'hmm' etc. when the other speaker is speaking (see Fig 1).
Speech Gap | The gap that a person takes between his/her consecutive turns.

Table 7 presents the patient vs. healthy control classification results for various machine learning algorithms along with the best five features. In the last row we present the statistics for a baseline classifier which classifies every participant as a patient. The highest accuracy is 81.3%, which is promising because it suggests that non-verbal speech features can differentiate between healthy individuals and individuals with schizophrenia. Fig 3 presents boxplots and results (p-values) for the Kruskal-Wallis test. We tested whether a feature is significantly different for the healthy and patient groups. The plots are only shown for features with p-values below 0.01 for the Kruskal-Wallis test. It can be seen from the figures that frequency and volume entropies show significant differences among the prosodic features, while speaking rate shows a significant difference among the conversational features. The healthy group has higher frequency and volume entropy than the patient group. This finding implies that the healthy subjects speak in a less monotonous manner than the patients and have more variability in the volume of their speech. Similarly, the speaking rate seems to be higher for the healthy group. These findings are in line with the results presented in [27], where speech rate significantly discriminated patients and healthy controls, and [31], where patients were found to be less fluent.
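For readers unfamiliar with the Kruskal-Wallis test, the following Python sketch shows how the test statistic and p-value can be computed when two groups are compared, as in the healthy vs. patient comparison above. It is a minimal illustration under stated assumptions, not the analysis code of the study: the function name and the toy values are hypothetical, tied values receive average ranks with the usual tie correction, and the p-value uses the chi-square approximation with one degree of freedom, which is only approximate for small samples.

```python
import math
from collections import Counter


def kruskal_wallis_two_groups(group_a, group_b):
    """Return (H, p) for the Kruskal-Wallis test on two lists of numbers."""
    pooled = list(group_a) + list(group_b)
    n = len(pooled)
    counts = Counter(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of, position = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2  # mean of ranks position+1 .. position+t
        position += t

    # H statistic from the rank sums of the groups.
    h = 0.0
    for group in (group_a, group_b):
        rank_sum = sum(rank_of[v] for v in group)
        h += rank_sum ** 2 / len(group)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

    # Correction for tied values.
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical")
    h = max(h / correction, 0.0)

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(h / 2))
    return h, p


# Toy feature values for two groups (illustrative only, not study data).
healthy = [4.0, 5.0, 6.0]
patients = [1.0, 2.0, 3.0]
print(kruskal_wallis_two_groups(healthy, patients))
```

With more than two groups the same H statistic applies, but the p-value would require the chi-square distribution with one degree of freedom fewer than the number of groups.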

Discussion
As can be observed from Fig 2, significant correlations exist between the subjective ratings (NSA-16 items) and the objective measures (conversational features). An interesting point to note here is that the absolute values of the correlations are higher in Fig 2(a) and 2(c) compared to those in Fig 2(b). This difference can be attributed to the fact that NSA-16 items 1-9 (see Table 2) are closely associated with speech impairments. Consequently, these specific NSA-16 items are in greater congruence with the objective speech-related measures than the rest of the NSA-16 items, yielding higher absolute values of correlations. Similarly, Fig 2(c) contains the cumulative NSA-16 items, which is reflected in the overall higher absolute correlations. Table 5 presents the correlation values for the NSA-16 items and speech features, providing the complete picture by listing all the correlations and their corresponding p-values. It can be noted that features such as Failed Interrupts, Response Time and Overlap correlate directly with the negative symptoms, i.e., cases which saw an increased Response Time or Overlap during the interview also saw a higher rating on the equivalent items of the subjective NSA-16 scale, indicating greater severity of negative symptoms. The reverse situation occurred with features such as Natural Turns, Speaking %, or Turn Duration; reduced values of such features, often associated with disjoint communication, saw increased ratings on the speech-related NSA-16 items, resulting in negative correlations.
An interesting observation is the significant correlation between Reduced Expressive Gestures and the non-verbal speech features. It has a significant positive correlation with Mutual Silence and Speech Gap, showing a decrease in gesture usage if there is more silence. On the other hand, Reduced Expressive Gestures has a significant negative correlation with Turn Duration and Speaking %, which shows that gesture usage increases with an increase in speech.
The results in Table 6 present the detection accuracies for Prolonged time to respond, Restricted speech quantity, Impoverished speech content, Emotion: Reduced range, Affect: Reduced modulation of intensity, Reduced social drive, and Reduced expressive gestures. We achieve relatively high accuracies of 79.6%, 77.8% and 70.4% for Prolonged time to respond, Reduced expressive gestures and Restricted speech quantity, respectively. For Impoverished speech content, Emotion: Reduced range, and Affect: Reduced modulation of intensity, we achieve rather moderate accuracies of 59.3%, 53.7%, and 59.3%, respectively. The NSA item Reduced social drive has a large imbalance between the non-observable and observable classes. These results correspond to Table 5, where high correlation values lead to a better prediction in most of the cases.
The results in Table 7 present the accuracies for patient vs. healthy classification. The highest accuracy of 81.3% was achieved by the multi-layer perceptron. Other algorithms such as SGD, Random Forest, and Random-subspace (ensemble) achieve 70.0%, 72.5%, and 77.5% respectively [41].
In this paper we have presented our study on the objective and automated analysis of the speech deficiencies associated with schizophrenia. Our study is unique in several aspects. First of all, our dataset contains 80 subjects, including 54 patients and 26 healthy controls, which is a larger group than in most related studies. Moreover, the recordings are about 30 minutes long, substantially longer than the recordings in similar studies, which typically last only a few minutes. Also, we have not edited the recordings in any way, and we analyzed the entire recordings instead of selecting particular segments. As a result, our approach could be applied in realistic settings such as face-to-face interviews or phone calls, as there is no need for manual editing.
Moreover, the semi-structured nature of our participant interviews attempts to recreate an environment that is close to real-life clinical settings. We are interested in the social and cognitive behavioral patterns of the participants in their everyday lives, hence we decided not to apply any external stimuli, in contrast to earlier studies [29].
Our approach is more comprehensive than earlier studies, as we combine both prosodic and conversational speech cues; this allowed us to conduct a more in-depth analysis. These objective cues, once validated through their correlations with established subjective measures, were utilised to predict the aforementioned subjective measures and to distinguish patients from healthy controls. The conversational speech cues used in this study and the correlations between NSA items and these conversational features are unique to our research.
A few earlier studies describe attempts to develop automated methods to analyze audio and speech features of individuals suffering from schizophrenia. However, these approaches have the following limitations. Either the studies only consider prosodic cues related to flattening of affect or alogia, as in [29], [30], [31], and [18], or they are limited in the duration of speech data, as in [30] (duration: 1 minute), [31] (duration: 10 minutes) and [32] (first paragraph read from a medieval classic). Only Cohen et al. [27] explored a single conversational feature (speech rate) together with other prosody features (inflection). None of the above studies utilized these speech features to distinguish between patients and healthy controls, with the exception of the one conducted by Martinez et al. [32], who achieved a classification accuracy of 93.8%. However, as pointed out earlier, their data is of rather short duration, and hence these results may be less reliable than our results, obtained from 30-minute recordings. Such long-term recordings give us the opportunity to explore the correlations between non-verbal speech cues and negative symptoms in individuals with schizophrenia in greater depth. We believe that more studies of this nature are required, with more and longer recordings in realistic settings, to fully establish the effectiveness of non-verbal audio cues for assessing the negative symptoms of schizophrenia. These results can be a stepping stone towards building an automated tool which could predict negative symptoms by analyzing the speech of a patient. The results of our patient vs. healthy classification analysis are also very promising. They show that participants can indeed be classified as individuals with schizophrenia or as healthy individuals on the basis of their non-verbal speech features.
In the future, we also plan to explore the temporal variation of the speech cues for the subjects and controls, specifically how their speech features change over sessions and whether they change in a different manner for subjects and controls. In this paper, we have only investigated the non-verbal cues related to speech. Visual non-verbal cues such as posture and gestures have not been investigated here; we plan to address them in the future.
The present study is not without limitations. As this is a cross-sectional study, no conclusion can be drawn about the stability of the relationship between the objective measures and the subjective measures over time. Some factors that might affect negative symptoms and speech production, such as participants' smoking history and extrapyramidal symptoms, were not assessed or controlled for in this study. The correlations between the objective speech cues and NSA ratings ranged between -0.5 and 0.5. Even the strongest correlations therefore account for only about 25% of the variance (r² = 0.25), so a large percentage of the variance could not be explained. In addition, this study only tested the relationship between objective measures and subjective measures of negative symptoms in schizophrenia. Future studies might examine this relationship in other psychiatric disorders such as depression or bipolar disorder to explore whether it generalizes to other psychiatric conditions.

Conclusion
In this paper we presented our findings regarding the correlations between the non-verbal speech cues and negative symptom ratings. Existing schizophrenia related research mostly focuses on semantic analysis and natural language processing. Little attention has been given to non-verbal speech features. In this paper we have highlighted the significance of these non-verbal speech features by showing their correlations with the negative symptoms for schizophrenia.
The results of our analysis are promising, as there are significant correlations between non-verbal speech features and NSA-16 ratings. We also predicted NSA-16 criteria using machine learning algorithms trained on subjective ratings. The results show that some of the NSA-16 criteria can be classified as observable or non-observable using non-verbal speech features with quite high accuracy.
The positive findings from our analysis pave the way for identifying speech characteristics as markers for negative symptoms. Developing a technological application that detects clinically significant speech patterns may assist clinicians in postulating the functional level of an individual. With the data gathered in this study, we will work towards the development of a prototype, possibly as a mobile application to facilitate implementation.