Detecting Parkinson’s disease from sustained phonation and speech signals

This study investigates signals from sustained phonation and text-dependent speech modalities for Parkinson’s disease screening. Phonation corresponds to the vowel /a/ voicing task and speech to the pronunciation of a short sentence in Lithuanian language. Signals were recorded through two channels simultaneously, namely, acoustic cardioid (AC) and smart phone (SP) microphones. Additional modalities were obtained by splitting speech recording into voiced and unvoiced parts. Information in each modality is summarized by 18 well-known audio feature sets. Random forest (RF) is used as a machine learning algorithm, both for individual feature sets and for decision-level fusion. Detection performance is measured by the out-of-bag equal error rate (EER) and the cost of log-likelihood-ratio. Essentia audio feature set was the best using the AC speech modality and YAAFE audio feature set was the best using the SP unvoiced modality, achieving EER of 20.30% and 25.57%, respectively. Fusion of all feature sets and modalities resulted in EER of 19.27% for the AC and 23.00% for the SP channel. Non-linear projection of a RF-based proximity matrix into the 2D space enriched medical decision support by visualization.


Introduction
Parkison's disease (PD) is the second most common neurodegenerative disease after Alzheimer's [1] and it is anticipated that the prevalence of PD is going to increase due to population ageing. The loss of dopaminergic neurons can reach up to 50% at the time of clinical diagnosis [2] and rapidly increases completing by 4 years post-diagnosis [3]. Any neuroprotective strategies that may emerge in the near future could be too late to effectively slow down the neurodegenerative process. Therefore, early objective diagnostic markers are critically needed. Amongst many other symptoms, PD manifests itself through speech disorders, which can be observed as early as 5 years before the diagnosis [4]. Investigations show that Parkinsonian vocal dysfunction can be characterized by: reduced vocal tract volume and reduced tongue flexibility, significantly narrower pitch range, longer pauses and smaller variations in pitch range, voice intensity level, and articulation rate. Therefore, automated acoustic analysis is considered by many researchers as an important non-invasive tool for PD screening. To this PLOS  recordings, reported in these researches, could be suspected to be achieved due to the lack of conformity to leave-one-subject-out scheme. Accuracy of 79.17% was obtained by [8] when categorizing the Oxford dataset into the healthy class and three classes of PD of different severity, but it remains not clear, if disjointedness with respect to subjects was followed. Comparison of validation approaches in [10] using the Oxford dataset reveal 81.53±2.17% accuracy for leave-one-individual-out versus 92.75±1.21% accuracy for leave-one-out sampling. Naranjo et al. [27] suggested using a subject-based Bayesian approach to deal with dependency in a "replicated measure-based design" (several recordings from one subject), demonstrating 75.2% accuracy on a dataset of 40 PD and 40 HC subjects. The main emphasis of the related work remains on the extraction of various feature sets. Some researchers use large sets of audio features with an aim to comprehensively characterize recordings [15], including the renowned cepstral coefficients such as Mel-frequency (MFCCs) or perceptual linear predictive (PLPCCs), while others adopt only "clinically useful" measures or apply feature selection [11] to arrive at a compact set of audio descriptors. Comprehensive review of the related work was recently compiled by [17]. There is a lack of studies comparing performance of popular audio feature sets on the same dataset and considering fusion of feature sets from several modalities. Due to variety of datasets and performance assessment procedures used in different studies, and also due to different preferences and approaches for feature engineering, the question concerning the discriminatory power of various well-known audio feature sets remains unanswered. We try to address the aforementioned problems by exploring 18 diverse collections of audio descriptors on the same database recorded through two channels-acoustic cardioid (AC) and smart phone (SP) microphones. Unimodal and multimodal decision-level fusions of individual feature sets from phonation, speech, and voiced / unvoiced modalities are considered for a robust and accurate PD detection. Variable importance as a mean decrease in detector's accuracy is reported. Finally, a convenient solution regarding data visualization for medical decision support is demonstrated.

Phonation and speech data
Two vocal tasks were recorded in a sound-proof booth and treated as separate modalitiesphonation and speech. Phonation modality contains a sustained voicing of vowel /a/ vocalized at a comfortable pitch and loudness level for at least 5 s and repeated 3 times. Speech modality contains a single pronunciation of a phonetically balanced sentence in a native Lithuanian language,-"turėjo senelė žilą oželį",-which translates into "granny had a little greyish goat". Speech recording was split using Praat software into voiced / unvoiced parts, which were treated as additional modalities in experiments. Audio samples were recorded using two channels simultaneously-acoustic cardioid (AKG Perception 220, frequency range 20-20000 Hz) and a smart phone (an internal microphone of Samsung Galaxy Note 3). Both microphones were located at *10 cm distance from the mouth. The audio format was mono PCM wav (16 bits at 44.1 kHz sampling rate). A mixed gender database was collected where 99 subjects had both AC and SP recordings. One PD male subject had AC speech recording missing, therefore, fusion of modalities for AC channel was possible only for 98 subjects. Full details are in Table 1.

Audio feature sets
Information from an audio recording of phonation or speech signal can be extracted using a variety of signal analysis techniques. Computed measures are commonly known as features. Full list of audio feature sets used in this study is provided in Table 2. All the feature sets were published before and most have publicly downloadable feature extractors. With regard to the amount of signal used for calculations, features can be categorized into: • global, long-term or high-level descriptors; • local, short-term or low-level descriptors (LLDs).
The local features are obtained by dividing a recording into short overlapping frames and applying an algorithm that computes a respective LLD for each frame. LLDs subsequently can be compressed into high-level descriptors by computing various statistical functionals. The feature sets # 1-12 had their own predefined choice from 42 statistical functionals. Statistical functionals for the feature sets # 13-17 correspond to the following 13 characteristics: minimum, maximum, mean, median, lower quartile (Q lo ), upper quartile (Q up ), trimean

OpenSMILE features
The feature sets # 1-12 are computed using preconfigured setups available in the openSMILE [28] toolkit (version 2.2 RC 1). Name of each feature set is identical to the name of the configuration (.conf) file. Most of these setups are quite similar, therefore, for illustration, only contents of emobase.conf are specified in Table 3. The feature set emobase, introduced for emotion recognition, contains 26 LLDs and also the 1st derivative (delta or velocity) of each LLD. To summarize various aspects of frame-based data distribution for each LLD and its delta, a collection of statistical functionals is applied. The overall size of the feature set is 988 features = (26 LLDs + 26 deltas) × 19 functionals. The file emobase.conf contains these processing-related settings: • pitch and pitch envelope are estimated using pre-emphasis (of 0.97) and overlapping (by a step of 10 ms) Hamming windows (of 40 ms duration); • other LLDs are obtained without pre-emphasis and the signal is windowed into overlapping (by a step of 10 ms) Hamming windows (of 25 ms duration).
Computed LLDs are smoothed with a simple moving average filter (window size = 3) before compressing by statistical functionals.

Essentia descriptors
The feature set # 13 was computed using an open-source C++ library for audio analysis-Essentia [29] (version 2.1 beta 2)-and its out-of-the-box feature extractor streaming_extractor_freesound.exe (version 0.3). The lowlevel and sfx descriptor types were used and the tonal and rythm descriptor types were discarded (due to the fact that analysed signals are human voice and speech but not music). A detailed list of 1915 (17 global + 146×13 local) descriptors: • 1 global descriptor of the lowlevel type-average loudness; • 16 global descriptors of the sfx type-5 temporal (centroid, decrease, kurtosis, skewness, spread), 4 morphological (the ratio between the index of the maximum value of the envelope of a signal and the total length of the envelope, the ratio of the temporal centroid to the total length of a signal envelope, the weighted average of the derivative after the maximum amplitude, the maximum derivative before the maximum amplitude), pitch centroid, strong decay, flatness, log attack time of a signal envelope, the ratio between the index of the maximum value of the pitch envelope of a signal and the total length of the pitch envelope, the ratio between the index of the minimum value of the pitch envelope of a signal and the total length of the pitch envelope, and the ratio between the pitch energy after the pitch maximum to the pitch energy before the pitch maximum; • 141 local descriptors of the lowlevel type-spectral energy in 77 bands (28 frequency bands, 4 bands of low/mid-low/mid-high/high frequencies, 18 ERB bands, 27 Bark bands), 3 statistics of spectral energy in Bark bands (kurtosis, skewness, spread), 13 MFCC, 13 GFCC (using Gammatone filterbank), 15 spectral (energy, entropy, complexity, centroid, strong peak, crest, Masri-Bateman high frequency content measure, RMS, roll-off, decrease, flatness in dB, flux, kurtosis, skewness, spread), 6 spectral contrasts, 6 spectral contrast valleys, 3 pitch-related (pitch, instantaneous confidence of pitch, salience of pitch), 3 silence rates (20 dB, 30 dB, 60 dB), dissonance, and zero-crossing rate; • 5 local descriptors of sfx type-3 tristimulus values, inharmonicity, and odd-to-even harmonic energy ratio.

KTU features
The feature set # 15 was introduced for voice pathology screening by [36] at Kaunas University of Technology and later expanded to include additional features. The latest variant of this feature set was devised here by combining feature subsets # 1-13 of [31] with MFCC and PLPCC features of [32]. For MFCC and PLPCC features the signal is pre-emphasized by 0.97 and frames are computed using the sliding 10 ms (440 samples) Hamming window with 5 ms overlap. The frame-based 19 MFCCs and 19 PLPCCs were characterized by 13 statistical functionals, resulting in a subset of 494 features. Combining 773 [31] and 494 [32] features formed the KTU feature set of 1267 features.

jAudio features
The feature set # 16 was computed using the Java application jAudio [33] (version 0.4.5.1), which was developed as a standardized audio feature extraction system for automatic music classification. All features selected were frame-based with window size of 1024 (corresponding to *23.3 ms frame length) and window overlap of 50%. A detailed list of 1794 (138×13 local) features: 100 area (zeroth moment) estimates from 2D method of moments analysis of spectral data frames, 13 MFCC, 10 LPC, 4 spectral (centroid, flux, rolloff point, variability), 3 strongest frequency (via zero crossings, via spectral centroid, via FFT maximum), 2 partial-based spectral (centroid, flux), peak-based spectral smoothness, compactness, root mean square, fraction of low energy windows, relative difference function, and zero crossings.

Tsanas features
The feature set # 18 contained various dysphonia measures and was dedicated initially specifically for PD screening. The Matlab code to compute these features is publicly available as Voice Analysis Toolbox (version 1.0) and the full list of 339 features is described in PhD thesis of [35]. Collection of audio features contains: jitter variants, shimmer variants, harmonic-to-noise ratio, noise-to-harmonics ratio, glottal quotient, glottal-to-noise excitation ratio, vocal fold excitation ratio, entropy of intrinsic mode functions from empirical mode decomposition, log energy, 13 MFCCs and their 1st and 2nd differences, de-trended fluctuation analysis, pitch period entropy and recurrence period density entropy.

Methodology
Random forest (RF) [37] was used as a supervised algorithm to detect PD and also to fuse information in the form of soft decisions, obtained using various audio feature sets from separate modalities.

Random forest
RF is a committee of decision trees, where the final decision is obtained by majority voting. The basic idea of RF is to combine many (B in total) unprunned CART (classification and regression tree) models, built on different bootstrap samples of the original dataset X and a random subset (of predetermined size q) of features x 1 , . . ., x p . For our experiments B was 5000 and after testing several specific values of q ( ffiffi ffi p p , 2 Á ffiffi ffi p p , 1 2 Á p) the best performing (giving the lowest C llr ) q was chosen.
RF is known to be robust against over-fitting and as the number of trees increases, the generalization error converges to a limit [37]. Low bias and low correlation are essential for the robust generalization performance of the ensemble. To get low bias, trees are unpruned (grown to the maximum depth). To achieve the low correlation of trees, randomization is applied.
RF is constructed in the following steps: 1. Choose the forest size B as a number of trees to grow and the subspace size q p as a number of features to provide for each tree node.
2. Draw a bootstrap sample (random sample with replacement) of the dataset, which generally results in $ 2 3 Á n unique observations for training, thus leaving $ 1 3 Á n for testing as the outof-bag (OOB) dataset for that particular tree, where n is the number of observations in the dataset.
3. Grow an unpruned tree using the bootstrap sample. When growing a tree, at each node, q variables are randomly selected out of the p available.
The generalization performance of RF was evaluated using internal out-of-bag (OOB) validation, where each observation is classified only by the trees which did not have this observation in the bootstrap sample during construction. It is well known that the OOB validation provides an unbiased estimate of a test set error, similar to the leave-one-out scheme. Because of the "repeated measures" aspect, often arising in the phonation modality when each subject is represented by several recordings of voiced vowel, the sampling part of the Matlab implementation [38] had to be modified to ensure that all recordings of each subject are included either in a bootstrap sample or left aside as OOB. Added modification conforms to the leaveone-subject-out approach and helps to avoid biased evaluation when pathology detection intermingles with speaker detection. Additionally, the RF setting of stratified sampling was configured to preserve the class and gender balance of the full dataset in each bootstrap sample.

Decision-level fusion
Individual RFs were built independently using various feature sets and OOB decisions of these individual experts were combined in a meta-learner fashion. RF was applied both as a baselearner and as a meta-learner. Therefore, outputs from the first stage base RFs are concatenated into a new feature vector, which becomes an input for the second stage meta RF. In the detection task, an input to the meta-learner is the difference between class posteriori probabilities computed by the base-learner. Given a trained base-learner, this difference is estimated as: where x is the object being classified, b is the number of trees t 1 , . . ., t b in the RF for which observation x is OOB, c is a class label (1 corresponds to HC, 2 to PD), and f(t i , x, c) stands for the c-th class frequency in the leaf node, into which x falls in the i-th tree t i of the forest: where C is the number of classes and n(t i , x, c) is the number of training data from class c falling into the same leaf node of t i as x.
Additionally, for the purpose of visualization, a data proximity matrix F was obtained from the best meta-RF. Proximity matrix is constructed as follows: observations, represented by the meta-features, are run down each tree grown and the matrix element ϕ ij is increased by one when two observations x i and x j are found in the same terminal node of the tree. After the meta-RF is constructed, proximities are obtained and divided by the total number of trees in the meta-RF. To project data into the 2D space, the proximity matrix F was converted through a simple 1 − F operation into a distance matrix and was provided as an input to the t-distributed stochastic neighbor embedding (t-SNE) algorithm [39] to implement dimensionality reduction. The main tunable parameter of t-SNE is perplexity, which controls the trade-off between concentrating on local versus global aspects of the data [40] and is comparable to the number of nearest neighbors in other manifold learning algorithms.
Assessing detection RF detector's scores for OOB data were used to evaluate the goodness of detection. Votes of RF were converted to a proper score by dividing votes for a specific class from the total number of times the case was OOB, as in formula (1). Soft decision (score) instead of hard decision (predicted class) makes evaluation more precise by enabling visual summary of detection performance through the detection error trade-off (DET) curve, as recommended by [26]. A quick way to compare detectors with different DET curves is the equal error rate (EER)-the equilibrium point where curve intersects diagonal [41] and false positive rate (miss rate) becomes equal false negative rate (false alarm rate) or true positive rate (sensitivity) becomes equal true negative rate (specificity). The minimum cost of log-likelihood-ratio (C llr ) is a comprehensive detection metric used here as the main criterion for model selection. The log-likelihood-ratio is the logarithm of the ratio between the likelihood that the target (PD person) produced the signal and the likelihood that a non-target (HC person) produced the signal. The DET curve, EER and C llr measures were computed by the ROC convex hull method using the BOSARIS toolkit [42]. A well-calibrated and useful detector should provide C llr < 1 and EER < 50%.

Experimental results
The detection performance of individual feature sets was evaluated by estimating recordingbased C llr and EER measures. Then various unimodal and multimodal decision-level fusions were tested. Numbers corresponding to the minimum of each table column are denoted in bold italic font style.

Individual feature sets
Detection performance obtained using individual feature sets is summarized in Tables 4 and 5. DET curves of the best performing feature set (having the lowest EER) for each modality are provided in Fig 1. Essentia descriptors were the best using the AC channel for phonation, speech and voiced modalities, providing EER of 20.78%, 20.30%, and 24.52%, respectively. The best performance for unvoiced modality using the AC channel was EER of 24.89% obtained by IS13_ComParE features. The best individual performance using the SP channel for phonation, speech, voiced and voiced modalities was observed with Tsanas, jAudio, IS11_speaker_state and YAAFE features, providing EER of 29.02%, 26.12%, 28.36% and 25.57%, respectively. If instead of EER we consider C llr , the best feature set for the unvoiced modality of the AC channel is IS12_speaker_trait_compat and for the voiced modality of the SP channel is Tsanas. Phonation is often outperformed by the speech, especially in the SP channel, where the exception to this tendency is shown only by 2 feature sets (# 1-2) according to C llr or 5 feature sets (# 1-4, 8) according to EER. Interestingly, the best individual performance in the AC channel was observed for the speech recording, but in the SP channel for the unvoiced part of the speech.
Variable importance analysis for the best performing feature sets is summarized in S1 File. Results for the AC phonation and speech modalities indicate the frequency band, Bark frequency band and the spectral statistics as the most important audio features. Results for the SP microphone indicate the MFCCs in speech and spectral statistics in unvoiced modality as the most important audio features.

Decision-level fusion
Decision-level fusion of individual feature sets from all modalities helped to improve detection performance slightly according to EER (compare Table 5 with Table 6 and DET curves in Fig  1), where the best average EER was 19.27% for the AC and 23% for the SP channel. Meanwhile, according to C llr no fusion variant could improve over performance of the best individual feature set from the single modality (compare 0.529 of AC speech and 0.623 of SP unvoiced in Detecting Parkinson from phonation and speech Table 4 with 0.553 of AC S+V+U and 0.646 of SP S+U in Table 6). Therefore, for the data investigated decision-level fusion remains of questionable effectiveness.

Visualization by the 2D map
The proximity matrix, obtained from the meta-RF, contains information about the pair-wise similarity between recordings with respect to the various feature sets. A compelling property of such a matrix is that only feature sets contributing to the construction of the meta-RF have an affect on similarity values. Therefore, the influence of unimportant, noisy feature sets gets reduced. Proximity matrices were obtained from the ultimate meta-RF model built fusing   Fig 2, where separate clusters of PD and HC subjects can be noticed. Having a new recording with an unknown diagnosis, it could be converted to the audio features and fed to the first stage RFs, constructed on the individual feature sets. Then resulting decisions should be streamlined to the second stage meta-RF and its proximity matrix augmented with similarities of this unknown recording to the observations with available diagnosis. Running the t-SNE on a distance matrix, obtained from the augmented proximity matrix, would result in a new recording located as a distinct point in the 2D space among the known cases (among the points in Fig 2). Exploring location of the new recording with respect to the points in the vicinity can be a useful data-driven exploratory approach for PD screening.

Conclusions and future directions
The best individual feature set was Essentia when using speech modality of the AC microphone and YAAFE when using unvoiced modality of the SP microphone, achieving EER of 20.30% and 25.57%, respectively. Speech signal tends to outperform phonation in the PD detection task when using the SP microphone. Splitting of speech signal into voiced / unvoiced modalities, as recommended by [17], was found to be useful in the SP case.
Fusion of all feature sets and modalities resulted in EER of 19.27% for the AC microphone and EER of 23% for the SP microphone. Improvement from fusion was evident only according to EER, but according to the more comprehensive C llr measure fusion is not effective for the data analysed. The non-linear mapping of proximity matrix obtained from the meta-RF into the 2D space was shown to enrich medical decision support by allowing to spot similar cases conveniently. Detecting Parkinson from phonation and speech Detection performance was consistently better for the AC than for the SP microphone.
Nonetheless, text-dependent speech recordings of SP quality and especially their unvoiced part have potential for PD detection. More vocal exercises, like rapid speech movements through succession by the diadochokinetic task of /pa/-/ta/-/ka/ repetition, could also be useful. Additional information is worth considering by tracking an accelerometer signal in a posture test [19] of standing still and/or holding device in a hand with an arm extended or in a gait test [19] of walking. Tapping and reaction time tests [19] or drawing of an Archimedean spiral [43] are an interesting type of tactile tasks which could be recorded using a hand-held device. Fusion of information from diverse non-invasive modalities could help to develop an efficient SP-based tool for PD screening.

Ethical statement
The study protocol has been approved by Kaunas Regional Bioethics Committee (P2-24/ 2013). Written informed consent was obtained from the study participants, patient identifiers were removed to ensure anonymity.
Supporting information S1 File. Variable importance analysis in the task of PD detection is reported for the best performing RF and meta-RF models.