Analyzing the effectiveness of vocal features in early telediagnosis of Parkinson's disease

Recently proposed Parkinson's disease (PD) telediagnosis systems based on detecting dysphonia achieve very high classification rates in discriminating healthy subjects from PD patients. However, the data used to construct the classification models in these studies contain speech recordings of both early and late PD patients with different severities of speech impairment, which leads to unrealistic results. In a more realistic scenario, an early telediagnosis system is expected to be used in suspicious cases by healthy subjects or early PD patients with mild speech impairment. In this paper, considering the critical importance of early diagnosis in the treatment of the disease, we evaluate the ability of vocal features in early telediagnosis of PD using machine learning techniques with a two-step approach. In the first step, using only patient data, we aim to determine the patient group with relatively greater severity of speech impairment using the Unified Parkinson's Disease Rating Scale (UPDRS) score as an index of disease progression. For this purpose, we use three supervised and two unsupervised learning techniques. In the second step, we exclude the samples of this group of patients from the dataset, create a new dataset consisting of the samples of PD patients with less severe speech impairment and healthy subjects, and use three classifiers with various settings to address this binary classification problem. In this classification problem, the highest accuracy of 96.4% and Matthews Correlation Coefficient of 0.77 are obtained using support vector machines with a third-degree polynomial kernel, showing that vocal features can be used to build a decision support system for early telediagnosis of PD.


Introduction
Parkinson's disease (PD) is one of the most common neurodegenerative disorders, affecting the human central, peripheral, and enteric nervous systems [1]. A recent meta-analysis of worldwide data on the prevalence of PD showed that prevalence increases steadily with age, from 41 per 100000 at 40 to 49 years to 1903 per 100000 in those older than 80 years [2]. The standardized incidence rates reported in previous studies range from 16 to 19 per 100000 per year [3]. Many studies have reported that PD incidence also rises steadily with age, peaking at 70 to 79 years [4]. However, it is also noted that this may be due to the difficulty of identifying very elderly patients [3]. These findings show that the aging of the general population will bring about a dramatic increase in the number of people diagnosed with PD [5]. Therefore, there is an increasing need to build reliable telemedicine systems that alleviate the burden of frequent physical visits to the clinic and uncompensated medical expenditures [6][7][8][9]. Although there is no cure for the disease, some symptoms of PD can be suppressed by pharmacological or surgical intervention [10]. Thus, early diagnosis is of critical importance, as it enables the early introduction of treatment that can improve the quality of life and extend the life span of patients. The Unified Parkinson's Disease Rating Scale (UPDRS) is the most commonly used rating tool to follow PD progression and to evaluate the results of surgical, medical, and other interventions [11][12][13]. The UPDRS is composed of three main components: the first is "mentation, behavior and mood", which consists of 4 sections; the second is "activities of daily living", which consists of 13 sections and assesses whether a PD patient can carry out daily tasks without assistance; and the third is "motor", which consists of 27 sections and evaluates muscular control [14,15].
Speech is assessed in two components: primarily in the 5th section of component 2, which assesses whether the patient's vocal output is understandable, and secondarily in the 18th section of component 3, which evaluates whether the patient's vocal output is expressive during a conversation. The UPDRS is widely used because of its various strengths: (i) it assesses both motor disability (second component) and motor impairment (third component); (ii) a teaching videotape is used to standardize its practical application, which enhances inter-rater reliability [10,16]; and (iii) its reliability and validity have been assessed many times in clinimetric evaluations. Its reliability has been examined in the literature in terms of internal consistency [8,11,17,18], inter-rater reliability [11,17], intra-rater reliability [19], test-retest reliability in elderly patients with parkinsonian signs (but not necessarily PD) [19], and test-retest reliability in patients with early, mild PD [12]. The results have demonstrated that, although some items focus on the same aspect of the construct, the UPDRS is one of the most valid and reliable scales that can be used to follow the course of PD. However, as patients live longer, even when some of the symptoms are treated, it becomes more and more difficult for them to visit the hospital for diagnosis, monitoring, and treatment. Therefore, telemonitoring of signs can complement traditional clinical examinations [20] and decrease the number of physical visits to clinics. Consequently, the lives of PD patients and their relatives may become easier and the workload of clinicians may be reduced.
Vocal impairment is one of the most important signs of PD, since it is seen in approximately 90% of patients in the earlier stages of the disease [8,18]. Therefore, there is an increasing interest in building PD diagnosis and telemonitoring systems based on vocal features. Telediagnosis systems aim to discriminate PD patients from healthy subjects [8,21,22,23,24,25], whereas telemonitoring systems aim to predict clinical evaluation metrics in order to track the progression of the signs of the disease [8,14,26]. Most of the telediagnosis studies use a Parkinson voice dataset that is available online and consists of 195 voice recordings belonging to 23 PD patients and 8 healthy subjects [8]. In a recent study, clustering-based feature weighting and a complex-valued artificial neural network were combined to discriminate healthy subjects from PD patients, and 99.52% classification accuracy was achieved on this dataset [22]. Other studies that address the telediagnosis problem have obtained similar classification performance by combining feature selection and classification algorithms [21,24,25].
PD telemonitoring studies based on speech recordings of PD patients aim to map the vocal features to a clinical evaluation system used to describe how the signs of Parkinson's disease progress. Since the UPDRS is the most widely used scale, many researchers are trying to estimate the whole or a part of the UPDRS score using data that is retrieved by teleprocessing. In a study conducted at the University of California, data collected with an at-home testing device recording both motor and speech data were used with linear regression to estimate the UPDRS score [20]. Khan et al. [27] extracted 13 features, including the cepstral separation difference and Mel-frequency cepstral coefficients, from 240 running speech samples recorded from 60 PD patients and 20 healthy controls, and predicted the UPDRS Motor Examination of Speech with 85% accuracy using SVM. In another recent study, Bayestehtashk et al. [14] collected data remotely from a relatively large cohort of 168 subjects at three different clinics and obtained a mean absolute error of 5.5 in predicting the UPDRS score.
As mentioned above, although many studies have aimed at building PD telediagnosis and telemonitoring systems based on vocal features, the ability of vocal features in early telediagnosis of PD has not yet been investigated in the literature. Many studies that propose telediagnosis systems based on speech disorders have reported very high classification rates in discriminating healthy subjects from PD patients [21][22][23][24]. However, in these studies the data used to build the classification model contain the speech recordings of both early and late PD patients with different severities of speech impairment. In a more realistic scenario, the telediagnosis system is expected to be used in suspicious cases by healthy subjects and patients with mild motor system disorders. It has been reported in the literature that speech disorders have the potential to be early indicators of PD. In [28], using disease duration as an index of disease progression, the association between disease duration and various UPDRS subscores was examined, and the findings revealed that the activities of daily living (ADL) subscore and the motor subscore, each including a speech part, are strongly associated with disease duration. In this paper, considering the critical importance of early diagnosis in the treatment of the disease, we investigate the usefulness of vocal features in early telediagnosis of PD using machine learning techniques. We address this problem with a two-step approach. In the first step, the aim is to identify the patient group with comparatively greater severity of speech impairment using the UPDRS score as an index of disease progression [29]. We utilize three supervised learning approaches to determine this patient group. We also apply two unsupervised approaches to validate the results obtained with the classification procedure.
Then, in the second step of our approach, we exclude the samples of this group of patients with severe speech disorders from the dataset and create a new dataset consisting of the samples of PD patients with mild speech disorders and healthy subjects. We feed this dataset to three different classifiers and present the results in detail. Thus, we aim to analyze the usefulness of vocal features in discriminating PD patients with early signs of speech disorders from healthy subjects. The highest accuracy of 96.4% and Matthews Correlation Coefficient of 0.77, obtained using SVM with a third-degree polynomial kernel, show that vocal features are effective in discriminating healthy subjects from PD patients with mild speech disorders and can be used for early telediagnosis of the disease.

Dataset description
In the first step of our approach, we use a Parkinson's Disease (PD) telemonitoring dataset consisting of speech recordings of 42 PD patients, which was collected under the supervision of six U.S. medical centers within the context of Tsanas et al.'s study [8] and is available online at the UCI machine-learning archive [30]. As described in [8], the data were collected remotely at the patient's home and transmitted over the internet. The PD patients had been diagnosed within the previous five years at trial onset; diagnosis required at least two of the following symptoms: rest tremor, bradykinesia or rigidity, without evidence of other forms of parkinsonism. The patients' ages ranged from 36 to 85 years (mean age 64.4 ± 9.24) [8]. A more detailed description of the data collection process can be found in [8]. The feature set extracted from the voice recordings consists of 16 features, which are shown in Table 1 [8,30,31]. The statistical parameters of these features are given in Table 2. The feature set includes several measures of fundamental frequency, several measures of variation in amplitude, noise-to-harmonics and harmonics-to-noise ratios, a nonlinear dynamical complexity measure, a signal fractal scaling exponent, and pitch period entropy. The PD patients were monitored for a six-month period and remained unmedicated for the duration of the study [8]. The voice recordings of the subjects were obtained at weekly intervals for the six-month duration of the study, whereas motor and total UPDRS were assessed only three times by the medical staff: at baseline (onset of trial), and after three and six months. The missing weekly UPDRS estimates corresponding to the weekly voice recordings were obtained using linear interpolation [8]. During the six-month data collection period, six sustained phonations of the vowel /a/ were recorded in each trial, yielding a total of 5875 voice recordings. The motor UPDRS scores of the PD patients monitored in this study range from 5 to 40. The control group dataset used in the second step of our analysis is a part of another PD dataset, which was collected by Little et al. [23] and is also available online at the UCI machine-learning archive [32]. It contains 48 samples belonging to 8 healthy subjects. We refer the reader to [23] for a more detailed description of the dataset containing the healthy subjects.

Table 1.

| Vocal Feature | Description |
| --- | --- |
| Jitter(%) | Average absolute difference between consecutive periods, divided by the average period. |
| Jitter(Abs) | Average absolute difference between consecutive periods, which gives information about the cycle-to-cycle variation of fundamental frequency, given in seconds. |
| Jitter:RAP | Relative Average Perturbation (RAP), which is the average absolute difference between a period and the average of it and its two neighbours, divided by the average period. |
| Jitter:PPQ5 | Five-point Period Perturbation Quotient, computed as the average absolute difference between a period and the average of it and its four closest neighbours, divided by the average period. |
| Jitter:DDP | Average absolute difference between consecutive differences between consecutive periods, divided by the average period. |
| Shimmer | Average absolute difference between the amplitudes of consecutive periods, divided by the average amplitude. |
| Shimmer(dB) | Average absolute base-10 logarithm of the difference between the amplitudes of consecutive periods, multiplied by 20. It gives information about the variability of the peak-to-peak amplitude in decibels. |
| Shimmer:APQ3 | Three-point Amplitude Perturbation Quotient, the average absolute difference between the amplitude of a period and the average of the amplitudes of its neighbours, divided by the average amplitude. |
| Shimmer:APQ5 | Five-point Amplitude Perturbation Quotient, the average absolute difference between the amplitude of a period and the average of the amplitudes of it and its four closest neighbours, divided by the average amplitude. |
| Shimmer:APQ11 | 11-point Amplitude Perturbation Quotient, the average absolute difference between the amplitude of a period and the average of the amplitudes of it and its ten closest neighbours, divided by the average amplitude. |
| Shimmer:DDA | Average absolute difference between consecutive differences between the amplitudes of consecutive periods. |
| Noise to Harmonics Ratio (NHR) | Amplitude of noise relative to tonal components. It quantifies the noise which occurs due to turbulent airflow, resulting from incomplete vocal fold closure in speech pathologies. |
| Harmonics to Noise Ratio (HNR) | Amplitude of tonal relative to noise components. It has the same aim as NHR. |
| Recurrence period density entropy | |

Determination of UPDRS threshold
Binary classification. In our two-step approach, we first aim to determine the optimal motor UPDRS threshold value, that is, the threshold whose two sides can be discriminated with the lowest possible error rate using vocal features. This threshold value represents the level of disease, as determined by UPDRS, after which dysphonia or a change in voice quality becomes apparent. For this purpose, we discretize the UPDRS scores of PD patients into two classes, "Below threshold" and "Above threshold", for various motor UPDRS threshold values. The range of evaluated threshold values is chosen so that each class contains at least 10% of the total number of samples. For each candidate threshold value, we apply a binary classification procedure to discriminate the PD patients with UPDRS values below the threshold, labeled "negative", from those with values above the threshold, labeled "positive" [29].
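The threshold sweep described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the helper names `candidate_thresholds` and `discretize` are hypothetical, thresholds are restricted to integers, a score equal to the threshold is treated as "negative", and the scores are made-up values rather than data from the telemonitoring dataset.

```python
def candidate_thresholds(scores, min_fraction=0.10):
    """Return integer thresholds for which both classes keep >= min_fraction of samples.

    A sample counts as "positive" (above threshold) when score > t.
    Treating score == t as "negative" is an assumption of this sketch.
    """
    n = len(scores)
    lowest, highest = int(min(scores)), int(max(scores))
    kept = []
    for t in range(lowest, highest + 1):
        n_pos = sum(1 for s in scores if s > t)
        n_neg = n - n_pos
        if n_pos >= min_fraction * n and n_neg >= min_fraction * n:
            kept.append(t)
    return kept


def discretize(scores, threshold):
    """Map each motor UPDRS score to 1 (above threshold) or 0 (otherwise)."""
    return [1 if s > threshold else 0 for s in scores]


# Toy motor UPDRS scores, invented for illustration only.
toy_scores = [5, 8, 11, 12, 15, 15, 18, 22, 27, 31, 36, 40]
thresholds = candidate_thresholds(toy_scores)
labels_at_15 = discretize(toy_scores, 15)
print(thresholds[0], thresholds[-1], labels_at_15)
```

Each threshold returned by `candidate_thresholds` defines one binary classification problem; the labels from `discretize` would then be passed to the classifiers.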
The features are fed to Support Vector Machines (SVM), Extreme Learning Machines (ELM), and k-nearest neighbors (k-NN) classifiers for each of the binary classification problems obtained with various UPDRS threshold values. We report the results in terms of both accuracy and the Matthews Correlation Coefficient (MCC). However, because discretizing at a given UPDRS threshold may produce an imbalanced dataset in which one class has many more samples than the other, we use the MCC to determine the maximally predictable UPDRS threshold value. The MCC is a balanced measure that can be used even if the classes are of very different sizes. It takes a value between -1 and +1 and is defined as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP and TN represent the numbers of correctly classified positive and negative examples, respectively, and FP and FN represent the numbers of incorrectly classified positive and negative examples, respectively. The MCC takes the value +1 when the classifier makes perfect predictions, -1 when the predictions and actual values totally disagree, and 0 when the classification is no better than a random prediction.

Principal component analysis visualization. Principal Component Analysis (PCA) is an unsupervised dimensionality reduction method that projects the samples onto a series of orthogonal axes while preserving as much as possible of the variation present in the dataset [33]. The reduced space consists of linear combinations of the interrelated variables. We apply PCA to the Parkinson's disease dataset and project the samples onto a PCA subspace with 3 dimensions. We visualize the distributions of the samples for different UPDRS thresholds to see how well the samples above and below these thresholds are discriminated in this reduced space.
Clustering analysis. We perform an additional cluster analysis to validate the optimal UPDRS threshold results obtained with the binary classification approach. For this purpose, we use spectral clustering method which uses eigen-values of similarity matrix of the data to reduce the dimension as a pre-processing step before clustering [34]. In this method, each sample is represented with several components in the corresponding eigen-vectors of the similarity matrix and this reduced space is fed to another clustering algorithm. In this study, we construct the k-nearest neighbors similarity graph of the samples and feed this similarity matrix to k-means clustering method. We divide the samples into two clusters and analyze the UPDRS scores of the patients in each cluster for various UPDRS threshold values.

Feature ranking based on determined optimal UPDRS threshold
We aim to quantify the relevance of each vocal feature with the discretized UPDRS score using Mutual Information (MI) [35]. Thus, the vocal features which significantly change with respect to the level of motor systems disorders as determined by UPDRS will be identified. The MI approach used in this study is a filter method which aims to rank the features according to their relevance with the target variable without involving any classifier/regressor for evaluation [36][37][38].
Mutual information is a measure of the mutual dependence of two variables that can also capture non-linear relations. It is based on Shannon's entropy [35], which measures the uncertainty of a random variable X and thus quantifies how difficult it is to predict that variable. Shannon's entropy can be written as an expectation:

$$H(X) = -\mathbb{E}[\log p(X)] = -\sum_{x} p(x) \log p(x)$$

where p(x) = P(X = x) is the probability distribution function of X. Hence, Shannon's entropy represents the uncertainty removed after the actual outcome of X is revealed. MI is a measure of the mutual dependence of two variables based on the entropy:

$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$

MI is also the KL divergence of the product P(X)P(Y) of the two marginal probability distributions from the joint probability distribution P(X,Y):

$$I(X;Y) = \sum_{x} \sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

where p(x,y) = P(X = x, Y = y). As described in Materials and methods, the features of the PD dataset are continuous. Therefore, to compute the MI scores between the features and the discretized UPDRS, we discretize the features to 9 discrete levels [36,38]. To discretize each feature, we use its mean μ and its standard deviation σ as in [37]. The feature values between μ−σ/2 and μ+σ/2 are converted to 0. The 4 intervals of size σ to the right of μ+σ/2 are converted to discrete levels from 1 to 4, and the 4 intervals of size σ to the left of μ−σ/2 are mapped to discrete levels from −1 to −4. Very large positive or negative feature values are truncated and discretized to ±4 as appropriate.
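The 9-level discretization and the MI score described above can be illustrated as follows. This is a sketch under stated assumptions rather than the implementation used in the study: the function names are hypothetical, the population standard deviation is used, values exactly on an interval boundary are assigned to the outer level, and MI is reported in bits (base-2 logarithm).

```python
import math
from collections import Counter


def discretize_feature(values):
    """Map continuous values to integer levels -4..4 using mean and std.

    Assumes the feature is not constant (sigma > 0).
    """
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    levels = []
    for v in values:
        z = (v - mu) / sigma
        # |z| < 0.5 -> 0, [0.5, 1.5) -> 1, [1.5, 2.5) -> 2, ..., capped at 4.
        k = min(4, int(math.floor(abs(z) + 0.5)))
        levels.append(k if z >= 0 else -k)
    return levels


def mutual_information(xs, ys):
    """MI in bits between two equally long sequences of discrete values."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


levels = discretize_feature([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
mi_identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
mi_independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(levels, mi_identical, mi_independent)
```

Ranking the features then amounts to computing `mutual_information(discretize_feature(feature_column), labels)` for each feature and sorting the scores in decreasing order.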

Discrimination between healthy subjects and patients with UPDRS below threshold
After determining the optimal UPDRS threshold value that can be discriminated using the vocal features, we exclude the samples of the patients whose motor UPDRS score is above this threshold and create a new dataset consisting of the samples of PD patients whose UPDRS score is below this threshold and the samples of 8 healthy subjects. The aim of this analysis is to evaluate the effectiveness of speech features in discriminating early-stage PD patients from healthy subjects. We use SVM, ELM, and k-NN classifiers and present the accuracy and MCC of each classifier for different parameter values and kernel types.

Determination of optimal UPDRS threshold
Binary classification problem. We first normalize the features of the PD dataset so that each has zero mean and unit variance. Then, the features are fed into SVM, ELM and k-NN classifiers for various motor UPDRS threshold values. We use 70% of the samples for training and the rest for validation. For the k-NN classifier, we use Euclidean, city-block, and correlation distance metrics, varying the number of nearest neighbors (k) from 1 to 9. We present the k-NN results only with city-block distance, since it performed better than or comparably to the other distance metrics. For the SVM classifier, we use the LIBSVM implementation [39] with linear, polynomial, and Radial Basis Function (RBF) kernels, varying the cost (C) parameter from 0.1 to 10 in steps of 0.2, the polynomial degree from 1 to 5, and the kernel width (g) from 0.01 to 1 in steps of 0.02. As seen in Fig 1, depending on the UPDRS threshold value, the PD dataset becomes highly imbalanced, and the SVM classifier tends to label the samples as the majority class to minimize the error on the training set. We use the "class-weight" parameter of LIBSVM, w, to address the class imbalance problem. The class weight parameter of SVM is used to increase the cost of errors made on the samples of the minority class during training. In the implementation of the ELM classifier, we vary the number of hidden neurons from 50 to 200 with sigmoid, sine, RBF, and triangular basis activation functions.
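The normalization, 70/30 split, and city-block k-NN steps described above can be sketched with the standard library alone. This is an illustrative sketch, not the code used in the study: all function names are hypothetical, the data are synthetic two-dimensional points, and the normalization statistics are estimated on the training portion only, which is an assumption (the text does not say whether they were computed on the full dataset).

```python
import math
import random
from collections import Counter


def zscore_fit(rows):
    """Per-feature mean and standard deviation of a list of feature vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    # A constant feature would give std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    return means, stds


def zscore_apply(rows, means, stds):
    """Scale every feature to zero mean and unit variance."""
    return [[(r[j] - means[j]) / stds[j] for j in range(len(r))] for r in rows]


def cityblock(a, b):
    """City-block (L1) distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=7):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: cityblock(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Synthetic two-class data, invented for illustration only.
random.seed(0)
data = [([random.gauss(0, 1), random.gauss(0, 1)], 0) for _ in range(50)]
data += [([random.gauss(5, 1), random.gauss(5, 1)], 1) for _ in range(50)]
random.shuffle(data)

n_train = (7 * len(data)) // 10  # 70% for training, the rest for validation
train, test = data[:n_train], data[n_train:]
means, stds = zscore_fit([x for x, _ in train])
train_x = zscore_apply([x for x, _ in train], means, stds)
test_x = zscore_apply([x for x, _ in test], means, stds)
train_y = [y for _, y in train]

correct = sum(knn_predict(train_x, train_y, q, k=7) == y
              for q, (_, y) in zip(test_x, test))
test_accuracy = correct / len(test)
print(test_accuracy)
```

Using an odd k with two classes avoids tied votes. The SVM and ELM classifiers are not reproduced here because they depend on external implementations.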
We present the classification performance of k-NN using city-block distance, ELM with sigmoid, sine, triangular basis, and RBF activation functions, and SVM with linear and RBF kernels for different UPDRS thresholds. Figs 2-4 show the accuracies and MCC values obtained with the optimal parameter values of the classifiers. As seen in Figs 2 and 4, the accuracies of both the k-NN and ELM classifiers decrease as the UPDRS threshold increases up to around 23 and then begin to increase. The highest overall accuracies (~90%) with both of these classifiers are obtained when the UPDRS threshold is set to 11 and 32. However, as seen in Fig 1, the binary classification problems obtained by setting the UPDRS threshold to these values are highly imbalanced (e.g., when the UPDRS threshold is set to 11, the test set contains 1868 positive examples but only 207 negative examples), and using accuracy on such imbalanced datasets may lead to false inferences regarding the success of classifiers [40,41]. For example, with the class distribution corresponding to a UPDRS threshold of 11, the simple strategy of labeling all test set examples as positive gives an accuracy of 90.02%. Indeed, the accuracy plots of the k-NN and ELM classifiers are not consistent with their MCC plots. The MCCs of k-NN and ELM increase as the UPDRS threshold increases up to around 15 and thereafter show a decreasing trend. These results show that k-NN and ELM tend to label most of the examples as the majority class for the UPDRS thresholds that result in imbalanced datasets. On the other hand, as seen in Fig 3, the accuracy and MCC of the SVM-linear and SVM-RBF classifiers change similarly with respect to the UPDRS threshold value.
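The gap between accuracy and MCC on the imbalanced split quoted above (1868 positive and 207 negative test examples at a UPDRS threshold of 11) can be checked with a short calculation. The sketch below is illustrative: the function names are hypothetical, and returning 0 when the MCC denominator vanishes is a common convention that is assumed here, not something stated in the text.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified examples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Degenerate case (a whole row or column of the confusion matrix is
        # empty): return 0 by convention, which is an assumption of this sketch.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Labeling every test example as positive: all 1868 positives are correct,
# all 207 negatives become false positives.
baseline_acc = accuracy(tp=1868, tn=0, fp=207, fn=0)
baseline_mcc = mcc(tp=1868, tn=0, fp=207, fn=0)
print(round(100 * baseline_acc, 2), baseline_mcc)
```

The trivial classifier reaches about 90% accuracy while its MCC is 0, which is why the MCC is the more informative criterion for choosing the threshold.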
These results show that SVM performs more consistently than k-NN and ELM on imbalanced datasets when the costs of errors made on the training samples of the majority and minority classes are tuned well using its class weight parameter. As seen in Fig 2, k-NN achieves its highest MCC with k = 7 (city-block distance) when the UPDRS threshold is set to 15. Similarly, SVM and ELM show their best performance with the RBF kernel and the sine activation function, respectively, when the UPDRS threshold is set to 15. Fig 5 (left) shows the MCC obtained with the best parameter set of each classifier with respect to the UPDRS threshold. These results show that a UPDRS value of 15 is the optimal threshold value that can be used to monitor the progression of the disease as a classification problem. ELM with the sine activation function gives the highest MCC (0.5219), with a corresponding test set accuracy of 83.70%. The highest MCC obtained with SVM-RBF (0.4421) is higher than that of k-NN (0.3439). The corresponding accuracies of SVM-RBF and k-NN are 77.59% and 78.64%, respectively. Fig 5 also shows the performance of the ensemble classifier obtained by combining the predictions of SVM-RBF and ELM-Sine using the hard-voting combination strategy [42]. While the performance of SVM-RBF improves when combined with ELM-Sine, ELM-Sine individually performs better than or comparably to the ensemble of SVM-RBF and ELM-Sine. On the other hand, the ROC space of the classifiers obtained when the UPDRS threshold is set to 15, shown in Fig 5 (right), indicates that although the MCC of ELM-Sine is higher than that of SVM-RBF, SVM-RBF is more balanced in correctly classifying the positive and negative instances.

Principal component analysis visualization. We apply principal component analysis (PCA) and reduce the dimensionality of the dataset to visually validate the results obtained with the designed binary classification problem. Fig 6 shows the scatter of the PD data on the first three principal components for various UPDRS threshold values. As seen in Fig 6, the projections of the samples belonging to the patients with a UPDRS score lower than 15 form a cluster in a region of the PCA subspace. Although they mostly overlap with the other group, whose UPDRS score is higher than 15, there is a distinct set of samples with UPDRS score above 15 that does not overlap with the other group. However, as the UPDRS threshold increases, the number of overlapping samples from the two groups increases. Fig 6 also shows that when the UPDRS threshold is set to 25, neither of the groups forms a separate cluster in any specific region of the PCA subspace. These results validate the findings obtained with the binary classification problem.
Clustering analysis results. We apply spectral clustering with the settings described in Materials and methods to divide the dataset into two groups in an unsupervised manner. The aim is to compare the UPDRS values of the patients that fall into these two groups. For this purpose, we divide the samples into two clusters by constructing the k-nearest neighbors similarity graph of the samples and feeding this similarity matrix to k-means clustering method. Then, we analyze the UPDRS scores of the patients in each cluster for various UPDRS threshold values.
The clustering algorithm groups 4307 of the speech recordings in cluster 1 and the rest (1586 recordings) in cluster 2. After obtaining the cluster indexes, for each cluster we compute the ratio of the number of patients whose UPDRS score is below the corresponding threshold to the number of all patients in that cluster. Fig 7 shows the absolute difference between the ratios of cluster 1 and cluster 2 for various UPDRS thresholds. It is seen that the highest difference is obtained for UPDRS threshold value of 17 which is very close to the UPDRS threshold of 15 determined with binary classification problem and PCA analysis.
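The per-threshold comparison of the two clusters described above can be sketched as follows. This is a minimal illustration with hypothetical function names and invented scores and cluster assignments; it counts samples rather than patients, which is a simplifying assumption, and it does not reproduce the spectral clustering step itself.

```python
def below_ratio(scores, threshold):
    """Fraction of scores that fall below the threshold."""
    return sum(1 for s in scores if s < threshold) / len(scores)


def ratio_gap(scores, cluster_ids, threshold):
    """Absolute difference between the below-threshold ratios of clusters 1 and 2."""
    cluster_1 = [s for s, c in zip(scores, cluster_ids) if c == 1]
    cluster_2 = [s for s, c in zip(scores, cluster_ids) if c == 2]
    return abs(below_ratio(cluster_1, threshold) - below_ratio(cluster_2, threshold))


# Toy motor UPDRS scores and cluster indexes, invented for illustration only.
toy_scores = [6, 9, 12, 14, 16, 19, 24, 28, 33, 38, 30]
toy_clusters = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1]

gaps = {t: ratio_gap(toy_scores, toy_clusters, t) for t in range(10, 35)}
# max() returns the first threshold that attains the largest gap.
best_threshold = max(gaps, key=gaps.get)
print(best_threshold, gaps[best_threshold])
```

A large gap at some threshold means that the unsupervised grouping lines up with the split of the patients at that UPDRS value.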

Feature ranking based on optimal UPDRS threshold
After determining the optimal UPDRS threshold value and converting the UPDRS prediction problem to a binary classification problem using this threshold, we quantify the relevance of each vocal feature to the discretized UPDRS score in order to reveal which vocal features are related to the severity indicated by the UPDRS score. For this purpose, we calculate the Mutual Information (MI) between each vocal feature and the discretized UPDRS value.
The ranking of the vocal features based on their MI score with the target variable is shown in Table 3. As seen, Detrended Fluctuation Analysis (DFA), which represents the signal fractal scaling exponent, is the most effective feature in discriminating the patients with severe motor system disorders from those with relatively less severe motor system disorders. DFA is followed by Pitch Period Entropy (PPE), Recurrence Period Density Entropy (RPDE), and Harmonics to Noise Ratio (HNR), respectively. We should note that PPE was found to be one of the most relevant vocal features in discriminating healthy subjects from PD patients in [21], on a PD dataset consisting of speech recordings of 23 patients and 8 healthy subjects. However, the findings in [21] show that Jitter:DDP, Shimmer, Shimmer(dB), and Shimmer:APQ5 are more important in the healthy subject/PD patient discrimination problem than DFA, RPDE, and HNR.

Table 3. Ranking of the vocal features based on their mutual information with UPDRS level discretized according to the determined optimal threshold that can be discriminated by machine learning methods. Columns: Ranking, Dysphonia Measurement, MI Score.

Discrimination between healthy subjects and patients with UPDRS below threshold
The supervised learning problem (binary classification) was designed to determine the UPDRS value after which speech disorders begin to emerge. The results of the unsupervised approaches (PCA and spectral clustering) have validated the determined UPDRS threshold value. In the second step, we exclude the samples of the patients whose motor UPDRS score is above this threshold and include the samples of healthy subjects. Thus, we create a new dataset consisting of all speech recordings of the healthy subjects and 1607 speech recordings of the PD patients. Then, we apply k-NN, SVM and ELM classifiers with various settings, using 70% of the dataset for training and the rest for validation, to evaluate the effectiveness of vocal features in discriminating early-stage PD patients from healthy subjects. For statistical significance, this procedure is repeated 100 times with random training/test partitions. We present the average and standard deviation of the accuracies of these runs for each classifier in Table 4. The best results obtained with k-NN, SVM and ELM are given in Table 5, along with the p-values of the paired t-test for each pair of classifiers. As seen, the accuracy (96.43%) and MCC (0.77) of SVM with a third-degree polynomial kernel are significantly higher than those of the other methods. It is also seen that correlation-distance-based k-NN significantly outperformed ELM in terms of MCC, which shows that it produced more balanced success rates on positive and negative instances. On the other hand, the lowest accuracy is obtained with the linear-kernel SVM, which is the only linear classifier used in this study. We should also note that healthy subjects and patients with a UPDRS score lower than 15 are better discriminated with vocal features than patients with UPDRS scores below and above the determined threshold are discriminated from each other.
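For readers who want to see how a paired comparison over repeated random partitions is formed, the sketch below computes the paired t statistic from per-run accuracies of two classifiers evaluated on the same partitions. It is only an illustration: the function name and the accuracy values are hypothetical, and the p-value lookup (which requires the Student t distribution with n - 1 degrees of freedom) is left out because it is not available in the Python standard library.

```python
import math
import statistics


def paired_t_statistic(acc_a, acc_b):
    """Paired t statistic and degrees of freedom for two matched accuracy lists.

    Assumes the per-run differences are not all identical (nonzero spread).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Invented per-run accuracies of two classifiers on the same three partitions.
runs_a = [0.90, 0.80, 0.85]
runs_b = [0.80, 0.75, 0.80]
t_stat, dof = paired_t_statistic(runs_a, runs_b)
print(t_stat, dof)
```

With 100 repetitions, as in the procedure described above, each list would hold 100 accuracies and the statistic would be compared against a t distribution with 99 degrees of freedom.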

Conclusions
Considering that PD mostly affects elderly people, for whom physical visits to the clinic are inconvenient and costly, there is increasing motivation to develop PD telemonitoring and telediagnosis systems that are self-administered and do not require the patient to visit the clinic. Since vocal impairments are among the most commonly seen PD signs in the early stages of the disease, PD telediagnosis and telemonitoring systems based on speech tests yield reliable diagnosis and motor UPDRS tracking. However, the patient data used in the existing telediagnosis systems include speech recordings of not only early PD patients with mild speech impairments but also PD patients with moderate and severe speech impairments who already suffer from some other symptoms and presumably have been diagnosed before. In this paper, we aim to assess the effectiveness of vocal features for early telediagnosis of PD in a more realistic scenario. First, as a preprocessing step, we determine the group of patients with relatively greater speech impairments using the Unified Parkinson's Disease Rating Scale (UPDRS) as an index of disease progression. For this purpose, we discretize the UPDRS scores of PD patients into two classes, "Below threshold" and "Above threshold", for various motor UPDRS threshold values, and for each case apply a binary classification procedure to discriminate the PD patients having UPDRS values below the candidate threshold, labeled "negative", from those above it, labeled "positive". The UPDRS value resulting in the highest classification performance is chosen as the UPDRS threshold value, after which speech disorders are more pronounced in the patients. We validate the determined threshold value with two unsupervised approaches: principal component analysis (PCA) and spectral clustering.
The experimental results show that speech disorders are more significantly seen in the PD patients whose UPDRS exceeds 15.
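The threshold search described above can be sketched as a sweep over candidate cut-offs. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the data are a made-up toy example with a single feature, a nearest-class-mean rule scored on the same data stands in for the k-NN, SVM and ELM classifiers, and scores exactly equal to the cut-off are assigned to the "below" class. All function names are hypothetical.

```python
import math

def discretize(updrs_scores, threshold):
    """Label a recording 1 ("above threshold") or 0 ("below threshold")."""
    return [1 if s > threshold else 0 for s in updrs_scores]

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def nearest_mean_mcc(feature, labels):
    """Stand-in evaluator: nearest-class-mean rule, scored on the same data."""
    pos = [x for x, y in zip(feature, labels) if y == 1]
    neg = [x for x, y in zip(feature, labels) if y == 0]
    if not pos or not neg:          # a cut-off that leaves one class empty
        return 0.0
    mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
    preds = [1 if abs(x - mu_pos) < abs(x - mu_neg) else 0 for x in feature]
    return mcc(labels, preds)

def sweep_thresholds(feature, updrs_scores, candidates):
    """Return (best_threshold, {threshold: MCC}) over the candidate cut-offs."""
    scores = {t: nearest_mean_mcc(feature, discretize(updrs_scores, t))
              for t in candidates}
    return max(scores, key=scores.get), scores

# Toy data (hypothetical): the feature jumps once the score exceeds 15.
updrs = [5, 8, 10, 12, 14, 15, 17, 20, 24, 30, 35, 40]
feature = [0.50, 0.52, 0.49, 0.51, 0.53, 0.50, 0.70, 0.72, 0.69, 0.71, 0.73, 0.70]
best, scores = sweep_thresholds(feature, updrs, candidates=[10, 15, 20, 25])
```

On this toy example the sweep selects the cut-off of 15, because it is the only candidate for which the two classes are separated perfectly. In practice the evaluator would be a classifier scored on held-out data rather than on its own training set.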
Considering that the motor UPDRS ranges from 0 to 108, the relatively low UPDRS threshold of 15 shows that vocal impairments can be used as early indicators of the disease. The highest Matthew's correlation coefficient (MCC) is obtained using support vector machines (SVM) with a radial basis function (RBF) kernel, which also gives higher MCC values than k-nearest neighbors (k-NN) and SVM with a linear kernel for all UPDRS threshold values. Besides, we should also note that SVM performs more consistently than the k-NN and extreme learning machine (ELM) classifiers on the PD dataset with imbalanced class distribution when the costs of errors made on the training samples of the majority and minority classes are tuned well using its class weight parameter. The mutual-information-based filter feature ranking analysis shows that the nonlinear measures detrended fluctuation analysis and pitch period entropy are the most effective speech features in discriminating the patients with severe motor system disorders from those whose motor system disorders are relatively less severe. The visual inspection presented using PCA also shows that simple lines or hyperplanes cannot discriminate the two groups from each other. These results strongly indicate the nonlinear nature of the problem, which therefore needs to be solved by a nonlinear model. In the second step, to address the main goal of this paper, we exclude the speech recordings of the PD patients whose UPDRS score is higher than the threshold determined in the first step and create a new dataset consisting of the samples of PD patients whose UPDRS is below the determined threshold value and healthy subjects. Thus, we assess the PD telediagnosis ability of vocal features in a more realistic scenario for clinical use. We feed this dataset into three classifiers and present the detailed results. For best generalization, the complexity of the classifier should match the complexity of the function underlying the data [43].
Models that are more complex than the underlying function can lead to overfitting, in which the model identifies random noise in the data rather than a true signal of clinical use (Sachs, 2015), whereas models that are less complex than the function can lead to underfitting. Therefore, to evaluate the generalization ability of the classifiers, hyperparameters such as the number of nearest neighbors of k-NN or the degree and cost of the polynomial-kernel SVM should be optimized on a separate set that has not been used during training. For this purpose, we train three classifiers with various settings using 70% of the dataset for training and use the rest for validation to evaluate the effectiveness of vocal features in discriminating early-stage PD patients from healthy subjects. This procedure is repeated 100 times with random training/test partitions, and a paired t-test is applied to test the statistical significance of the results. The highest accuracy of 96.4% and Matthew's Correlation Coefficient of 0.77 are obtained using SVM with a third-degree polynomial kernel. Degrees lower and higher than this optimal value yield lower accuracy owing to underfitting and overfitting, respectively. These results show that speech features are effective in discriminating PD patients with mild speech impairment from healthy subjects and can be used in a decision support system for early telediagnosis of the disease. The success of SVM with nonlinear kernels in the PD classification problem is not surprising, as it has already been shown in the literature [21,23,44]. Considering that many of the speech signals are noisy [23], we can conclude that, as shown on related audio processing problems from different domains [45][46][47][48], SVM with a non-linear kernel produces more generalizable models that are robust to noise and outliers compared to many classification algorithms such as those used in this study.
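The repeated hold-out protocol and the paired comparison described above can be sketched with the Python standard library alone. Everything below is an illustrative assumption rather than the code used in this study: the data are a toy one-dimensional example, and a 1-nearest-neighbour rule and a majority-class baseline stand in for the actual classifiers. Each classifier is scored on the same random 70/30 partitions so that the per-run accuracies are paired.

```python
import math
import random
import statistics

def random_split(n_samples, train_fraction, rng):
    """Shuffle the sample indices and cut them into train and test parts."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    cut = round(train_fraction * n_samples)
    return idx[:cut], idx[cut:]

def repeated_holdout(X, y, classifiers, n_repeats=100, train_fraction=0.7, seed=0):
    """Score every classifier on the same random train/test partitions."""
    rng = random.Random(seed)
    accs = {name: [] for name in classifiers}
    for _ in range(n_repeats):
        train, test = random_split(len(X), train_fraction, rng)
        for name, fit in classifiers.items():
            predict = fit([X[i] for i in train], [y[i] for i in train])
            hits = sum(predict(X[i]) == y[i] for i in test)
            accs[name].append(hits / len(test))
    return accs

def paired_t_statistic(a, b):
    """t statistic of the paired differences a[i] - b[i]."""
    diffs = [x - z for x, z in zip(a, b)]
    mean_d, sd_d = statistics.mean(diffs), statistics.stdev(diffs)
    if sd_d == 0:
        return 0.0 if mean_d == 0 else math.copysign(math.inf, mean_d)
    return mean_d / (sd_d / math.sqrt(len(diffs)))

def fit_1nn(X_train, y_train):
    """Stand-in classifier: 1-nearest-neighbour on a single numeric feature."""
    pairs = list(zip(X_train, y_train))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_majority(X_train, y_train):
    """Baseline that always predicts the most frequent training label."""
    label = max(set(y_train), key=y_train.count)
    return lambda x: label

# Toy data (hypothetical): class 0 below 2.0, class 1 from 2.0 upward.
X = [i / 10 for i in range(40)]
y = [0] * 20 + [1] * 20
accs = repeated_holdout(X, y, {"1-NN": fit_1nn, "majority": fit_majority})
summary = {name: (statistics.mean(a), statistics.stdev(a)) for name, a in accs.items()}
t_stat = paired_t_statistic(accs["1-NN"], accs["majority"])
```

The sketch stops at the t statistic. The corresponding p-value follows from a Student's t distribution with one fewer degrees of freedom than the number of runs, which a library routine such as `scipy.stats.ttest_rel` reports directly.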
On the other hand, we should note that the lowest accuracy is obtained with linear-kernel SVM, which is the only linear classifier used in this study. This is mainly because a linear method overlooks the non-linear relationships between the vocal features in UPDRS prediction.
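Mutual information, on which the filter feature ranking of the first step is based, measures any statistical dependence between a feature and the class label, not only linear dependence. The sketch below shows one simple way to compute such a ranking. It assumes equal-width binning of each continuous feature, which may differ from the estimator used in this study, and the feature names and values are hypothetical.

```python
import math
from collections import Counter

def equal_width_bins(values, n_bins):
    """Map continuous values to integer bin indices 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(xs, ys):
    """I(X;Y) in bits for two equally long sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def rank_features(features, labels, n_bins=4):
    """Sort feature names by decreasing mutual information with the labels."""
    scores = {name: mutual_information(equal_width_bins(vals, n_bins), labels)
              for name, vals in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (hypothetical): 0 = below threshold, 1 = above threshold.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "feature_a": [0.10, 0.20, 0.15, 0.12, 0.90, 0.80, 0.85, 0.95],  # tracks the label
    "feature_b": [0.10, 0.90, 0.10, 0.90, 0.10, 0.90, 0.10, 0.90],  # independent of it
}
ranking = rank_features(features, labels)
```

In this toy example `feature_a` determines the label completely and receives 1 bit, the entropy of a balanced binary label, while `feature_b` is independent of the label and receives 0 bits.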
We should note that using the motor UPDRS score as the index of disease progression, instead of more specific UPDRS subscores representing the severity of speech and other disorders, is a limitation of this study. As a future research direction, a dataset containing UPDRS subscores may be collected and used as the index of disease progression to better identify the patient group having mild motor system disorders.