Effective Dysphonia Detection Using Feature Dimension Reduction and Kernel Density Estimation for Patients with Parkinson’s Disease

Detection of dysphonia is useful for monitoring the progression of phonatory impairment for patients with Parkinson’s disease (PD), and also helps assess the disease severity. This paper describes the statistical pattern analysis methods to study different vocal measurements of sustained phonations. The feature dimension reduction procedure was implemented by using the sequential forward selection (SFS) and kernel principal component analysis (KPCA) methods. Four selected vocal measures were projected by the KPCA onto the bivariate feature space, in which the class-conditional feature densities can be approximated with the nonparametric kernel density estimation technique. In the vocal pattern classification experiments, Fisher’s linear discriminant analysis (FLDA) was applied to perform the linear classification of voice records for healthy control subjects and PD patients, and the maximum a posteriori (MAP) decision rule and support vector machine (SVM) with radial basis function kernels were employed for the nonlinear classification tasks. Based on the KPCA-mapped feature densities, the MAP classifier successfully distinguished 91.8% voice records, with a sensitivity rate of 0.986, a specificity rate of 0.708, and an area value of 0.94 under the receiver operating characteristic (ROC) curve. The diagnostic performance provided by the MAP classifier was superior to those of the FLDA and SVM classifiers. In addition, the classification results indicated that gender is insensitive to dysphonia detection, and the sustained phonations of PD patients with minimal functional disability are more difficult to be correctly identified.


Introduction
Dysphonia is a type of phonation disorder with an impairment in the ability to produce normal voice sounds [1]. Manifestation of dysphonic voice is characterized by hoarseness or weakness in phonation [2]. As the functional causes of dysphonia, neurological disorders sometimes make neurogenic interruptions in the laryngeal nerve paths that could interfere in normal vibration of vocal folds during exhalation [3]. Dysphonia is detrimental to quality of life, because the speech impaired patient often encounters difficulty in personal communication that leads to depression and further social handicap [4]. A large number of patients with idiopathic Parkinson's disease (PD) suffer from dysprosody and dysarthria [5]. According to the survey of Hartelius and Svensson [6], over 70% of the PD patients experienced speech deficit and voice impairment after the onset of their disease, and only 3% of the patients had received speech therapy. Ho et al. [5] utilized the clinical-perceptual method to study the speech difficulties in PD. They sampled the two-minute conversational speech of 200 PD patients, and examined the speech deficit profiles (i.e., voice, articulation, and fluency) [5]. Their study showed that voice was the leading deficit in the initial stage of PD, and articulatory and fluency deficits manifested in the severe stage of PD [5].
Quantitative measures of speech impairment could help assess the severity levels of speech impairment in PD patients and study the specific impaired speech parameters [7]. The simultaneous qualitative and quantitative investigations are able to characterize the exaggerated vocal tremor, weak voice, roughness, and other dysphonic symptoms in idiopathic PD, which sometimes would be confused with spasmodic dysphonia (a laryngeal abnormality characterized by spasm of the vocal cord) in clinical diagnosis [8].
Recently, telemedicine systems with advanced network access have been effectively used for remote monitoring of patients with vocal impairment [9]. The telemedicine technology provides relatively low-cost clinical monitoring solutions that help reduce frequent physical visits for patients [10]. As suggested by Little et al. [10], such telemedicine systems call for more reliable clinical tools and speech measurements for accurate detection and monitoring of vocal symptoms in PD.
A number of novel speech measurement methods have been developed to assess dysphonic symptoms in the last decade [2,3,[11][12][13]. The purpose of such speech measurements is to characterize the features of the acoustic signals associated with phonation disorders. Impairment of vocal folds often causes the irregular movement in one or both sides of the glottis that leads to pathological vibration patterns, such as pitch frequency fluctuations, changes of airflow volume, and amplitude alteration [2]. Thus, dysphonia is often observed in the production of vowel sounds. The fundamental frequency (F0) in vowels, mean of F0, variation of F0 (jitter), the variation of speech amplitude (shimmer), intensity from one vocal cycle to another are the most frequently used electroglottographic measures in standard speech tests [14][15][16][17]. Zwirner and Barnes [14] reported that the standard deviation of F0 in prolonged vowels is much larger for PD patients compared with healthy control subjects. The study of Hertrich and Ackermann [15] indicated increased jitter and higher mean of F0 in prolonged vowels for PD patients. Goberman et al. [16,17] compared the acoustic-phonatory characteristics of Parkinsonian speakers before and after taking medication, and found greater F0 variability and decreased intensity range (difference between the loudest and quietest prolongations) in the disfluent speech of PD patients. Cnockaert et al. [18] used the wavelet analysis technique to extract the phonatory frequency trace and low-frequency vocal modulation in sustained vowels. Their study suggested that the average phonatory frequency is significantly higher for male subjects with PD, and the modulation amplitude is significantly larger for female PD patients [18]. Recently, nonlinear dynamics analysis tools have been utilized to study the vocal oscillation patterns in Parkinsonian phonatory impairment. Rahn et al. [19] employed the phase space reconstruction and correlation dimension methods to measure the perturbation nature in the aperiodic voices of idiopathic PD. Their results showed that the correlation dimension values are significantly higher in PD patients than those of control subjects, which implies an increased complexity of phonatory signals in PD vocal pathology [19]. In order to overcome the limitations of vocal perturbation measurement in the time scale and Fourier transform in the frequency domain, Little et al. [13] proposed the time-delay state-space recurrence analysis and the fractal scaling analysis to exploit the nonlinearity in pathological phonation voices. The detrended fluctuation analysis and quadratic discriminant analysis methods were used to investigate the self-similarity properties (in terms of scaling exponent) of vocal fluctuations associated with PD [10,13]. In the present work, we aim to study the mutual correlations among the different parameters in vocal measurements, and also to develop an effective method for the vocal pattern classification. It is hypothesized that the most informative features with regard to fundamental frequency, amplitude variability, and dynamics in vocal fluctuations could be properly selected and used in the nonlinear analysis methods for the accurate classification of PD patterns.

Dataset Preparation
The data set used in this study was donated by Little et al. [10], and is also online available via University of California at Irvine (UCI) machine learning repository [20]. The phonation data contain 195 sustained vowel records uttered by total 31 subjects. There were 8 healthy control subjects (3 males and 5 females), with the averaged age of 60.2 years (standard deviation: 8.6 years), participating in the speech tests. The PD patients included 16 males and 7 females (mean and standard deviation of age: 67.869.7 years). The disease stage of each PD patient was assessed with the Hoehn and Yahr (H&Y) scale [21], a widely used PD progression rating method in clinical practice. Figure 1 shows the detailed numbers of the PD patients with different H&Y stages. It can be observed that majority (82.6%) of the PD patients underwent the intermediate course of the disease (1,H&Y,3.5).
The vocal vowels were recorded using a head-mounted microphone positioned at 8 cm from the lips. The microphone was calibrated with a Class 1 sound level meter (Brüel & Kjaer Type 2238 Mediator) placed 30 cm in front of each subject [10]. The acoustic signals were digitized with the resolution of 16 bit and the sampling rate of 44.1 kHz, and the signal samples were normalized in amplitude [10]. Little et al. [10] implemented the Kay Pentax multidimensional voice program (MDVP) to measure the perturbations in the sustained vowel records. Such perturbation measures include the period (jitter) and amplitude (shimmer) perturbations and harmonics-to-noise (and noise-to-harmonics) ratios. They also computed the nonlinear parameters such as correlation dimension (D2), recurrence period density entropy (RPDE), detrended fluctuation analysis (DFA), and pitch period entropy (PPE) [10]. Based on the vocal perturbation and nonlinear measures, the standard support vector machine (SVM) was applied to distinguish the normal and pathological vocal patterns [10]. The data set also contained two nonlinear measures of fundamental frequency variation: Spread1 and Spread2. All the subjects recruited in the experiments of Little et al. [10] provide their written informed consent as supervised by University of Oxford, United Kingdom, and U.S. National Center for Voice and Speech, Denver, Colorado. The data analysis methodology documents of this study were approved by the Institutional Review Board of Xiamen University. In particular, the Shimmer: DDA and Shimmer: APQ3 measures exhibit completely collinear relationship (the correlation coefficient equal to 1). Little et al. [10] first normalized the feature values in the numerical range [21,1] to ameliorate the classification performance of support vector machine (SVM). Then they searched through the pairs of highly correlated measures (the correlation coefficient larger than 0.95), and removed an arbitrary measure in each pair [10]. The correlation filtering procedure excluded the following measures: MDVP: Jitter (%), MDVP: RAP, MDVP: PPQ, MDVP: Shimmer, MDVP: Shimmer (dB), Shimmer: APQ3, Shimmer: APQ5, with the remaining ten measures for further SVM classification [10].

Feature Analysis
In this investigation, we performed the sequential forward selection (SFS) method [22] to select the dominant measures and exclude the similar measures that contribute the redundant information. The logistic regression [23] was employed in the SFS selection process to evaluate the performance. The SFS is a greedy searching algorithm that starts from an empty feature set, and then sequentially adds and combines with the features to maximize the logistic regression performance. The feature set obtained by the SFS method included: MDVP: F0, MDVP: Jitter (%), DFA, Spread2.
We also implemented the kernel principal component analysis (KPCA) to project the SFS features onto the two-dimensional mapping space. The PCA makes the orthogonal transformation to convert multivariate measures into some linearly uncorrelated principal components. The KPCA is a kernel-based extension of PCA method that conducts a nonlinear feature mapping in the kernel Hilbert space [24]. In the present work, the KPCA was performed with the polynomial kernel which can be expressed as [23]: where x represents the vector of SFS-selected vocal measures, d denotes the polynomial order, and t is the intercept. We searched the polynomial kernel parameters in the range from 1 to 10, and chose d~3 and t~1, which could make a maximum Euclidean distance of vocal patterns between the healthy control and PD subject groups. Feature density estimation. In the present work, we used the nonparametric kernel density estimation technique in order to provide the vocal pattern distribution of the KPCA-mapped features. The principle of the kernel density estimation uses the finite observed pattern scatters to approximate the nature of class distributions. Let the vector m~m 1 ,m 2 ½ T denote the KPCA feature set, where m 1 and m 2 are used to express the first and second principal components. The class label of the vocal pattern is represented as v, with v CO and v PD denoting the groups of healthy controls and PD patients, respectively. Based on the N vocal patterns from a particular subject group, the kernel density estimation method can approximate the class-conditional feature density with kernel functions as [25] The bivariate Gaussian kernel function is presented as where m n~mn 1 ,m n 2 Â Ã T indicates the center location of the n-th vocal pattern in the subject group v.
Because the KPCA produces the orthogonal projection for the principal components, the correlation coefficient between the first and second principal components is equal to zero. Hence, the covariance matrix that determines the spread width of the Gaussian kernel function is a diagonal matrix, written as S~diag ls 2 1 ,ls 2 2 À Á , in which s 2 1~1 :37|10 13 and s 2 2~2 :55|10 9 are the variances of first and second principal components, respectively. The scaling factor l is used to coordinate the spread area of the Gaussian kernel function in the KPCA feature space. We searched the scaling factor in the numerical range [0, 1] with an increment of 0.01, and selected l~0:1 that could make the best contour resolution of estimated KPCA feature densities in the present study. Regulated by the scaling factor parameter l, the 2by-2 diagonal covariance matrix became S~diag 1:37| ð 10 12 ,2:55|10 8 Þ. In the computer experiment, the covariance matrix was unique for both groups of healthy controls and PD patients.

Vocal Pattern Classification
With the estimated class-conditional densities of KPCA features, we employed the maximum a posteriori (MAP) rule (also referred to as Bayes decision rule) [26], to perform the classification of vocal patterns. In the present work, a posteriori probability P(vjm) indicates the possibility of a vocal pattern with its observed feature vector m belonging to either healthy control or PD voice record group v. According to Bayes formula, a posteriori probability P(vjm) can be computed from the class-conditional probability density p(m v j ) as where the class-dependent a priori probability P(v) presents the possible occurrence of a particular vocal record group. The MAP classifier recognizes the observed vocal pattern m belonging to the PD record if its a posteriori probability P(v PD m j )wP(v CO m), vice versa.
In addition to the MAP classifier, we also implemented the SVM for classification performance comparison. The SVM is a kernel-based artificial neural network, which trains the network parameters to minimize the structural risk [27]. The SVM is able to perform the same function as the multilayer neural network (such as multilayer perceptron or radial basis function network), by  j j j choosing the corresponding nonlinear inner-product kernels. During the SVM parameter optimization, the training data which geometrically locate close to the decision boundary will be selected as the support vectors, which are considered to be informative for the classification. The SVM learning can be formulated as the following constrained quadratic programming problem with respect to the convex cost function [27][28][29]: where w and e are the weight and error vectors, c is a positive real constant, and f l denotes the KPCA-mapped feature vector of the l-th vocal pattern. In the present work, we compared the performance of the SVM with polynomial, sigmoid, and radial basis function (RBF) kernels, and then chose the standard RBF the support vectors. We also employed the Fisher's linear discriminant analysis (FLDA) to perform the binary classification of vocal patterns. The FLDA does not require the assumptions that the patterns of different groups are with the normal distributions or equal class covariances. The FLDA searches the parameter vector w that maximizes the class separability in the feature space to perform the linear discriminant as [23,30] arg max where T is the within-class scatter matrix as the sum of intra-class variances, and S B~( u CO {u PD )(u CO {u PD ) T is the between-class scatter matrix.
In the present study, we implemented the 5-fold cross-validation method to evaluate the classification performance for each classifiers. The whole data set was partitioned into five disjoint subsets. Four subsets were used to train the classifiers, and the remaining subset was used for testing. The procedure was repeated for five trials, each time using a different subset for validation.

Results
The SFS method selects the MDVP: F0, MDVP: Jitter (%), DFA, and Spread2 as dominant measurements. It can be observed from Table 1 that four vocal measurements possess different metric range. For example, the difference of mean values between CO and PD records for MDVP: F0 is with a much larger order of magnitudes than that for MDVP: Jitter (%), although both of these two measurements present the vocal perturbations in fundamental frequency. In previous work of of Little et al. [10], all of these vocal measurements were normalized to diminish the influence of In our experiments, the KPCA was applied to reduce the feature dimensions by projecting four vocal measurements onto a bivariate space, in which the vocal patterns with the KPCAmapped features also exhibit distinct scatter distributions. Figure 3 provides the estimated vocal pattern densities of CO and PD groups in the KPCA-mapped feature space. In accordance with the scatters located in Fig. 4, the aggregation area of vocal patterns associated with PD patients shows a high density in red. On the other hand, the vocal patterns of CO group possess multimodal density characteristics. As depicted in Fig. 4, majority of CO vocal patterns (30 records) disperses at the bottom left side of the feature space. In addition, about a quarter number of CO vocal patterns (18 records) converges as a small cluster in the lower right corner (see the high density area in blue color). The estimated feature densities make the vocal pattern distribution visible in the bivariate space. It is worth noting from Fig. 5B that the MAP classifier outperforms the other two classifiers with higher degrees of accuracy, area under ROC curve, and sensitivity. Such results imply that the MAP classifier has the superiority in recognition of PD voice records over the SVM and FLDA. On the other hand, the SVM classifier produces a higher specificity rate than either the MAP or FLDA, which indicates some merits for categorization of CO voice records. In general, the nonlinear classification (by means of the MAP or the SVM) is better than the linear classification (by means of the FLDA). The FLDA does not achieve the results obtained by either of the nonlinear classifiers, in any of classification evaluation criteria (i.e., accuracy, area under ROC curve, specificity, and specificity). Table 2 lists the detailed subject information related to the voice records misclassified by the SVM and MAP classifiers. Only one PD patient's speech was not accurately identified by the MAP classifier. Noting that the PD patient is with minimal functional disability (with H&Y stage 1), such a misclassification could be tolerated in clinical applications. A number of the voice records misclassified by the MAP classifier were spoken by the subjects S43, S49, and S50. Someone may suppose whether the MAP classification tends to be subject-dependent. But we observe that the same records are misclassified by the SVM classifier too. Such results, in our opinion, are not persuadable to infer that the MAP classification is subject-dependent. For the SVM classification, on the other hand, the voice records of more subjects are not correctly  detected. Some subjects are with mild to moderate disability (with H&Y stages 2 through 3), which implies the weakness of the SVM classifier for pathological voice detection. We assume that the limited size of data set could be one possible cause. The SVM classifier needs to select informative support vectors to construct the decision boundary, therefore a small number of phonation data would result in a bias of the decision making. In addition, the number of male subjects is not significantly different from that of female subjects in either of two misclassification lists, which indicates that the gender is insensitive in the pathological voice detection for Parkinson's disease.

Discussion
The selected features: MDVP: F0, MDVP: Jitter (%), DFA, and Spread2 provided the useful information about pathological voice in different clinical aspects. The fundamental frequency F0 quantifies the vibration frequency of the vocal folds. The period perturbation measure jitter corresponds to the cycle-to-cycle variation in fundamental frequency. The interruptions caused by Parkinson's disease in the nervous paths could result in neurogenic paralysis of the recurrent laryngeal nerves, the superior laryngeal nerves, or the vagus nerves. The irregular vibration of the vocal folds would change the mean of F0, frequency variability (jitter), and speech amplitude, which could be measured in the phonation course of a sustained vowel. On the other hand, the DFA is used to describe the stochastic self-similarity properties of the noise caused by turbulent airflow in the vocal tract. Breathiness and other dysphonic voice caused by incomplete vocal fold closure would lead to an increase of the DFA value [9]. The nonlinear dynamical complexity parameter Spread2 can also characterize the extent of turbulent noise in the acoustic signal [10]. The Spread2 value is quite strongly associated with the dysphonia response. The present study demonstrated the predominant contributions of these four features to the analysis of PD vocal patterns. Figure 6 plots the scattered vocal patterns with pairs of the selected features. It is worth noting that the vocal patterns associated with the healthy controls and PD patients are still overlapping in the two-dimensional feature spaces. Among these combinations of the selected feature pairs, the feature pairs of MDVP: F0-DFA and DFA-Spread2 could provide relatively better separable pattern distributions in Fig. 6B and 6F. We validated the performance of the MAP classifier with these two feature pairs, using the 5-fold cross-validation method. The accurate classification rates were 85.1% (ROC area: 0.9) and 85.6% (ROC area: 0.93) for the MDVP: F0-DFA and DFA-Spread2 feature pairs respectively, which were worse than the results obtained with the KPCA-based features. It is clear that the KPCA method can project the selected vocal features, with the nonlinear kernels, onto the visible bivariate space toward superior separability and decision interpretation.
The present study does not require the input data normalization procedure. Little et al. [10] implemented the rescaling of feature values in the numerical range from 21 to 1, with the motivation of improving the SVM classification performance. Such data preprocessing, in our opinion, may cause some obstacles in data analysis. First, the rescaling or normalization is not robust for the data set of small size (the total number of voice records lower than 200 in the data set). Additional recruited voice records which exceed the current extreme of feature values would require another rescaling, such that the SVM classifier should be retrained which consumes much more computation time. On the other hand, the physical magnitude information about the voice measurements would be lost after the data normalization. It is therefore not convenient for medical experts to use the data located around the discriminant boundary, for example the support vectors, as the important indicators for screening of pathological voice records. In addition, without the data normalization, the MAP classifier is able to achieve the overall accuracy of 91.8%, which is better than the previous related work (91.4% accuracy obtained by the SVM with ten normalized features) of Little et al. [10], and also comparable to the results (92.8% accuracy) performed by the SVM with bootstrap resampling of data in the work of Sakar and Kursun [3].

Conclusion
Effective dysphonia detection provides more quantitative analysis of phonation disorders, toward better medical or behavioral treatments for speech improvement. In the present study, we studied the correlation matrix of the vocal measures that indicate the period, amplitude, and nonlinear perturbations in sustained vowel records. The dominant vocal measures of MDVP: F0, MDVP: Jitter (%), DFA, and Spread2 were selected by the SFS method, which could reduce the dimensions for pattern analysis. The nonparametric kernel density estimation method established the visible bivariate distributions of the KPCA-mapped feature densities. Based on the estimated feature densities, the MAP classifier was able to provide excellent classification performance, superior to that of the FLDA classifier and the SVM with RBF kernels. The experimental results demonstrated the merits of feature dimension reduction and kernel density modeling for vocal pattern analysis. The highest true positive rate (sensitivity) and minimal number of misclassified subjects also showed the effectiveness of the MAP classifier in the detection of phonation disorders. The relatively small size of phonation data limited the nonlinear classification capability of the SVM classifier. From the present work, we also conclude that the gender is not a sensitive factor for phonation disorders, and the PD patients with minimal functional disability are more likely to be incorrectly identified in the dysphonia detection. Noting that half of the SFSselected vocal measures were generated by the nonlinear dynamics analysis tools, it is believed that the development of nonlinear vocal oscillation measurements has high potential in monitoring the progression of phonatory impairments in future studies.