Particle Swarm Optimization Based Feature Enhancement and Feature Selection for Improved Emotion Recognition in Speech and Glottal Signals

In recent years, many research works have used speech-related features for speech emotion recognition; however, recent studies show that there is also a strong correlation between emotional states and glottal features. In this work, Mel-frequency cepstral coefficients (MFCCs), linear predictive cepstral coefficients (LPCCs), perceptual linear predictive (PLP) features, gammatone filter outputs, timbral texture features, stationary wavelet transform based timbral texture features and relative wavelet packet energy and entropy features were extracted from emotional speech (ES) signals and their glottal waveforms (GW). Particle swarm optimization based clustering (PSOC) and wrapper based particle swarm optimization (WPSO) were proposed to enhance the discriminating ability of the features and to select the discriminating features, respectively. Three different emotional speech databases were used to evaluate the proposed method. An extreme learning machine (ELM) was employed to classify the different types of emotions. Different experiments were conducted, and the results show that the proposed method significantly improves speech emotion recognition performance compared to previous works published in the literature.


Introduction
Speech utterances of an individual can provide information about his/her health state, emotion, language and gender. Speech is one of the most natural forms of communication between individuals. Understanding an individual's emotion can be useful for applications such as web movies, electronic tutoring applications, in-car board systems, diagnostic tools for therapists and call-center applications. Most existing emotional speech databases contain three types of emotional speech recordings: simulated, elicited and natural. Simulated emotions tend to be more expressive than real ones and are the most commonly used. Elicited emotions are nearer to natural ones, but if the speakers know that they are being recorded, the expressed emotions become artificial. In the natural category, not all emotions may be available, and they are difficult to model because they are expressed completely naturally [1,2,3,4]. Most researchers have analyzed four primary emotions (anger, joy, fear and sadness) in either the simulated or the natural domain. High emotion recognition accuracies have been obtained for two-class emotion recognition (high arousal vs. low arousal), but multi-class emotion recognition remains challenging. This is due to the following reasons: (a) uncertainty about which speech features are information-rich and parsimonious, (b) different sentences, speakers, speaking styles and rates, (c) more than one perceived emotion in the same utterance, and (d) long-term/short-term emotional states. Several speech features have been successfully applied to speech emotion recognition, and they can be classified into four main groups: continuous features, qualitative features, spectral features and non-linear Teager energy operator based features [1,2,3,4].
Various types of classifiers have been proposed for speech emotion recognition such as hidden Markov model (HMM), Gaussian mixture model (GMM), support vector machine (SVM), artificial neural networks (ANN) and k-nearest neighbor (kNN) classifier [1,2,3,4].
Although several research works have been conducted in the field of speech emotion recognition, it is difficult to compare them directly due to inconsistencies in the division of the dataset, the number of emotions used, the number of emotional speech databases used, the type of database used (simulated, elicited or natural) and the lack of uniformity in the computation and presentation of the results. Most researchers have used 10-fold cross validation, conventional validation (one training set + one testing set) and speaker-dependent emotion recognition, and achieved excellent performance. Speaker-independent multi-class emotion recognition is still a challenging task due to the high degree of overlap among the speech features and the presence of irrelevant, redundant and noisy speech features. To improve the discrimination ability of the speech features and to select an optimal set of features with little or no loss of emotion recognition accuracy, new methods were proposed in this work. Various well-known speech features were extracted from the emotional speech and glottal signals. PSO is a population-based stochastic optimization method, and several PSO variants have been proposed in the literature for function optimization, clustering and feature selection [5,6,7,8,9,10,11]. PSO based algorithms were proposed in this work because PSO is very popular among researchers due to its simple mathematical operations, small number of control parameters, quick convergence and ease of implementation [5,6,7,8,9,10,11]. PSO based clustering was proposed to improve the discrimination ability of the extracted speech features, and wrapper based PSO was proposed to enhance the accuracy of speaker-independent multi-class emotion recognition by selecting only discriminative features. First, PSO based clustering was applied to the extracted feature set to find the centroids (cluster centers).
From the centers and the means of each feature, weights were calculated and multiplied with the original features to enhance their discrimination ability. From the weighted features, an optimal feature set was found using the proposed wrapper based PSO, in which three modifications were suggested. The proposed method has the following salient features: (1) enhancement of the discrimination ability of the extracted features using PSO based clustering; (2) selection of an optimal feature set, which improves the performance of the multi-class speech emotion recognition system.

Related Works
In [12], the authors proposed non-linear dynamic features together with prosodic and spectral features, and used an SVM classifier to classify seven emotions using the speech samples of the Berlin emotional speech database (BES). They achieved an emotion recognition accuracy of 82.72% for female speakers and 85.90% for male speakers using 10-fold cross validation. Nonlinear dynamic features and a neural network classifier were used to classify three emotions (neutral, fear and anger), and a maximum emotion recognition accuracy of 93.78% was obtained for the speaker-dependent case [13]. Modulation based spectral features and a multi-class SVM were used in [14] to classify seven classes of emotions, and a maximum emotion recognition accuracy of 85.60% was obtained. In [15], the authors used a combination of spectral excitation source features and an auto-associative neural network, and their emotion recognition accuracy was 82.16%. K. S. Rao et al. employed a combination of utterance-wise global and local prosodic features with an SVM classifier and obtained an emotion recognition accuracy of 62.43% [16]. In [17], the authors used LPCCs, formants and a GMM classifier for the classification of seven emotions, and the emotion recognition accuracy was 68%. Discriminative wavelet packet band power coefficients with a Daubechies filter order of 40 and a GMM classifier were used by Y. Li et al. in [18], and a maximum emotion recognition accuracy of 75.64% was obtained. Kotti M and Paternò F proposed several low-level audio descriptors and high-level perceptual descriptors and achieved a maximum emotion recognition accuracy of 87.7% in the speaker-independent case with a linear SVM [19]. MPEG-7 low-level audio descriptors and an SVM with radial basis function (RBF) kernel were used for the recognition of seven emotions, and the emotion recognition accuracy was 77.88% [20]. In [21], Mel-frequency cepstral coefficients (MFCCs) and signal energy were computed as features.
Correlation based feature selection with an SVM-RBF kernel was used, and this method was tested on the speech samples of the Surrey audio-visual expressed emotion database (SAVEE). The emotion recognition accuracy was 79%. Intensity of energy, pitch, standard deviation, jitter and shimmer were extracted as features to classify seven emotions using the audio samples of the SAVEE database; a kNN classifier was used and a maximum emotion recognition accuracy of 74.39% was obtained [22]. Several speech features, linear discriminant analysis (LDA) based feature reduction and a single-component Gaussian classifier were employed to classify seven emotions, and a maximum emotion recognition accuracy of 63% was achieved [23]. In [24], pitch, energy, duration and spectral based features were extracted and a Gaussian classifier was used to classify seven emotions using the audio samples of the SAVEE database. They achieved a maximum emotion recognition accuracy of 59.20%.
Though speech-related features are widely used for speech emotion recognition, there is a strong correlation between emotional states and features derived from glottal waveforms. The glottal waveform is significantly affected by the emotional state and speaking style of an individual [25,26,27,28,29,30,31,32]. Alexander I and Michael S have investigated the effectiveness of glottal features derived from the glottal airflow signal in recognizing emotions. Average emotion recognition rates of 66.5% for six emotions (happy, angry, sad, fear, surprise and neutral) and 99% for four emotions (happy, neutral, angry and sad) were achieved [25]. In [26,27,28], researchers investigated the relationship between emotional states and speech produced under stress, where the glottal waveform was affected by excessive tension or lack of coordination in the laryngeal musculature. The effectiveness of glottal features in the classification of clinical depression was analyzed by Moore et al. [30,31]. In [32], the authors proposed the glottal flow spectrum as a possible cue for depression and near-term suicide risk and obtained a correct recognition rate of 85%. Ling He et al. proposed wavelet packet energy entropy features for emotion recognition from speech and glottal signals with a GMM classifier [33]. They achieved average emotion recognition rates between 51% and 54% for the BES database. In [34], prosodic, spectral, glottal flow and AM-FM features were utilized and a two-stage feature reduction was proposed for speech emotion recognition. Overall emotion recognition rates of 85.18% (gender dependent) and 80.09% (gender independent) were achieved using an SVM classifier.

Materials and Methods
This section describes the materials and methods used in this work. We derived MFCCs, LPCCs, PLPs, gammatone filterbank outputs, timbral texture features, SWT based timbral texture features and relative wavelet packet energy and entropy features from emotional speech signals and their glottal waveforms. To extract the glottal and vocal tract characteristics from the speech waveform, inverse filtering and linear predictive analysis were used [41,42,43,44,45]. Feature selection and enhancement are essential tasks in any pattern recognition problem. A high degree of overlap among the features of different classes may degrade the performance of a speech emotion recognition system. To decrease the intra-class variance and to increase the inter-class variance among the features, PSO based clustering was suggested. After the PSO based feature enhancement algorithm was applied, the raw features are referred to as weighted features. The curse of dimensionality is a challenging issue in any pattern recognition problem. In speech emotion recognition research, several filter, wrapper and embedded feature selection methods are available in the literature to address the curse of dimensionality [35,36,37,38,39,40]. In this work, PSO based feature selection was proposed to select the discriminative weighted features. Both raw and weighted features were subjected to different experiments to validate their effectiveness in speech emotion recognition. An extreme learning machine with RBF kernel was used as the classifier to recognize different emotions. Fig. 1 shows the block diagram of the proposed emotion recognition system using PSO based feature enhancement and feature selection from emotional speech signals and their glottal waveforms.

Emotional Speech Databases
In this work, three different emotional speech databases were used for emotion recognition to test the robustness of the proposed method. The Berlin emotional speech database (BES) consists of speech utterances in the German language. Ten professional actors/actresses were used to simulate 7 emotions (Anger-Ang, Boredom-Bor, Disgust-Dis, Fear-Fea, Happiness-Hap, Sadness-Sad, Neutral-Neu) [46]. The Surrey audio-visual expressed emotion (SAVEE) database [24] is an audio-visual emotional database which includes seven emotion categories of speech and video signals (Anger-Ang, Disgust-Dis, Fear-Fea, Neutral-Neu, Happiness-Hap, Sadness-Sad and Surprise-Sur) from four native English male speakers aged from 27 to 31 years. For each emotion, 15 TIMIT sentences were recorded: 3 common, 2 emotion-specific and 10 generic sentences. In this work, only the audio samples were utilized. The Sahand Emotional Speech database (SES) was recorded at the Artificial Intelligence and Information Analysis Lab, Department of Electrical Engineering, Sahand University of Technology, Iran [47]. This database contains speech utterances of five basic emotions (Neutral-Neu, Surprise-Sur, Happiness-Hap, Sadness-Sad and Anger-Ang) from 10 speakers (5 male and 5 female). Ten single words, 12 sentences and 2 passages in the Farsi language were recorded, which results in a total of 120 utterances per emotion. Table 1 gives the number of speech samples per emotion.

Feature Extraction for Speech Emotion Recognition
In the design of a speech emotion recognition system, extraction of the most informative features for efficiently characterizing different emotions is still an open issue. Researchers have commonly used short-term features obtained by frame-by-frame analysis. As the emotional speech signals were recorded at different sampling frequencies, all the emotional speech samples were down-sampled to 8 kHz for convenience. The unvoiced portions between words were removed by segmenting the down-sampled emotional speech signals into non-overlapping frames with a length of 32 ms (256 samples) and examining the energy of each frame. Frames with low energy were discarded and the remaining frames (voiced portions) were concatenated and used for feature extraction [33]. The emotional speech signals (voiced portions only) were then passed through a first-order pre-emphasis (high-pass) filter to spectrally flatten the signal and to make it less susceptible to finite precision effects later in the signal processing [48]. The first-order pre-emphasis filter is defined as $H(z) = 1 - a z^{-1}$. The commonly used value of a is 15/16 = 0.9375 or 0.95 [48]. In this work, the value of a was set to 0.9375. Extraction of the glottal flow signal from the speech signal is a challenging task. In this work, glottal waveforms were estimated from the pre-emphasized speech waveforms using inverse filtering and linear predictive analysis.
Mel-frequency cepstral coefficients (MFCCs). After pre-emphasis, the emotional speech signals/glottal signals were segmented into frames and windowed with a Hamming window to minimize signal discontinuities and spectral distortion. The fast Fourier transform (FFT) was applied to calculate the spectrum of each frame, followed by Mel-scaled mapping to obtain the spectrum in the Mel domain. The Mel-frequency scale has linear frequency spacing below 1 kHz and logarithmic spacing above 1 kHz. The logarithmic Mel spectrum was obtained by taking the logarithm of the Mel filter outputs. Finally, MFCCs were generated for each frame using the discrete cosine transform (DCT) [49]. After obtaining the MFCCs for each frame, they were averaged over all frames. In total, 48 MFCC features were extracted (24 MFCCs from emotional speech signals + 24 MFCCs from emotional glottal signals).
Linear predictive cepstral coefficients (LPCCs). 36 LPCCs (18 LPCCs from emotional speech signals + 18 LPCCs from emotional glottal signals), which are the coefficients of the Fourier transform representation of the log magnitude spectrum, were derived from the LPC coefficients. The steps involved in the extraction of LPC coefficients are as follows: pre-emphasis, frame-blocking, windowing, autocorrelation analysis and conversion of the autocorrelation coefficients to an LPC parameter set using Durbin's method [48]. LPC orders from 8 to 16 were examined and the order was fixed at 12. After obtaining the LPCCs for each frame, they were averaged over all frames.
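The pre-processing described above (energy-based removal of low-energy frames followed by pre-emphasis with a = 0.9375) can be illustrated with a minimal Python sketch. The function names, the relative energy threshold and the synthetic test signal are assumptions made for illustration only; the text does not state which threshold rule was used.

```python
import numpy as np

def drop_low_energy_frames(signal, frame_len=256, rel_threshold=0.1):
    """Split into non-overlapping frames and keep those with enough energy.

    The rule "energy > rel_threshold * mean frame energy" is an assumed
    threshold; the source only says that low-energy frames were discarded.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sum(frames ** 2, axis=1)
    keep = energy > rel_threshold * np.mean(energy)
    return frames[keep].ravel()          # concatenate the retained frames

def pre_emphasize(signal, a=0.9375):
    """First-order pre-emphasis y[n] = x[n] - a * x[n-1] (first sample kept)."""
    signal = np.asarray(signal, dtype=float)
    return np.append(signal[0], signal[1:] - a * signal[:-1])

# Synthetic 8 kHz example: two tone bursts separated by silence.
fs = 8000
t = np.arange(2048) / fs
speech_like = np.concatenate([
    np.sin(2 * np.pi * 200 * t),
    np.zeros(1024),                      # 4 silent frames of 256 samples
    np.sin(2 * np.pi * 300 * t),
])
voiced = drop_low_energy_frames(speech_like)
emphasized = pre_emphasize(voiced)
```

In this toy signal the four silent frames are removed, so 16 of the 20 frames are retained before pre-emphasis.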
Gammatone filterbank outputs (GTFBOs). The gammatone filterbank was originally proposed by Roy Patterson and his colleagues in 1992 to provide a good approximation of the human auditory filter and to visualize sound as a time-varying distribution of energy [50,51]. The pre-emphasized speech and glottal waveforms were fed into the gammatone filterbank. Twenty-four gammatone filterbank outputs were used in this work. A total of 48 gammatone filterbank outputs (24 from each emotional speech signal + 24 from its glottal waveform) were derived.
Perceptual linear predictive (PLP) analysis. It is a combination of short-term spectral analysis and LP analysis. It uses three basic concepts from the psychophysics of hearing, namely the critical-band spectral resolution, the equal-loudness curve and the intensity-loudness power law, to derive an estimate of the auditory spectrum. This auditory spectrum was then approximated using the autocorrelation method of all-pole modeling, and the resulting autoregressive coefficients were transformed into cepstral parameters [52,53]. 26 PLP coefficients (13 PLP coefficients from emotional speech signals + 13 PLP coefficients from emotional glottal signals) were derived for each frame and averaged over all frames.
Timbral texture features (TTFs). Timbral texture features were originally proposed for music-speech discrimination and speech recognition [54,55]. The feature vector describing timbral texture consists of the following features: spectral centroid, spectral flux, spectral rolloff, energy entropy, short-time energy and zero-crossing rate [54,55]. After obtaining the timbral texture features for each emotional speech and glottal signal, the following statistical parameters were computed for each feature: the standard deviation, the ratio of the maximum to the standard deviation, the ratio of the maximum to the median, and the ratio of the squared standard deviation to the squared mean. A total of 48 features (6 timbral texture features x 4 statistical features = 24 from each emotional speech signal + 6 timbral texture features x 4 statistical features = 24 from its glottal waveform) were derived.
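As an illustration of the four statistical parameters listed above, the following minimal Python sketch computes them for one frame-wise feature track. The function name and the example values are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def timbral_statistics(track):
    """Return [std, max/std, max/median, std^2/mean^2] for one feature track."""
    track = np.asarray(track, dtype=float)
    std = np.std(track)                  # population standard deviation (assumed)
    return np.array([
        std,
        np.max(track) / std,
        np.max(track) / np.median(track),
        std ** 2 / np.mean(track) ** 2,
    ])

# Hypothetical zero-crossing-rate values for five consecutive frames.
zcr_track = np.array([0.10, 0.20, 0.30, 0.40, 0.50])
stats = timbral_statistics(zcr_track)
```

Applying this to each of the 6 timbral texture features yields the 6 x 4 = 24 values per signal stated above.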
SWT based timbral texture features (SWT-TTFs). The pre-emphasized emotional speech signals and glottal waveforms were decomposed into five levels using the SWT with a 10th order Daubechies wavelet. The Daubechies wavelet was chosen due to the following properties [56]: time invariance, fast computation and sharp filter transition bands. Timbral texture features (energy entropy, short-time energy, zero-crossing rate, spectral rolloff, spectral centroid and spectral flux) were extracted from the decomposed stationary wavelet coefficients (CA5, CD5, CD4, CD3, CD2 and CD1). After obtaining the timbral texture features for each set of decomposed stationary wavelet coefficients, the following statistical parameters were computed for each feature: the standard deviation, the ratio of the maximum to the standard deviation, the ratio of the maximum to the median, and the ratio of the squared standard deviation to the squared mean. A total of 288 features (6 timbral texture features x 4 statistical features x 6 subbands = 144 from each emotional speech signal + 6 timbral texture features x 4 statistical features x 6 subbands = 144 from its glottal waveform) were derived after SWT decomposition.
Relative wavelet packet energy and entropy features (RWPFs). The pre-emphasized emotional speech signals and glottal waveforms were segmented into 32 ms frames with 50% overlap. Each frame was decomposed into 4 levels using the discrete wavelet packet transform with a 10th order Daubechies wavelet, and relative wavelet packet energy and entropy features were derived for each of the decomposition nodes as given in Equations (4) and (7).
Relative wavelet packet energy:
$$\mathrm{RWPEGY} = \frac{\mathrm{EGY}_{j,k}}{\mathrm{EGY}_{tot}} \tag{4}$$
Relative wavelet packet entropy:
$$\mathrm{RWPEPY} = \frac{\mathrm{EPY}_{j,k}}{\mathrm{EPY}_{tot}} \tag{7}$$
where $j = 1, 2, 3, \ldots, m$, $k = 0, 1, 2, \ldots, 2^{m}-1$, m is the number of decomposition levels and L is the length of the wavelet packet coefficients at each node (j,k). A four-level wavelet packet decomposition gives 30 wavelet packet nodes, and features were extracted from all the nodes, which yields 60 features (30 relative energy features + 30 relative entropy features). The same features were extracted from the emotional glottal signals, giving a total of 120 features. After obtaining the 120 relative wavelet packet energy and entropy features for each frame, they were averaged over all frames.
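The relative energy of Equation (4) and the node count of a four-level decomposition can be illustrated with a short Python sketch. It assumes that the node energy is the sum of squared coefficients and that the total energy is the sum over all supplied nodes; both are assumptions, because the defining equations for the node and total energies are not reproduced in this section. Random arrays stand in for real wavelet packet coefficients, and the function name is hypothetical.

```python
import numpy as np

def relative_energies(node_coeffs):
    """Relative energy per node: node energy divided by the summed energy.

    Assumes energy = sum of squared coefficients and that the total is taken
    over all supplied nodes (illustrative assumptions, not from the source).
    """
    energies = np.array([np.sum(np.square(c)) for c in node_coeffs])
    return energies / np.sum(energies)

# A 4-level wavelet packet tree has 2 + 4 + 8 + 16 nodes below the root.
n_nodes = sum(2 ** j for j in range(1, 5))

# Stand-in coefficients; a real pipeline would take them from a wavelet
# packet decomposition of each 32 ms frame.
rng = np.random.default_rng(0)
nodes = [rng.standard_normal(16) for _ in range(n_nodes)]
rel_energy = relative_energies(nodes)
```

The 30 relative energies, together with 30 relative entropies computed analogously, give the 60 features per signal stated above.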

PSO clustering for Feature Enhancement
Clustering methods have been widely used in various fields, such as statistics, software engineering, biology, psychology and other social sciences, to group similar objects/instances in large amounts of data [57,58,59,60]. In any pattern recognition application, increasing the inter-class variance and decreasing the intra-class variance of the attributes or features are fundamental to improving the classification/recognition accuracy [57,58,59,60]. High intra-class variance and low inter-class variance among the features may degrade the performance of classifiers, which results in poor emotion recognition rates. To decrease the intra-class variance and to increase the inter-class variance among the features, PSO based clustering was suggested in this work to improve the discriminative ability of the extracted features. In 1995, Eberhart RC and Kennedy J originally proposed a stochastic optimization approach called PSO [61]. The main problem with PSO is that particles can become trapped in a local optimum. Van der Merwe D and Engelbrecht AP suggested PSO for data clustering and obtained promising results [60]. Inspired by the social interaction of humans in a global neighbourhood, Cohen SC and de Castro LN proposed PSO based clustering to organize the data points into clusters based on the interdependence of each particle [62]. In 2010, a modified PSO based clustering that does not require velocity and inertia weight in the update procedure was proposed by Szabo [58,59]. Mitchell Yuwono et al. proposed a simple modification to reduce the time complexity by reducing the frequency of distance matrix updates [8]. Motivated by these works, PSO based clustering was suggested to enhance the discrimination ability of the extracted features. The task of the PSO here is to search for the appropriate cluster centres such that the clustering metric (Euclidean distance) is minimized [6,8,58,59,60,62].
The steps involved in the PSO based clustering [6,8,58,59,60,62] are as follows:
Input: feature dataset; K: number of classes (emotions)
Output: the locations of the K centroids (cluster centers)
PSO_clustering(data, K)
Generate the particles; each solution has its own K cluster centers selected randomly from the dataset.
For each particle
Objective function = min(Euclidean distance)
Update p_gd
End
In the velocity update, w is an inertia weight which plays an important role in balancing local and global search and is usually decreased during the iterations [w(t+1) = 0.85 × w(t)] [8]. c_1 and c_2 are two positive acceleration constants, and both were fixed at 2. The initial value of w was fixed at 0.9 and the maximum number of iterations was fixed at 100 [8]. Particles that became trapped in a local optimum were reset to zero. The working of PSO based clustering as a feature enhancement method is summarized (in Fig. 2) as follows: first, the appropriate cluster centers of each feature in the dataset were found using PSO based clustering. Next, the ratios of the means of the features to their respective cluster centers were calculated. Finally, these ratios were multiplied with the respective features to enhance their discriminative quality between the groups/classes.
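The clustering and weighting steps summarized above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses the standard global-best PSO velocity update, takes the sum of Euclidean distances of each sample to its nearest center as the objective, and computes the weight of each feature as its mean divided by the mean of its K cluster centers. The swarm size, function names and toy data are hypothetical, and the reset of trapped particles is omitted.

```python
import numpy as np

def clustering_cost(centers, data):
    """Sum of Euclidean distances from each sample to its nearest center."""
    dist = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return dist.min(axis=1).sum()

def pso_clustering(data, k, n_particles=20, n_iter=100,
                   w=0.9, c1=2.0, c2=2.0, seed=0):
    rng = np.random.default_rng(seed)
    n_samples = data.shape[0]
    # Each particle holds K centers drawn at random from the dataset.
    pos = data[rng.integers(0, n_samples, size=(n_particles, k))]
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_cost = np.array([clustering_cost(p, data) for p in pos])
    g = np.argmin(pbest_cost)
    gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        cost = np.array([clustering_cost(p, data) for p in pos])
        improved = cost < pbest_cost
        pbest[improved] = pos[improved]
        pbest_cost[improved] = cost[improved]
        g = np.argmin(pbest_cost)
        if pbest_cost[g] < gbest_cost:
            gbest, gbest_cost = pbest[g].copy(), pbest_cost[g]
        w *= 0.85                        # inertia decay w(t+1) = 0.85 * w(t)
    return gbest, gbest_cost

def weight_features(data, centers):
    """Scale each feature by (feature mean) / (mean of its cluster centers).

    Averaging the K centers per feature is an assumption; it also requires
    that this average is non-zero.
    """
    weights = data.mean(axis=0) / centers.mean(axis=0)
    return data * weights

# Toy data: two well-separated groups with three features each.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(2.0, 0.1, (20, 3)), rng.normal(5.0, 0.1, (20, 3))])
centers, cost = pso_clustering(data, k=2)
weighted = weight_features(data, centers)
```

The best-so-far cost never increases across iterations because the global best is only replaced by a strictly better particle.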

Feature Selection using PSO
Feature selection is an essential step prior to classification: it eliminates redundant features, selects parsimonious, information-rich features and helps to avoid overfitting during classification [63,64,65,66]. Feature transformation and selection algorithms are commonly used to reduce the feature dimension and to select the most informative features. In this work, PSO based feature selection was proposed to select the most information-rich weighted features. The flowchart of the proposed PSO based feature selection is shown in Fig. 3. Conventionally, particles are initialized randomly. In this work, however, a mixed initialization strategy was used. In this strategy, 50% of the particles were initialized using a small number of features (10% of the total features) and the other particles were initialized using a large number of features (60% of the total features) [11].
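The mixed initialization strategy can be sketched as follows. The function name, the swarm size of 30 and the use of uniform random sampling without replacement are assumptions for illustration; only the 50% split and the 10%/60% feature fractions come from the text.

```python
import numpy as np

def mixed_initialization(n_particles, n_features, small=0.10, large=0.60, seed=0):
    """Binary swarm: half the particles select ~10% of the features and the
    other half select ~60%, with the selected features chosen at random."""
    rng = np.random.default_rng(seed)
    positions = np.zeros((n_particles, n_features), dtype=int)
    n_small_particles = n_particles // 2
    for i in range(n_particles):
        fraction = small if i < n_small_particles else large
        n_selected = max(1, int(round(fraction * n_features)))
        chosen = rng.choice(n_features, size=n_selected, replace=False)
        positions[i, chosen] = 1         # 1 = feature selected, 0 = not selected
    return positions

# 614 features (307 speech + 307 glottal); the swarm size is hypothetical.
swarm = mixed_initialization(n_particles=30, n_features=614)
```

With 614 features, the first half of the swarm starts with 61 selected features per particle and the second half with 368.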
The main step in PSO based feature selection is the fitness evaluation procedure. Generally, two popular measures, classification accuracy and error rate, are used in designing a fitness function. However, these measures are unsuitable for measuring the quality of the particles when dealing with an imbalanced dataset, as they give a misleading picture of the classification performance by emphasizing the influence of the majority class [67]. Hence, in this work, a new fitness function was developed to evaluate the fitness of each particle, in which the classification performance was evaluated through the geometric mean (G-mean).
In this fitness function (Equation 8), α sets the relative importance of the classification performance (G-mean) and (1-α) sets the relative importance of the number of features. As the classification performance is more important than the number of features, the value of α was fixed at 0.8. Based on the fitness function (Equation 8), the quality of each particle was calculated. After evaluating the fitness of all particles, the algorithm updates pbest and gbest, and then updates the velocity and position of each particle. pbest and gbest were updated in two situations. In the first situation, the current pbest was updated if the classification performance (G-mean) of the particle's new position was better than that of the previous pbest and the number of features was not larger than that of the previous pbest. In the second situation, the current pbest was updated if the number of features was smaller than that of the previous pbest and the classification performance (G-mean) of the new position was the same as or better than that of the current pbest. gbest was updated in the same way [11]. The position of a particle represents a selected feature subset. In our binary PSO, a v-shaped transfer function was applied to transform the velocity from continuous space to probability space [9]. As a v-shaped transfer function was used, the corresponding position updating rule from [9] was applied.
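The role of a v-shaped transfer function in binary PSO can be illustrated with the Python sketch below. The particular function |tanh(v)| is an assumption, chosen as one common member of the v-shaped family; the text does not state which v-shaped function was used. The update rule shown is the one usually paired with v-shaped functions: a bit is complemented with probability T(v) and otherwise kept.

```python
import numpy as np

def v_transfer(velocity):
    """Assumed v-shaped transfer function T(v) = |tanh(v)|, with values in [0, 1]."""
    return np.abs(np.tanh(velocity))

def update_position(position, velocity, rng):
    """Complement a bit with probability T(v); otherwise keep it unchanged."""
    flip = rng.random(position.shape) < v_transfer(velocity)
    return np.where(flip, 1 - position, position)

rng = np.random.default_rng(0)
position = np.array([1, 0, 1, 0])
velocity = np.array([0.0, 0.0, 50.0, -50.0])
new_position = update_position(position, velocity, rng)
# Zero velocity keeps a bit; very large |velocity| flips it: [1 0 0 1]
```

Unlike an s-shaped (sigmoid) transfer function, a v-shaped function depends only on the magnitude of the velocity, so a particle with small velocity tends to keep its current feature subset.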
The PSO simulation stops when a pre-defined stopping criterion, e.g., the maximum number of iterations or an optimal fitness value, has been reached. The maximum number of iterations was fixed at 100. Particles that became trapped in a local optimum were reset to zero. The initial value of w was set to 1.4 and was changed adaptively during the iterations following [68], as a function of the maximum number of iterations t_max and the current iteration t.
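Returning to the fitness evaluation described in this section, the following Python sketch shows why G-mean is preferred over accuracy for imbalanced data and how it could be combined with the subset size. G-mean is computed here as the geometric mean of the per-class recalls. The combined fitness is an assumed form, since Equation 8 is not reproduced above: it rewards G-mean with weight α and the fraction of discarded features with weight (1-α). All function names are hypothetical.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    recalls = np.array([np.mean(y_pred[y_true == c] == c) for c in classes])
    return float(np.prod(recalls) ** (1.0 / len(classes)))

def fitness(gmean_value, n_selected, n_total, alpha=0.8):
    """Assumed fitness: alpha * G-mean + (1 - alpha) * fraction of features dropped."""
    return alpha * gmean_value + (1.0 - alpha) * (1.0 - n_selected / n_total)

# Imbalanced toy case: predicting only the majority class looks accurate
# but completely misses the minority class.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.zeros(10, dtype=int)
accuracy = float(np.mean(y_true == y_pred))   # 0.8
gm = g_mean(y_true, y_pred)                    # 0.0
```

Because G-mean is a product of recalls, a classifier that ignores any single emotion class receives a score of zero regardless of its overall accuracy.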

Extreme Learning Machine
A learning algorithm for single hidden layer feedforward networks (SLFNs), called ELM, was proposed by G.B. Huang et al. [69,70,71,72]. It has been widely used in various applications to overcome the slow training speed and over-fitting problems of conventional neural network learning algorithms [69,70,71,72]. The idea of ELM is briefly as follows [69,70,71,72]. For N given training samples, the output of an SLFN with L hidden nodes can be written as f(x) = h(x)β, where x_j, w_i and b_i are the input training vector, the input weights and the biases of the hidden layer, respectively, β_i is the output weight that links the i-th hidden node to the output layer and g(.) is the activation function of the hidden nodes. Training an SLFN amounts to finding a least-squares solution using the Moore-Penrose generalized inverse $H^{\dagger}$, where $H^{\dagger} = (H'H)^{-1}H'$ or $H'(HH')^{-1}$, depending on the singularity of H'H or HH'. Assuming that H'H is not singular, the reciprocal of a positive regularization coefficient is added to the diagonal of H'H in the calculation of the output weights β_i. Hence, a more stable learning system with better generalization performance can be obtained. The output function of ELM can then be written compactly in kernel form. In this kernel ELM implementation, the hidden layer feature mapping need not be known to the user, and a Gaussian kernel was used. The best values of the positive regularization coefficient and the Gaussian kernel parameter were found empirically after several experiments.
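A kernel ELM classifier of the kind described above can be sketched in a few lines of Python. The sketch follows the commonly cited kernel ELM formulation, in which the output weights are obtained by solving (Ω + I/C)A = T with Ω the training kernel matrix. The symbol names C (regularization coefficient) and gamma (Gaussian kernel parameter), the ±1 target coding, the parameter values and the toy data are assumptions for illustration, not values from this work.

```python
import numpy as np

def gaussian_kernel(a, b, gamma):
    """K(u, v) = exp(-gamma * ||u - v||^2) for all row pairs of a and b."""
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    return np.exp(-gamma * np.maximum(sq, 0.0))

class KernelELM:
    def __init__(self, C=100.0, gamma=0.5):
        self.C = C              # positive regularization coefficient (assumed name)
        self.gamma = gamma      # Gaussian kernel parameter (assumed name)

    def fit(self, X, y):
        self.X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes = np.unique(y)
        # One column per class: +1 for the true class, -1 otherwise.
        T = np.where(y[:, None] == self.classes[None, :], 1.0, -1.0)
        omega = gaussian_kernel(self.X, self.X, self.gamma)
        # Regularized least squares: (omega + I / C) A = T
        self.A = np.linalg.solve(omega + np.eye(len(self.X)) / self.C, T)
        return self

    def predict(self, X):
        k = gaussian_kernel(np.asarray(X, dtype=float), self.X, self.gamma)
        return self.classes[np.argmax(k @ self.A, axis=1)]

# Toy two-class problem with well-separated groups.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
model = KernelELM().fit(X_train, y_train)
predictions = model.predict(np.array([[0.2, 0.2], [5.5, 5.5]]))
```

Training reduces to a single linear solve, which is the source of the fast training speed attributed to ELM; the two kernel parameters would be tuned empirically, as described above.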

Emotion Recognition Results
From the literature, it can be observed that high emotion recognition rates can be achieved when discriminating between high-activation and low-activation emotions; however, recognition between different emotions (multi-class) is still challenging. To improve the speaker-independent emotion recognition accuracy, we suggested a PSO based feature enhancement and feature selection method. In addition to speaker-independent (SI) emotion recognition, we also conducted experiments in speaker-dependent (SD) and gender-dependent (GD-male and GD-female) settings. Three different emotional speech databases were used to evaluate the robustness of the proposed method. Glottal waveforms were derived from the speech utterances. A total of 614 features were derived from the speech utterances (307 features) and the glottal waveforms (307 features). PSO based clustering was used to enhance the discriminative ability of the extracted features, and PSO based feature selection was proposed to select the best weighted features. A modified particle initialization, a modified pbest and gbest update scheme and a new fitness function were used to improve the feature selection process. A kernel ELM classifier was used. The proposed method was implemented in MATLAB on a laptop with an Intel Core i7 2.2 GHz processor and 4 GB RAM. Figs. 4 and 5 depict the class distribution plots of the raw and weighted features for the BES database. Fig. 4 shows a high degree of overlap among the raw features. Fig. 5 shows that, after PSO based feature enhancement, the weighted features provide a more separable class distribution. Twenty-five independent simulations (runs) of PSO based clustering and PSO based feature selection were conducted. Table 2 provides the details of the weighted features selected using the proposed wrapper based PSO.
The most frequently selected weighted features were identified over the twenty-five independent PSO runs and used for the emotion recognition experiments. Tables 3, 4 and 5 show the average emotion recognition results in terms of confusion matrices for the raw, weighted and selected weighted features under the different experiments.
According to Table 3 (BES database) (Table 5), the average emotion recognition rates (seven emotions) using all the weighted features improved from 72.32% (SD/GD) and 35.00% (SI) to 98.96% (SD/GD) and 75.36% (SI). Average emotion recognition rates of 94.01% in the SD experiment and 69.13% in the SI experiment were achieved using the best weighted features (an average of 27 weighted features from speech signals + an average of 26 weighted features from glottal waveforms). The results obtained for the BES and SAVEE databases were significantly better than the results presented in the literature. A paired t-test was performed at a significance level of 0.05 on the emotion recognition results obtained using the raw and weighted features. In almost all cases, the emotion recognition results obtained using the weighted features were significantly better than those obtained using the raw features. These experiments show that higher multi-class emotion recognition rates were obtained using the weighted features than using the raw features.

Conclusions
Improved speaker-independent multi-class emotion recognition can provide better communication between human and machine. In this study, we investigated the effectiveness of PSO based clustering and feature selection in enhancing the extracted speech features and in improving the multi-class speaker-independent emotion recognition accuracy.
Emotion recognition experiments were conducted with three different emotional speech databases using the proposed method. Both speech and glottal waveforms were subjected to feature extraction. Four different experiments (SD, SI, GD-Male and GD-Female) were conducted. After PSO based clustering, the discrimination ability of the extracted features was improved, which provides higher emotion recognition accuracy. Fewer than 10% of the total weighted features were selected by the PSO based feature selection with the improved fitness function. The experimental results demonstrate the merits of the proposed method in the field of emotion recognition. The high emotion recognition accuracy obtained in all experiments also shows the effectiveness of the kernel ELM classifier. From the results, we can also conclude that the proposed method yielded higher emotion recognition accuracy than the state-of-the-art works in the literature for the emotional speech databases under test. In future work, the results of the proposed PSO based clustering and feature selection will be compared with other counterparts. The proposed method will be tested using larger and more naturalistic corpora. The cross-cultural and cross-linguistic validity of the proposed method will also be examined.