SUST Bangla Emotional Speech Corpus (SUBESCO): An audio-only emotional speech corpus for Bangla

SUBESCO is an audio-only emotional speech corpus for the Bangla language. The total duration of the corpus is in excess of 7 hours, containing 7000 utterances, and it is the largest emotional speech corpus available for this language. Twenty native speakers participated in the gender-balanced set, each recording 10 sentences simulating seven targeted emotions. Fifty university students participated in the evaluation of this corpus. Each audio clip of this corpus, except those of the Disgust emotion, was validated four times by male and female raters. Raw hit rates and unbiased hit rates were calculated, producing scores above the chance level of responses. The overall recognition rate was above 70% for human perception tests. Kappa statistics and intra-class correlation coefficient scores indicated a high level of inter-rater reliability and consistency of this corpus evaluation. SUBESCO is an Open Access database, licensed under Creative Commons Attribution 4.0 International, and can be downloaded free of charge from the web link: https://doi.org/10.5281/zenodo.4526477.


Introduction
While communicating, people try to understand the content of each other's speech as well as the active emotion of the speaker. This is conveyed by body language and speech delivery. The study of emotion recognition is important for understanding human behaviors and relationships, both in social studies and in human-computer interaction (HCI). It is also important for understanding some physiological changes in humans. Research on Speech Emotion Recognition (SER) has been drawing increasing attention from researchers over the last two decades. The first requirement of a functional SER system is a corpus containing useful emotional content. Studies show that emotional expressions vary from culture to culture [1,2]. For example, in Japanese culture, expressing some strong emotions, like anger, is considered to be anti-social [3], whereas this may be different in other cultures. A specific language does not only represent some alphabetic symbols and rules; it also points to a specific culture. For cross-language experiments of emotions, the researchers need different emotional speech  [4]; English vs German [5]; Chinese vs German and Italian languages [6], and so on. The results show that the SER models performed best for the particular languages on which they were trained. The motivation for creating SUBESCO is to develop a separate dataset for the Bangla language to facilitate the construction of SER systems, as using a non-Bangla dataset to train a Bangla SER system may yield a poorer recognition rate.
In recent years, much research has been conducted on SER for other languages [7][8][9]. For the Bangla language, a useful linguistic resource for emotion recognition is still lacking, although a few natural speech corpora are available for this language [10][11][12]. However, a natural speech corpus is not suitable for speech emotion recognition because all the sounds are recorded in the neutral mode. To study different types of emotions, a corpus has to consist of a high number of, ideally equally distributed, speech utterances for each emotion. An emotional audio corpus is also important for speech synthesis to produce speech with emotional content [13]. Development of such an emotional corpus is considered to be relatively expensive, as professional speakers are needed to express natural-like emotional speech. Also, a high-quality audio lab is required for recordings to preserve the clear spectral details in the speech [14] for scientific analysis. The ultimate goal of this study was to build a validated emotional speech corpus as a linguistic resource that will be useful for prosodic analysis of emotional features and emotion recognition for the Bangla language. Researchers can use this dataset to train ML models for classifying basic emotions from Bangla speech. It will also allow linguistic analyses comparing Bangla to other related regional languages, e.g., Hindi and Punjabi.

Related works
Though a few attempts [15] have been made to develop resources for speech emotion analysis for Bangla, they are limited to a few hundred utterances. This limits the development of SER systems based on deep learning approaches, which often need thousands of training inputs. Recently, a group of researchers developed a small dataset of 160 sentences to implement their proposed method of speech emotion recognition [16]. The acted emotions were happy, sad, angry, and neutral; this dataset, with 20 speakers, lacks perceptual evaluation data. The most important feature of SUBESCO is that it is, to date, the only human-validated, gender-balanced emotional speech corpus for Bangla. The literature shows a number of notable studies on the development and evaluation of emotional corpora for other languages. American English has a rich collection of linguistic resources for emotion analysis research. RAVDESS [17] is a recently released audio-visual multi-modal emotional corpus for American English. It consists of 7356 recordings delivered by 24 professional actors and validated by more than 250 validators. The simulated emotions for this corpus are calm, happy, sad, angry, fearful, surprise, and disgust. Another remarkable emotional corpus, IEMOCAP [18], was developed for this language a decade before RAVDESS was built. IEMOCAP contains speech, facial expressions, head motion, and hand gestures for happiness, anger, sadness, frustration and the neutral state, consisting of approximately 12 hours of data validated by 6 listeners. The EMOVO [13] speech dataset was built for the Italian language, with 6 actors simulating the six emotions of disgust, fear, anger, joy, surprise and sadness. Two different groups of 24 listeners validated the corpus in a human perception test. A database of German emotional speech [14] developed by Burkhardt et al. consists of 800 sentences recorded from the voices of 10 German actors. Seven emotions (neutral, anger, fear, joy, sadness, disgust, and boredom) were acted out by the speakers, and 20 subjects participated in the evaluation task. James et al. developed an open-source speech corpus [19] for 5 primary emotions and 5 secondary emotions, consisting of a total of 2400 sentences in New Zealand English. There were 120 participants in the subjective evaluation of the corpus. An Arabic speech corpus [20] of 1600 sentences was developed by Meftah et al., covering five emotions (neutral, sadness, happiness, surprised, and questioning) acted out by 20 speakers. Later, it was evaluated [21] by nine listeners in a human perception test. There is an emotional speech corpus [22] of 12000 Indian Telugu emotional speech utterances involving eight emotions: anger, compassion, disgust, fear, happy, neutral, sarcastic, and surprise. It used 25 listeners to test human perception of the recorded audios. Recently, similar research has been done for Chinese Mandarin [23], Bahasa Indonesia [24], Greek [25], Danish [26], and several other languages. Depending on the purpose of the emotional corpus, some studies involve multiple languages [27]. Some emotional databases were built using recorded data from human conversations [28,29], while others were developed using synthesis systems [30]. Sometimes an emotional corpus involves crowdsourcing for recording evaluation [31]. Ververidis and Kotropoulos [32] have reviewed some important emotional speech databases for different languages.

Ethics declaration
This corpus has been developed as part of a Ph.D. research project in the Department of Computer Science and Engineering of Shahjalal University of Science and Technology, Sylhet, Bangladesh. Permission was granted by the Ethical Committee of the University to involve human subjects in the research work for the development of the corpus. Written consent was also obtained from all the speakers to publish their voice data and photographs. There were also paid human volunteers involved in data preparation and evaluation experiments. They also gave their written consent to publish their evaluation results and information, if necessary.

Scope of the database
The developed corpus consists of voice data from 20 professional speakers, of whom 10 are male and 10 are female (age: 21 to 35, mean = 28.05 and SD = 4.85). Audio recording was done in two phases, with 5 males and 5 females participating in each phase. The gender balance of this corpus has been ensured by keeping an equal number of male and female speakers and raters. Seven emotional states were recorded for 10 sentences, and five trials were preserved for each emotional expression. Hence, the total number of utterances is 10 sentences × 5 repetitions × 7 emotions × 20 speakers = 7000 recordings. The length of each clip has been kept fixed at 4 sec by removing only silences while preserving the full words. The total duration of the database is: 7000 recordings × 4 sec = 28000 sec = 466 min 40 sec = 7 hours 46 min 40 sec. Considering the number of audio clips and the total length of the corpus, this is at present the largest available emotional speech database for the Bangla language. A summary of the database is given in Table 1.
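The size arithmetic above can be checked with a few lines of Python. This is only a sketch that uses the counts stated in the text; the variable names are illustrative and not part of the corpus tooling.

```python
# Corpus size, using only the counts stated in the text.
SENTENCES = 10
TAKES_PER_EMOTION = 5
EMOTIONS = 7
SPEAKERS = 20
CLIP_SECONDS = 4  # every clip is cut to a fixed 4 s length

total_clips = SENTENCES * TAKES_PER_EMOTION * EMOTIONS * SPEAKERS
total_seconds = total_clips * CLIP_SECONDS

# Break the total duration into hours, minutes and seconds.
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)

print(total_clips)                              # 7000
print(total_seconds)                            # 28000
print(f"{hours} h {minutes} min {seconds} s")   # 7 h 46 min 40 s
```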

Selection of emotion categories
The most challenging task of emotion recognition is to find the distinguishing features of target emotions. Researchers are still searching for a consensual definition of emotion. Since only voice data is being considered here for the data base design, for correct emotion recognition it is necessary to consider the emotional states which have intense impacts on voice data. For the corpus development, the set of six fundamental emotions [33]: anger, disgust, fear, happiness, sadness, surprise were considered along with neutral emotional state. These are probably the most frequently occurring emotions in all cultures around the world [34]. Neutral can be compared to the Peaceful or Calm state. For this dataset, all the speakers had to simulate these seven emotions for the target 10 sentences. This corpus consists of an equal number of recorded audio clips for all emotions, indicating that the corpus has balanced data in terms of desired emotional states.

Type of emotion: Acted or real?
Databases of emotional speech corpus may be classified into three types [35] based on the nature of the speech collected: speeches collected in-the-wild, simulated or acted emotional speech database, and elicited emotional speech database. In type 1, real-life emotional speeches are collected from a free and uncontrolled environment for analysis. For example, speeches collected from customer services, call centers, etc. But, the problem is that in-the-wild datasets have no 'ground truth', that is, there is no way to know the intended emotion of the speaker at the time of capture which is important for ML models. This kind of dataset also involves some copyright and legal issues. For this reason, they are often unavailable for public use. Type 2 database is developed by collecting acted or simulated speeches. Trained actors are asked to deliver speeches in different emotional states for predefined texts. Most of the available emotional speech databases are acted or simulated. The problem of these types of databases is that sometimes emotions are exaggerated and fail to represent naturally experienced emotions authentically [36]. For controlled scientific experiments, balanced data recorded in a laboratory environment are needed [37] which can be achieved by type 2 database. Type 3 is an elicited emotional speech database in which emotions are induced in speakers using some context. This is not acted in the sense that speakers are provoked to show the real emotions rather than training them to express acted emotions. These kinds of speeches can be collected from some talk shows or reality shows. There are some challenges to build such a database. For example, few emotions like fear and sadness are ethically not permitted to be induced in speakers. Moreover, in some cases, speakers may deliberately hide their emotions (e.g. disgust or anger) in public places to avoid social issues. 
For type 1, speeches collected in-the-wild have no ground truth and the intensities of induced emotions in type 3 databases are too weak. For this research, clear, strong, lab-based expressions were needed so the speech was collected through acting to develop a type 2 database.

Database preparation
There were several steps involved in the development of SUBESCO. Fig 1 shows all the steps involved in the preparation of the database.

Context of the corpus
Standard Bangla has been used for preparing the text data for the emotional corpus. Initially, a list of twenty grammatically correct sentences was made, each of which can be expressed in all target emotions. Then, three linguistic experts selected 10 of these sentences for database preparation. One expert is a Professor in the Department of Computer Science and Engineering who has been involved in different Bangla NLP projects funded by the University and the Govt. of Bangladesh. The other two are Professors in the Bangla Departments of two different universities. A trial session confirmed that all the selected sentences could be pronounced neutrally. The sentences are structured and kept as simple as possible so that actors can easily memorize and express them. The text data is phonetically very rich, which is also important for any language-related research. The selected text dataset consists of 7 vowels, 23 consonants and 3 fricatives covering all 5 major groups of consonant articulation, 6 diphthongs, and 1 nasalization (◌ঁ ) of the Bangla IPA. The texts of the utterances are represented here with two types of transcription, phonemic and phonetic. For phonemic transcription, the texts have been represented using the words tier of B-ToBI (Bengali Tones and Break Indices System) [38] transcription, and for phonetic transcription the texts have been transcribed with the help of the list of Bangla IPA [39]. In both cases, the phonological model InTraSAL [40], designed by linguistics Professor Sameer Ud Dawla Khan, was followed. Phonemic and phonetic transcriptions along with English translations of all the sentences are given below:

Selection of speakers
Speaker selection is a very important task in the preparation of an emotional speech database. Research [36] has shown that the real-life emotion recognition rate is higher for non-professional speakers because professional speakers sometimes produce exaggerated emotional expressions while recording. However, to obtain a natural-like, noiseless speech database, the speech expressions of professional actors were recorded in a technical sound studio. Because professional artists are very comfortable delivering speech in front of a microphone, it was easy to instruct them on how to accomplish the task. Three experts, including the first author, supervised the auditions and voice recordings of all speakers. Two paid and trained research assistants helped throughout, from the audition of the speakers to the final preparation of the audio clips. The speaker selection and recordings were done in two phases. In the first phase, 10 professional actors (5 males, 5 females) were selected from 20 participants. A few months later, in the second phase of voice recording, another group of 10 professional speakers (5 males, 5 females) was selected from 15 participants. There were 3500 recordings made in each phase. All of the participants were native Bangla speakers involved in different theatre platforms. After a training session, in the audition round in front of the experts, they were asked to express Sentence 8 simulating all the target emotions. The experts selected the final participants based on the recognisability and naturalness of their speech. It took a couple of days to select all the participants. After selection, the speakers filled in personal information forms and signed consent papers to deliver their speech, in return for a fixed amount of reimbursement. A recording plan was scheduled according to their availability.

Technical information
The sound recording studio is furnished with acoustic walls and doors that make it anechoic, maintaining professional sound studio quality. It is partitioned into a recording room and a control room. The glass partition between the two rooms significantly attenuates the amount of sound passing through. A condenser microphone (Wharfedale Pro DM5.0s) with a suitable microphone stand (Hercules) was provided to the speaker. The average distance between the microphone and the mouths of the speakers was 6 cm. The microphone was connected to a USB audio interface (M-AUDIO M-TRACK 2×2, 2-IN/2-OUT 24/192). Two Logitech H390 USB headphones were also connected to the audio interface so that the experts could listen to the speech at the time of recording. The recording computer had an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz processor with 8.00 GB of RAM and ran 64-bit Windows 10. An HP N223v 21.5 inch TN Full-HD monitor with LED backlight was used to visualize the audio during recording. The output was recorded in 32-bit mono wav format at a sampling rate of 48 kHz. For recording and editing, the open-source software tool Audacity (version 2.3.2) was used.

Procedure and design
Audio recordings took place in an anechoic sound recording studio which was set up in the Department of Computer Science and Engineering of Shahjalal University of Science and Technology. In the recording room, the speaker was provided a seat and a dialog script which contained all the sentences numbered in sequential order. The same sentence numbering was used in file naming to recognize each recording separately. A condenser microphone was set up in front of the seat and the microphone level could be adjusted so that the speaker could easily deliver his or her speech while sitting on the seat. Professional artists are familiar with the famous Stanislavski method [14] to self-induce the desired emotions. No extra effort was needed to induce the target emotions in them. They were requested to portray the emotions to make the recordings sound like natural speeches as much as possible. Speakers were allowed to take as much time as needed to prepare themselves to express the intended emotions properly. Two trained research assistants were sitting in an adjacent control room and they could visually observe the recorded speech on a computer display. They could also hear the recordings via headphones connected to a controlling device which was also connected with the microphone of the recording room. They could give feedback to the speakers, with predefined gestures, through the transparent glass partition between the two rooms. Some erroneous audio clips were discarded instantaneously and others were saved on the computer for post-processing, if agreed by the experts. The recording was done for one speaker at a time. Each speaker took several sessions to deliver his or her speech to simulate all sentences for the target emotion categories.

Post-processing
Post-processing of the recordings was done to make all data suitable for prosodic analysis and automatic feature extraction. Every speaker needed 9-12 hours for recording and gave 700-1000 takes. From those takes, the best five tracks were selected for each sentence and emotion, giving 350 final tracks per speaker. After the recordings, backup copies of all voice clips were saved as uncompressed wav files on an external SATA hard disk to avoid unexpected loss of data. During editing, silence was removed from the audio tracks and they were cut into 4 s audio clips without trimming off any word. Peak normalization to -1 dB was applied to all the recordings to preserve natural loudness and ensure distortion-free playback. The files were renamed so that each could be identified separately. After selection, editing, and renaming, all the recordings were reviewed again. They were then stored separately in 20 folders named after the speakers.
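As an illustration of the peak normalization step, the following minimal sketch scales a clip so that its largest absolute sample sits at -1 dB relative to full scale. It assumes samples are floats in the range [-1, 1]; the function name `peak_normalize` is hypothetical, and the sketch is not a description of how Audacity implements the operation internally.

```python
def peak_normalize(samples, target_db=-1.0):
    """Scale samples so the largest absolute value sits at target_db dBFS.

    Assumes floating-point samples in [-1, 1] (an assumption of this sketch).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # A silent clip has no peak to scale, so return it unchanged.
        return list(samples)
    # Convert the target level from decibels to linear amplitude.
    target_amplitude = 10 ** (target_db / 20.0)
    gain = target_amplitude / peak
    return [s * gain for s in samples]


clip = [0.1, -0.5, 0.25]
normalized = peak_normalize(clip)
```

Because a single gain is applied to the whole clip, the relative loudness of samples within the clip is unchanged; only the overall level moves.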

Filename convention
All the speakers were assigned serial numbers. The sentences were also documented in a predefined sequential order. Serial numbers of speakers were given according to the recording schedules assigned to them. For example, speaker number M_01 indicates the male speaker whose voice was recorded first, M_02 the second male participant in the schedule, and so on. After editing, each audio file was given a separate and unique file name. The filename is a combination of numbers and text. There are eight parts in the file name, all connected by underscores, followed by the file extension. The parts are ordered as: Gender, Speaker's serial number, Speaker's name, Unit of recording, Unit number, Emotion name, and Repeating number (the last two parts). For example, the filename F_12_TITHI_S_10_SURPRISE_TAKE_5.wav refers to: female speaker (F), speaker number (12), speaker's name (TITHI), sentence-level recording (S), sentence number (10), emotional state (SURPRISE), take number (TAKE_5) and the file extension (.wav). The coding is represented in Table 2 for a better understanding.
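A filename following this convention can be split into its fields with a regular expression. The sketch below is inferred from the single example filename given in the text, so the exact pattern (for instance, which characters may appear in a speaker name) is an assumption; `ClipInfo` and `parse_clip_name` are hypothetical names.

```python
import re
from typing import NamedTuple


class ClipInfo(NamedTuple):
    gender: str        # "M" or "F"
    speaker_id: int    # speaker's serial number
    speaker_name: str
    unit: str          # unit of recording, e.g. "S" for sentence level
    sentence: int      # unit (sentence) number
    emotion: str
    take: int          # repeating number


# Pattern inferred from the example in the text; real files may need a
# looser pattern (assumption).
FILENAME_RE = re.compile(
    r"^(?P<gender>[MF])_(?P<speaker_id>\d+)_(?P<speaker_name>[A-Za-z]+)_"
    r"(?P<unit>[A-Z])_(?P<sentence>\d+)_(?P<emotion>[A-Za-z]+)_"
    r"TAKE_(?P<take>\d+)\.wav$"
)


def parse_clip_name(filename: str) -> ClipInfo:
    """Split a clip filename into its labelled parts."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    parts = match.groupdict()
    return ClipInfo(
        gender=parts["gender"],
        speaker_id=int(parts["speaker_id"]),
        speaker_name=parts["speaker_name"],
        unit=parts["unit"],
        sentence=int(parts["sentence"]),
        emotion=parts["emotion"],
        take=int(parts["take"]),
    )


info = parse_clip_name("F_12_TITHI_S_10_SURPRISE_TAKE_5.wav")
print(info.emotion, info.sentence, info.take)  # SURPRISE 10 5
```

Parsing labels from the filename in this way lets the emotion, speaker and sentence be recovered without a separate metadata file.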

Corpus evaluation
The main purpose of the corpus evaluation is to find out to what extent an untrained listener can correctly recognize the emotion of the recorded audios. A higher recognition rate indicates a higher quality of recordings. The whole task of the evaluation was done in two phases. Phase refers to a test-retest done in different periods of time to accomplish the evaluation of the corpus. Human subjects were involved to accomplish the task. There were 25 males and 25 females, with a total of 50 raters for each phase. Each rater evaluated a set of recordings in Phase 1, then after a week's break, rated the same set of recordings in Phase 2. In the first phase, the raters evaluated all seven emotions. In the second phase, the emotion Disgust was not considered while the other six emotions were considered, mainly because it was to investigate the fact that Disgust was causing a confusion with other similar emotions. The intention was not to exclude Disgust from the dataset, rather the aim was to release the complete dataset for public use. In this case, Phase 2 can be considered as a separate study related to Disgust which can facilitate the users to decide whether they should include it or not for their audiobased research as studies have found that to be more likely a visually-expressed emotion [41]. The same set of 50 evaluators participated in both phases because the goal was to find out the effect of some important factors on the recognition rate by comparing the results from both phases. All of the raters were Shahjalal University students from different disciplines and schools. All of the participants were both physically and mentally healthy and aged over 18 years at the time of the evaluation. None of them participated in the recording sessions. All of them were native Bangla speakers and they could efficiently read, write and understand the language. The raters were not trained on the recordings to avoid any bias in their perception ability. 
A management software was developed including a user interface for the overall management of the corpus evaluation task. Listeners could access the interface through Wi-Fi and wired LAN connections in the same building. For the first phase of the evaluation, 25 audio sets were prepared where each set contained pseudo-randomly selected 280 stimuli. For the second phase, those audio sets contained 240 audio files as Disgust was removed from them. Each audio set was assigned to two persons, a male rater and a female rater to ensure that each audio clip is rated twice and by both genders in each phase. Each of the raters were provided with a seat and a computer with a display in front of him or her. Prior instructions were given to them on how to accomplish the task. Before starting the evaluation process, an expert explained to all the participants the whole task, providing some sample audios from an existing emotional speech database of other languages. After starting the experiment, all the selected audios were presented on a screen one after another, in front of each rater. On the screen, each audio was displayed with a track number and all the emotion categories were listed under it. There was a submit button at the bottom of the screen. A rater could select only one emotion, for a single audio, which he thought to be the best match. An example display of the evaluation process is shown in Fig 2 for Phase 1. Similar screen was used for Phase 2 excluding the option for Disgust. If the speaker's intended emotion had matches with the rater's selected emotion it was considered as the correct answer, 'incorrect' otherwise. A correct answer had a score of 1

PLOS ONE
and an incorrect answer had a score of 0. During the evaluation, each time a rater submitted an opinion it was automatically converted to 1 or 0. The rater's selected emotion was also preserved for later analysis. After the evaluation was completed, the proportion of correctly recognized emotions to the total number of audios reviewed by the rater was calculated as the 'percent correct score', which is the 'raw hit rate'. The 'percent correct score' was calculated for each emotion per rater. Proportion statistics alone are not sufficient to represent the overall consistency and reliability of an implementation and analysis [42], and they cannot be used directly for performance analysis. Rather, they were used as input to several statistical tests of the reliability and consistency of the corpus and its evaluation. The raw hit rate does not correct for false alarms, which is why the unbiased hit rate was calculated based on Wagner's formula [43]. The formula corrects for the possibility of selecting the correct emotion category by chance or through bias due to answer habit [44].
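The two scores can be illustrated from a confusion matrix. In the sketch below, the raw hit rate of an emotion is its number of hits divided by the number of stimuli of that emotion, and the unbiased hit rate follows the usual statement of Wagner's formula: squared hits divided by the product of the number of stimuli of that emotion and the number of times that emotion was chosen as a response. The function name `hit_rates` and the 2 × 2 matrix are illustrative assumptions, not corpus data.

```python
def hit_rates(confusion):
    """Return (raw, unbiased) hit rates per category.

    confusion[i][j] is the number of stimuli with intended emotion i that
    raters labelled as emotion j (rows: stimuli, columns: responses).
    """
    n = len(confusion)
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(confusion[i][j] for i in range(n)) for j in range(n)]

    raw, unbiased = [], []
    for i in range(n):
        hits = confusion[i][i]
        # Raw hit rate: share of stimuli of emotion i that were recognized.
        raw.append(hits / row_totals[i] if row_totals[i] else 0.0)
        # Unbiased hit rate: also penalizes over-use of response i.
        denominator = row_totals[i] * col_totals[i]
        unbiased.append(hits * hits / denominator if denominator else 0.0)
    return raw, unbiased


# Toy two-emotion example (made-up counts, for illustration only).
toy = [
    [8, 2],  # emotion A stimuli: 8 labelled A, 2 labelled B
    [4, 6],  # emotion B stimuli: 4 labelled A, 6 labelled B
]
raw, unbiased = hit_rates(toy)
```

In the toy matrix, emotion A has a raw hit rate of 0.8, but because response A was also chosen for 4 of the emotion B stimuli, its unbiased hit rate drops to 64 / (10 × 12) ≈ 0.53.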

Reliability of evaluation
The whole task of the evaluation was accomplished in a separate controlled environment classroom so that the students could participate in a noise-free environment and concentrate on the audios. User login and activation deactivation sessions were created in the system for the security of the dataset and to avoid unwanted data from uninvited users. The whole system was managed by the administrator (first author) and a sub-administrator (a research assistant). Before starting, the students were instructed on how to accomplish the task and asked to register with their personal details. It is to be noted that listeners were untrained about the recordings, to avoid any influence of past experience. After registration, each participant's details were verified and his or her id activated for a certain period of time. They were able to review the selected audios intended for them after completing registration and user login. The audio tracks were shuffled before being played to the raters. The submit button was kept deactivated until the user played the full audio so that the user could not proceed to the next audio without listening to the current audio. After submitting one could not move to the previous recordings so as to eliminate the chance to compare the speeches of the same speaker. Listeners were allowed to save and exit the experiment anytime in case of any discomfort and he or she could restart from the saved session if it still remained activated by the administrator. Raters were given a break of 15 minutes after each 45 minute session to avoid any psychological stress due to the experiment. Once a listener submitted all the questions, he or she was deactivated by the administrator to make the retaking of the experiment not possible. The second phase of evaluation was done after a one week break to avoid any influence by the user's memory. 
All the raters were given at least 95% of the previous audio tracks so that the performances of the two phases could be compared point to point. Clips were reshuffled before playing to ensure that they were in an order different from that of the first phase.
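The construction of the Phase 1 audio sets (25 sets of 280 pseudo-randomly selected clips, each set rated by one male and one female rater) can be sketched as below. The management software itself is not described in detail, so the function names, the fixed seed and the rater identifiers are all hypothetical.

```python
import random


def build_audio_sets(clip_ids, n_sets=25, seed=0):
    """Shuffle clips pseudo-randomly and split them into equal-sized sets.

    The fixed seed is an assumption made here for reproducibility.
    """
    rng = random.Random(seed)
    shuffled = list(clip_ids)
    rng.shuffle(shuffled)
    set_size, leftover = divmod(len(shuffled), n_sets)
    if leftover:
        raise ValueError("clips do not divide evenly into sets")
    return [shuffled[k * set_size:(k + 1) * set_size] for k in range(n_sets)]


def assign_raters(audio_sets):
    """Pair every set with one male and one female rater (hypothetical IDs)."""
    return {
        index: (f"M_rater_{index + 1:02d}", f"F_rater_{index + 1:02d}")
        for index in range(len(audio_sets))
    }


audio_sets = build_audio_sets(range(7000))
raters = assign_raters(audio_sets)

# Each clip appears in exactly one set and each set is rated twice.
total_ratings = sum(len(s) for s in audio_sets) * 2
print(len(audio_sets), len(audio_sets[0]), total_ratings)  # 25 280 14000
```

With 7000 clips this yields 25 sets of 280 clips and 14000 ratings, matching the Phase 1 figures reported in the text.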

Phase 1 evaluation
There were 7000 audio clips in this phase. Each audio was rated twice, so the total number of ratings was 14000. The overall raw hit rate for this phase is 71% (mean SD = 8.24). Table 3 presents the raw hit rate and unbiased hit rate for each emotion. The highest recognition rate was achieved by Neutral (raw = 85%, SD = 7.6), whereas Happiness achieved the second-highest rate (raw = 77.4%, SD = 9.8). Disgust has the lowest recognition rate (raw = 59.1%, SD = 9.6). Table 4 shows that the largest confusion occurred between Anger and Disgust (18.4%), which is more than 5% of the total ratings for this phase. More than one fourth of the Anger audios were incorrectly recognized as Disgust. Disgust is also largely confused with other emotions such as Surprise (11.5%), Neutral (7.6%), and Happiness (5.7%). Feedback from the participants made it clear that they were struggling to discriminate between Disgust and other emotions. Studies show that recognizing Disgust based only on audio is not an easy task [41].
In real life, Disgust is usually accompanied by facial expressions, such as frowning and smirking, that give strong cues for recognizing it. Neutral was wrongly classified as Sadness and Disgust for more than 5% of the audios. Another major confusion occurred between Fear and Sadness. Fear was wrongly classified as Sadness in more than 19% of the cases. The majority of wrongly classified Sadness audios were recognized as Neutral or Fear. Happiness is confused with Neutral, Sadness, and Disgust to a lesser degree. Still, the results are noticeable. There was also a large confusion between Happiness and Surprise. The reason for the confusion between similar emotions is well explained in a study carried out by Posner et al. [45]. It states that a person's emotion recognition ability is associated with the person's prior experience and some internal and external cues. Another study [46] shows that overlapping cognitive schemas decrease a person's ability to discriminate between similarly valenced emotions (e.g. Happiness and Surprise). Rater and speaker gender-based mean raw accuracies are shown in Table 5. The overall unbiased hit rate for this phase is 52.2% with SD = 9.0. Unbiased hit rates are lower than raw hit rates for all emotion categories since they are scores corrected for bias; a comparison is shown in Fig 3. The rank order of the unbiased hit rates is different from that of the raw hit rates. Still, Fear and Disgust are the least recognized emotions. The change in rank order is due to the major confusions between emotions: Happiness and Neutral, Anger and Disgust, Fear and Sadness, etc. There were 7 choices for each trial, thus the chance level of the perception rate is about 14% (1/7). Both the raw and unbiased hit scores were above the chance level for all emotion categories. Table 6 shows that all the sentences achieved fair recognition rates for all target emotions. Neutral achieved the highest scores in almost all cases of the emotion perception evaluation.

Phase 2 evaluation
In the second phase, there were 6000 audio clips and the total number of ratings was 12000. After removing Disgust in this phase, interestingly, the mean recognition rate increased to 80% (SD = 8.4). Table 7 shows a sharp rise in the raw hit rate for Anger in this phase, of more than 20 percentage points (66.7% to 87.2%). Recognition rates for Neutral, Happiness, and Surprise also increased by 6.4%, 5.5%, and 6.3% respectively. Other emotions have nearly the same results as in the previous phase. This confirms the suspicion that Disgust was suppressing the overall recognition rate. However, the confusion matrix in Table 8 reveals that major confusions still persisted between Fear and Sadness (14.5%), Neutral and Sadness (8.5%), and Fear and Surprise (6%). In this phase, Neutral has the highest recognition rate (91.5%) whereas Fear has the lowest recognition rate (67.2%) (Table 7). For both phases, female raters have relatively higher recognition scores than male raters in all cases (Tables 5 and 9). Gender effects will be discussed in more detail later in this paper. The rank order for unbiased hit rates is completely different in this phase (unbiased = 68.1, SD = 9.22). Anger has the highest position (unbiased = 78.3, SD = 8.2) due to the removal of the large confusion introduced by Disgust. Sadness has the lowest position (unbiased = 53.2, SD = 10.1). The relative positions of Happiness and Neutral, and of Surprise and Fear, are still the same. Since there were six options for each rating, the chance level of responses for this phase is 17%. Recognition scores for all emotion categories were higher than 17%. Raw and unbiased hit rates are compared in Fig 4 for Phase 2. Sentence-wise recognition rates for all emotions in this phase are presented in Table 10.

Table 5. Average recognition rates for males and females in Phase 1.
Table 9. Average recognition rates for males and females in Phase 2.

Statistical analyses
For the hypothesis tests, the confidence level was set at 95%; a null hypothesis was therefore rejected when the p-value was less than the critical value of 0.05. Kappa statistics and the intra-class correlation (ICC) were computed to analyze the reliability of the rating exercises. All test statistics were rounded to the second decimal place and all probability statistics to the third. The data were analyzed using the Python programming language. Two-way ANOVA was performed as a further study after the evaluation task to investigate the variability and interaction of the main factors, Gender and Emotion. To perform ANOVA on the rater data, the raw hit rates of each rater for all emotion categories were considered for both phases. For the speaker data, the raw hit rate for each emotion category was calculated for each speaker. ANOVA was performed on these data after log transformation because the normality assumption was not satisfied. Pairwise comparisons were carried out for post-hoc analysis.

Inter-rater reliability
Fleiss' Kappa was used to investigate the inter-rater reliability of the evaluation. It is an adaptation of Cohen's Kappa for use when there are more than two raters. Fleiss' Kappa requires that each subject be distinct and that each file be rated by a fixed number of raters, but not every rater needs to evaluate all files [47]. In this database, there are 10 sentences, 7 emotions, and 20 speakers. Thus, the total number of unique utterances is 1400. If the repeated takes are considered as a single file, then each audio was validated 10 times during the evaluation. Let N = number of files, n = number of ratings for each file and k = number of categories. Then, for each phase, the calculation of the Kappa score involved an N × k matrix of rating counts, and the total number of ratings was N × n. According to the guidelines established by Landis and Koch [48], Kappa scores < 0 indicate poor agreement, 0.01-0.20 slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1 almost perfect agreement. Phase 1 inter-rater reliability obtained a mean Kappa value of 0.58, which is considered moderate agreement between the raters. The mean Kappa value for Phase 2 was 0.69, which indicates substantial agreement between the raters. It can be inferred that the difference in Kappa scores between the two phases is due to the higher recognition rates in Phase 2. Overall, the results indicate that the raters' performance was consistent across the evaluation of the audios.
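
As an illustration of the computation described above, the following is a minimal sketch of Fleiss' Kappa written in plain Python. It assumes the ratings have been arranged as a table with one row per file and one column per emotion category, where each cell counts the raters who chose that category and every row sums to the same n. The function names and the small example table are hypothetical, and this is not the code used to produce the reported values.

```python
def fleiss_kappa(counts):
    """Fleiss' Kappa for a table of rating counts.

    counts[i][j] = number of raters who assigned file i to category j.
    Every row must sum to the same number of ratings n.
    """
    n_files = len(counts)
    n_ratings = sum(counts[0])
    if any(sum(row) != n_ratings for row in counts):
        raise ValueError("each file needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of all ratings that fall in each category.
    p = [
        sum(row[j] for row in counts) / (n_files * n_ratings)
        for j in range(n_categories)
    ]
    # Observed agreement for each file, then averaged over files.
    per_file = [
        (sum(c * c for c in row) - n_ratings) / (n_ratings * (n_ratings - 1))
        for row in counts
    ]
    p_observed = sum(per_file) / n_files
    # Agreement expected by chance.
    p_expected = sum(pj * pj for pj in p)
    return (p_observed - p_expected) / (1 - p_expected)


def landis_koch(kappa):
    """Verbal label following the Landis and Koch bands quoted in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical table: 4 files, 3 categories, 10 ratings per file.
example = [
    [10, 0, 0],
    [0, 10, 0],
    [0, 0, 10],
    [8, 2, 0],
]
kappa = fleiss_kappa(example)
print(f"kappa = {kappa:.2f} ({landis_koch(kappa)} agreement)")
```

With these bands, the reported Phase 1 value of 0.58 falls in the moderate range and a value of 0.69 falls in the substantial range.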

Intra-class correlation coefficients (ICCs)
The Intra-class correlation coefficient is a reliability index for the situation in which n files are rated by k judges [49]. It reflects both the consistency and the measurement agreement between the raters [50]. Several forms of intra-class correlation are used to evaluate inter-rater, intra-rater, and test-retest reliability for different rating trials. As not all raters were provided with all the audio clips for evaluation, only ICC(1,1) and ICC(1,k) were considered in this case. ICC(1,1) is the one-way random-effects, absolute-agreement form for a single rater, and ICC(1,k) is the one-way random-effects, absolute-agreement form for the average of multiple raters. The reported value of ICC(1,1) for this analysis was 0.751, which indicates good reliability of measurements. For ICC(1,k), the obtained value of 0.99 indicates excellent reliability at the 95% confidence interval according to the guidelines suggested by Koo and Li [51]. These suggest that values < 0.40 indicate poor agreement, 0.40-0.59 fair agreement, 0.60-0.74 good agreement, and 0.75-1 excellent agreement. ICC estimates and their 95% confidence intervals were calculated using the open-source statistical package pingouin for Python version 3.6.5. The statistics are presented in Table 11. Note that the estimator is the same regardless of the presence of interaction effects.
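
The two one-way ICC forms can be written directly in terms of the between-file and within-file mean squares of a one-way ANOVA. The sketch below is a plain-Python illustration of those textbook formulas, not the pingouin routine used for the reported values; the function name and the small ratings table are hypothetical and confidence intervals are omitted.

```python
def icc_one_way(ratings):
    """Return (ICC(1,1), ICC(1,k)) for a table of ratings.

    ratings[i] holds the k scores given to file i. In the one-way model
    the k scores of a file may come from different raters.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # One-way ANOVA sums of squares: between files and within files.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    )
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_average = (ms_between - ms_within) / ms_between
    return icc_single, icc_average


# Small illustrative table: 6 files, 4 ratings each (not SUBESCO data).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc_1_1, icc_1_k = icc_one_way(ratings)
print(f"ICC(1,1) = {icc_1_1:.2f}, ICC(1,k) = {icc_1_k:.2f}")
```

The average-measure form is never smaller than the single-measure form when the latter is positive, which is why ICC(1,k) can approach 1 when each file receives many ratings.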

Normality test
Before applying any statistical test to a dataset, it is worth investigating the type of its probability distribution. The Jarque-Bera and Shapiro-Wilk tests were applied to one-way ANOVA residuals to examine the probability distributions of the data based on Rater Gender, Speaker Gender and Emotion. The null hypothesis of both tests is that the target population is normally distributed. According to the test statistics, the null hypothesis was rejected by both tests in all cases except for the Phase 1 Rater Gender data under the Jarque-Bera test (Table 12). That means the normality assumption was not satisfied for those factors. Figs 5 and 6 present the probability plots of the factors. It can be seen that the distributions are not normal: the data points are skewed and lie away from the fitted lines. Therefore, a log transformation was applied to the data to normalize them. The transformation normalized only the Phase 1 Rater Gender and Emotion data. Still, the two-way ANOVA was performed, on the assumption that it is robust against violations of normality when the sample size is large and the homogeneity assumption is satisfied.
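
To make the normality check concrete, the following sketch computes the Jarque-Bera statistic from the sample skewness and kurtosis and applies a log transformation. It is a plain-Python illustration under stated assumptions: the p-value uses the large-sample chi-square approximation with 2 degrees of freedom, the hit rates are hypothetical values, and the Shapiro-Wilk test is not reproduced here.

```python
import math


def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a sample."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    # Survival function of a chi-square distribution with 2 degrees
    # of freedom, which has the closed form exp(-x / 2).
    p_value = math.exp(-jb / 2)
    return jb, p_value


# Hypothetical raw hit rates (proportions), all strictly positive.
hit_rates = [0.55, 0.62, 0.70, 0.71, 0.74, 0.78, 0.80, 0.83, 0.90, 0.97]
log_rates = [math.log(v) for v in hit_rates]

for label, sample in [("raw", hit_rates), ("log", log_rates)]:
    jb, p = jarque_bera(sample)
    verdict = "reject normality" if p < 0.05 else "do not reject normality"
    print(f"{label}: JB = {jb:.2f}, p = {p:.3f} -> {verdict}")
```

The chi-square approximation is only reliable for large samples, so a tiny list like the one above serves to show the mechanics rather than to support a conclusion.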

Homogeneity test
There are two main assumptions of ANOVA: normality of the data and homogeneity of variances. The Levene and Bartlett tests were performed on Emotion, Speaker Gender and Rater Gender to test for the homogeneity of their variances. The analysis was carried out for the Phase 1 and Phase 2 raw hit rate data. The null hypothesis of each test states that the population variances are equal across the groups of the variable. For all of the variables the results were not statistically significant, which confirmed the homogeneity of variances across the groups of the variables. Table 13 presents the detailed statistics of the homogeneity tests.
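
The idea behind the Levene test can be sketched as a one-way ANOVA on the absolute deviations of each observation from its group centre. The plain-Python example below computes only the test statistic and its degrees of freedom; obtaining a p-value requires the F distribution, which is left to a statistics library. The function name, the choice of the median as the default centre, and the example groups are assumptions made for illustration.

```python
from statistics import mean, median


def levene_statistic(groups, center=median):
    """Return (W, (df1, df2)) for Levene's test of equal variances.

    groups is a list of samples; center is the function used to locate
    each group's centre (median here, mean in the original formulation).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of every observation from its group centre.
    deviations = []
    for g in groups:
        c = center(g)
        deviations.append([abs(v - c) for v in g])

    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total

    between = sum(
        len(d) * (m - grand_mean) ** 2 for d, m in zip(deviations, group_means)
    )
    within = sum(
        (v - m) ** 2 for d, m in zip(deviations, group_means) for v in d
    )
    w = (n_total - k) / (k - 1) * between / within
    return w, (k - 1, n_total - k)


# Hypothetical groups: same spread versus clearly different spread.
same_spread = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
different_spread = [[1, 2, 3, 4, 5], [0, 5, 10, 15, 20]]
print(levene_statistic(same_spread))
print(levene_statistic(different_spread))
```

A statistic near zero is what one would expect when the group variances are equal, which corresponds to the non-significant results described above.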

Two-way ANOVA
Two-way ANOVA was conducted on the log-transformed data to find out Gender and Emotion interaction effects. It was calculated for the rater and speaker genders separately, for both phases, at the significance level of p < .05. The dependent variable in all cases was the raw hit rate. For rater gender in Phase 1, the main effect of Gender on emotion perception was statistically significant, F(1,336) = 11.96, p < .001. Emotion also had a significant main effect on the average hit rate, F(6,336) = 58.95, p < .001. The two-way interaction between Gender and Emotion was also significant, F(6,336) = 2.59, p < .05. Table 14 shows the summary results for this phase. For Phase 2, Table 15 shows that the main effect of rater Gender was statistically significant, F(1,288) = 6.58, p < .05. The main effect of Emotion was also statistically significant, F(5,288) = 53.94, p < .001. However, the two-way interaction between Gender and Emotion was not statistically significant, F(5,288) = 0.11, p > .05. Tables 16 and 17 show the results of the two-way ANOVA between Speaker Gender and Emotion for the transformed raw hit rates. The results show that Speaker Gender does not have a statistically significant main effect on perception rates (p > .05). Emotion, however, has a statistically significant main effect on the recognition rate for both phases, with F(6,126) = 4.27, p < .001 for Phase 1 and F(5,108) = 4.10, p = .001 for Phase 2. There is no statistically significant interaction between Speaker Gender and Emotion. This indicates that there is no statistically significant evidence that male speakers performed better than female speakers for any specific emotion, or vice versa. As the data are not normally distributed, equivalent non-parametric tests were carried out to examine the effect of Gender and Emotion on the recognition rate. Similar results were found with those tests; the analysis is discussed in the Appendix.
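
The degrees of freedom quoted with the F statistics follow from the design and can be checked with simple arithmetic. The sketch below assumes one raw hit rate per rater (or speaker) per emotion, with 50 raters, 20 speakers, two genders, and seven emotions in Phase 1 and six in Phase 2, as described in the paper; the function name is hypothetical.

```python
def two_way_anova_df(levels_a, levels_b, n_observations):
    """Degrees of freedom for a two-way ANOVA with interaction.

    Returns (df_a, df_b, df_interaction, df_error).
    """
    df_a = levels_a - 1
    df_b = levels_b - 1
    df_interaction = df_a * df_b
    # One parameter is used per cell of the design.
    df_error = n_observations - levels_a * levels_b
    return df_a, df_b, df_interaction, df_error


# Factor A is Gender (2 levels), factor B is Emotion.
designs = {
    "raters, Phase 1": (2, 7, 50 * 7),
    "raters, Phase 2": (2, 6, 50 * 6),
    "speakers, Phase 1": (2, 7, 20 * 7),
    "speakers, Phase 2": (2, 6, 20 * 6),
}
for name, (genders, emotions, n_obs) in designs.items():
    print(name, two_way_anova_df(genders, emotions, n_obs))
```

For example, the rater analysis in Phase 1 has 350 observations in 14 cells, which gives the error degrees of freedom of 336 used in F(1,336) and F(6,336).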

Post-hoc analysis
ANOVA indicates whether there is a significant difference between the population means. If the F-score of a factor is statistically significant, a more detailed analysis is carried out using post-hoc tests. Such a test is used for a variable that has more than two population means [52]. Moreover, the test assumes that the homogeneity of variances assumption is satisfied and that sample sizes are equal for the variables [53]. Pairwise t-test was conducted to compare the

Discussion
The main purpose of this study is to present SUBESCO, the largest emotional audio dataset for Bangla. It is the only gender-balanced, human-validated emotional audio corpus for this language. The overall evaluation was presented using two studies, Phase 1 and Phase 2. The overall raw hit rate was 71% for Phase 1 and 80% for Phase 2. These can be compared with the perceived hit rates of other relevant audio-only datasets: Arabic database, 80% for 800 sentences [21]; EMOVO, 80% for 588 files [13]; German database, 85% for 800 sentences [17]; MES-P, 86.54% for 5376 stimuli [23]; Indonesian speech corpus, 62% for 1357 audios [24]; Montreal affective voices, 69% for 90 stimuli [54]; Portuguese dataset, 75% for 190 sentences [55]; and RAVDESS, 62.5% for 1440 audio-only speech files [17]. These results confirm that the perceptual hit rate of SUBESCO is comparable to those of existing emotional speech sets. Unbiased hit rates were also reported along with raw hit rates to address false alarms. A separate study was carried out in Phase 2 to test the effect of Disgust and to facilitate researchers' selection of Disgust stimuli for their research paradigms. Inter-rater agreement is moderate to substantial, with Kappa values of 0.58 for Phase 1 and 0.69 for Phase 2, respectively. Also, excellent intra-class correlation scores were achieved: ICC = 0.75 and 0.99 for single and average measurements, respectively. Fleiss' Kappa and ICC confirm the reliability and consistency of the built corpus. It can be said that SUBESCO was successfully created and evaluated with a fair recognition rate, which makes it a very useful resource for further study on emotional audio for Bangla. The set of stimuli can be used by researchers of other languages for prosodic analysis. It can also be used in cognitive psychology experiments related to emotion expression. Detailed analyses of the emotion categories based on recognition rates are presented, and the effects of gender on the perception of emotion were also observed. The outcomes of the experiments and statistical analyses suggest the following. Neutral has the highest recognition rate and is easy to recognize. Happiness and Sadness have relatively higher recognition rates than the other emotions except for Neutral.
Disgust has been difficult to recognize based on audio only. However, adding a facial expression (video) might help to recognize it better. Disgust is likely to be confused with Anger, as the recognition rate of Anger improved noticeably after Disgust was removed in the second phase. Surprise also achieves a better score without Disgust. Fear has a low recognition rate in both phases. Subsequent analysis of the evaluation revealed that the recognition rate is not influenced by the gender of the speakers. However, the gender of a rater is associated with the recognition rate. Sentence-wise perception rates have been presented in Tables 18 and 19. It can be seen that Neutral has achieved the highest or second highest recognition rate for all sentences, and that all other emotions have achieved fairly good perception rates for all the sentences.
There are a few limitations of this study. For example, only one mode of emotional expression has been evaluated to simulate seven emotions. It will be interesting and desirable to carry out further research considering multiple modes of emotion expression; for example, both video and audio expressions of emotion can be analyzed. During the evaluation of the corpus, only a subset of 280 files was given to each rater, whereas in the ideal case each rater would rate all the audio files and each file would be rated several times by different raters. In reality, it is not possible for a rater to rate all 7000 audio files in a single session. According to Ekman [48], compound emotions are formed by the combination of basic emotions (e.g. smugness is a combination of happiness and contempt). If an SER system is developed successfully for those basic emotions, in future it can be modified to recognize compound emotions correctly. It is important to select neutral sentences to confirm that the evaluation result is not biased by the semantic content of the sentences. Purely neutral sentences are hard to express in the target emotions. According to Burkhardt et al. [14], nonsense sentences, fantasy words and speech used in everyday life can meet this requirement. In that sense, Sentence 10 can be considered as fantasy words taken from fairy tales, and the others can be considered as normal speech taken from daily life. However, the expression of emotion in speech actually depends on context. It is also very important to ensure that these sentences can be expressed in all of the target emotions.

Conclusion
This paper presented the development and evaluation of a Bangla emotional speech corpus, SUBESCO. It is an audio-only emotional speech database containing 7000 audio files, which was evaluated by 50 validators. Several statistical methods were applied to analyze the reliability of the corpus. Good perception rates were obtained for human perception tests (up to 80%). Reliability indices also showed quite satisfactory results. Two-way ANOVA was executed to analyse the effects of Gender and Emotion. The normality and homogeneity of the data for these factors were also investigated using the Jarque-Bera, Shapiro-Wilk, Levene, and Bartlett tests. The high reliability and consistency of the evaluation task show that this corpus should be considered a valuable resource for research on emotion analysis and classification for the Bangla language.

Non-parametric tests
Emotion effects. ANOVA results showed that Emotion has a significant main effect on recognition rate. As the data do not show a normal distribution, the non-parametric Kruskal-Wallis test was conducted to investigate the distributions of the emotion categories. The null hypothesis states that all emotions have the same population distribution. The Kruskal-Wallis results in Table 20 show that there was a statistically significant difference between the average emotion recognition rates for different types of emotion. The test statistic is H(6) = 167.09, p < .001 for Phase 1, and H(5) = 153.80, p < .001 for Phase 2.
For post-hoc analysis of the Kruskal-Wallis test, non-parametric pairwise multiple comparisons were conducted using Dunn's test at the significance level of .05. Tables 21 and 22 present the pairwise Dunn's test p-values for Phase 1 and Phase 2, respectively. The test results are similar to those of the pairwise t-test, with very few exceptions. It is clear that the emotion recognition rate depends on the type of emotional state.
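
The Kruskal-Wallis statistic used above is computed from the ranks of the pooled observations. The following plain-Python sketch shows the calculation with average ranks for ties and the usual tie correction. The function name and the toy groups are hypothetical, and the p-value, which comes from the chi-square distribution with k - 1 degrees of freedom, is left to a statistics library.

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) for the Kruskal-Wallis test."""
    # Pool all observations, remembering which group each came from.
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n_total = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0

    i = 0
    while i < n_total:
        # Find the block of tied values starting at position i.
        j = i
        while j < n_total and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values share the mean of ranks i + 1 .. j.
        average_rank = (i + 1 + j) / 2
        for _, gi in pooled[i:j]:
            rank_sums[gi] += average_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j

    h = 12 / (n_total * (n_total + 1)) * sum(
        r * r / len(g) for r, g in zip(rank_sums, groups)
    ) - 3 * (n_total + 1)
    # Tie correction; equals 1 when no values are tied.
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    return h / correction, len(groups) - 1


# Toy example with three fully separated groups (not SUBESCO data).
h, df = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(f"H({df}) = {h:.2f}")
```

With seven emotion categories the test has 6 degrees of freedom and with six categories it has 5, matching the H(6) and H(5) notation used for the two phases.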
Rater's gender effect. The Mann-Whitney U test was used to compare the distributions of raters' performances based on their genders. For rater gender, as shown in Table 23, the test statistics were significant for both phases. Therefore, the null hypothesis was rejected, indicating that male raters' data have a different distribution from female raters' data. A Chi-square test of independence was carried out to investigate whether there is an association between the rater's gender and the emotion recognition rate. A significant relationship was found between the two variables, which means the average recognition rate is not independent of the rater's gender. For Phase 1, the Chi-square value is χ²(1, N = 14000) = 12.80, p < .001, and for Phase 2, it is χ²(1, N = 14000) = 11.13, p < .05. The overall Chi-square statistics for the raters' data are presented in Table 24. Likewise, the previous two-way ANOVA analyses found that the rater's gender has a significant main effect on the emotion recognition rate.
Speaker's gender effect. The total numbers of correctly and incorrectly recognized emotions were calculated for each male and female speaker for both phases. The Mann-Whitney U test was applied to the speakers' data for both phases. The test statistics shown in Table 25 were not significant in any case, which indicates that the data for both genders have the same population distribution. The Chi-square test was also applied to find the association between the speaker's gender and the overall emotion recognition rate. It yielded χ²(1, N = 14000) = 2.22, p > .05 for Phase 1, and χ²(1, N = 14000) = 0.06, p > .05 for Phase 2. Table 26 shows that the Chi-square statistics are not statistically significant for either phase. That means there is no evidence that speakers of one gender expressed more recognizable emotions than speakers of the other. The outcome matches that of the ANOVA analysis.
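
A Chi-square test of independence on a 2 × 2 table of gender against correct or incorrect recognition can be sketched as follows. This is a plain-Python illustration with hypothetical counts. It applies no continuity correction, which may differ from the procedure behind the reported values, and it uses the closed form of the chi-square tail probability for 1 degree of freedom.

```python
import math


def chi_square_2x2(table):
    """Return (chi-square statistic, p-value) for a 2 x 2 table.

    table = [[a, b], [c, d]], e.g. rows = gender and
    columns = (correctly recognized, incorrectly recognized).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in the cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of correct and incorrect ratings by rater gender.
counts = [
    [5200, 1800],  # female raters: correct, incorrect
    [5000, 2000],  # male raters: correct, incorrect
]
chi2, p = chi_square_2x2(counts)
print(f"chi-square(1, N = {sum(map(sum, counts))}) = {chi2:.2f}, p = {p:.4f}")
```

A p-value below .05 in such a table would indicate that the proportion of correct ratings is not independent of gender, which is the interpretation applied to the raters' data above.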

Supporting information
S1 Data. Phase 1 evaluation measures for all 7000 stimuli of SUBESCO. It includes file name, speaker ID, speaker gender, rater ID, rater gender, intended emotion and rated emotion for each file. (XLSX)
S2 Data. Phase 2 evaluation measures for 6000 stimuli of SUBESCO. It includes file name, speaker ID, speaker gender, rater ID, rater gender, intended emotion and rated emotion for each file. (XLSX)