Abstract
This paper introduces a method for enhancing the efficacy of speaker identification systems in challenging acoustic environments characterized by noise and reverberation. The methodology encompasses diverse feature extraction techniques, including Mel-Frequency Cepstral Coefficients (MFCCs) and discrete transforms such as the Discrete Cosine Transform (DCT), Discrete Sine Transform (DST), and Discrete Wavelet Transform (DWT). Additionally, an Artificial Neural Network (ANN) serves as the classifier for this method. Reverberation is modeled using varying-length comb filters, and its impact on pitch frequency estimation is explored via the Auto-Correlation Function (ACF). This paper also contributes to the field of cancelable speaker identification in both open and reverberation environments. The proposed method depends on comb filtering at the feature level, deliberately distorting MFCCs. This distortion, incorporated within a cancelable framework, serves to obscure speaker identities, rendering the system resilient to potential intruders. Three systems are presented in this work: a reverberation-affected speaker identification system, a system depending on cancelable features through comb filtering, and a novel cancelable speaker identification system within reverberation environments. The findings reveal that, in both scenarios with and without reverberation effects, the DWT-based features exhibit superior performance within the speaker identification system. Conversely, within the cancelable speaker identification system, the DCT-based features represent the top-performing choice.
Citation: Hassan ES, Neyazi B, Seddeq HS, Mahmoud AZ, Oshaba AS, El-Emary A, et al. (2024) Enhancing speaker identification through reverberation modeling and cancelable techniques using ANNs. PLoS ONE 19(2): e0294235. https://doi.org/10.1371/journal.pone.0294235
Editor: Viacheslav Kovtun, Institute of Theoretical and Applied Informatics Polish Academy of Sciences: Instytut Informatyki Teoretycznej i Stosowanej Polskiej Akademii Nauk, UKRAINE
Received: September 25, 2023; Accepted: October 27, 2023; Published: February 14, 2024
Copyright: © 2024 Hassan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The used data is a subset of the large Chinese Mandarin Corpus https://www.magicdatatech.com/ Please go to https://www.magicdatatech.com/datasets Then search for "Chinese Mandarin Corpus" or go directly to https://www.magicdatatech.com/datasets/tts/mdt-tts-e011-mandarin-chinese-speech-corpus-for-tts-1611045140 Prof. Emad S. Hassan (eshassan@jazanu.edu.sa) will be glad to answer any questions regarding the data mentioned in the article.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
The analysis of speech signals serves as a powerful tool for individual characterization, encompassing aspects such as identity, dialect, age, emotional state, language, gender, and even health status. Each person possesses distinct natural vocal characteristics that distinguish them. Speech has been a fundamental mode of human communication since ancient times, arising from vocal tract excitation. Physiological attributes contributing to speech differ across individuals, including variations in vocal tract size, shape, vocal fold structure, velum, and nasal cavity, especially between genders [1–3].
Speaker recognition is a signal processing technique that aims to identify individuals based on their spoken words. It encompasses two primary categories: Speaker Identification (SI) and Speaker Verification (SV). Identification involves comparing an enrolled voice with stored models to identify the best match, while verification confirms or rejects a claimed identity. SV finds applications in security contexts. Both SI and SV involve the creation of speaker models to be stored as references [4, 5]. The process of SV is also referred to as speaker authentication, wherein the system either accepts or rejects the speaker’s identity claim. If the system denies access to an enrolled speaker’s utterance, the speaker is classified as an impostor. Consequently, SV systems play a crucial role in security applications, thwarting unauthorized entry by individuals [6, 7].
Speaker identification involves recognizing speakers' identities by comparing their feature vectors with those stored in the database. For unknown speakers, the system matches their voice models with the existing database, assigning the best-fitting model as the unknown speaker's representation. This application extends to domains such as forensics and the identification of individuals involved in criminal cases within a pool of known offenders [2].
Automatic Speaker Identification (ASI) comprises two stages: feature extraction and classification. Feature extraction condenses speech signals into concise data, forming feature vectors that encapsulate distinct speaker characteristics. The speaker identification system operates in training and recognition modes. During training, features of new speakers are extracted and recorded in the database, while recognition involves extracting features for unknown speakers to determine their identities. Mel-Frequency Cepstral Coefficients (MFCCs), widely acclaimed for their robustness in representing clean speech, are the favored features [3, 6]. However, their robustness diminishes in cases of degraded speech quality.
This paper extensively investigates the impact of closed-room environments on speech signals. This impact arises from the numerous reflections occurring off the walls within such spaces. In specific settings, substantial reverberation is anticipated [8, 9]. Consequently, it is likely that the features extracted from speech signals exhibit variances in the presence of reverberation. The exploration extends to the degree of influence exerted by reverberation on cepstral features and pitch frequency, as well as its impact on the whole speaker identification process.
Over the past decade, the notion of cancelable biometrics has undergone significant development. This concept holds particular relevance for enhancing the security of biometric systems, especially those utilized in remote-access scenarios. Cancelable biometrics relies on the utilization of distorted signals or feature patterns, which are extracted to represent speakers [10]. In this paper, the concept of a cancelable speaker identification is adopted by employing a digital comb filter, analogous to the model used for simulating reverberation. It is well-established that reverberation can be effectively modeled using a comb filter. Therefore, an additional comb filter is implemented at the feature level to induce deformations within the features. Subsequently, the impact of these deformations on the speaker identification process is analyzed.
In summary, this paper advances the fields of speaker identification and cancelable biometrics, offering effective solutions for challenging acoustic conditions. The key contributions of this paper can be summarized into the following points:
- Reverberation analysis and modeling: The paper explores the analysis of speech signals in environments with reverberation caused by reflections from closed room surfaces. The reverberation is modeled using comb filters with varying lengths, offering a methodical approach to simulating and understanding its effects.
- Robust speaker identification: The paper presents a robust speaker identification system designed to operate effectively in scenarios with both reverberation and noise, leveraging MFCCs.
- Cancelable speaker identification: Addressing contemporary trends in biometric security, the paper introduces cancelable speaker identification for both open and reverberation environments. A novel technique involves applying comb filtering at the feature level, distorting MFCCs to obscure speaker identities and enhance security.
- ANN classification: The proposed cancelable speaker identification system employs ANNs for classification, achieving high recognition rates in the cancelable biometric recognition framework.
- Finally, the paper outlines three distinct systems: a reverberation-affected speaker identification system, a system depending on cancelable features obtained through comb filtering, and a novel cancelable speaker identification system tailored for challenging reverberation environments.
2. Related work
The study of speech signals in reverberation environments and the development of robust speaker identification systems have garnered significant attention in recent years. This section presents an overview of relevant research in the areas of speech signal analysis, speaker identification, and cancelable biometrics.
Understanding the effects of reverberation on speech signals is a critical aspect. Prior works have investigated various aspects of reverberation modeling and its impact on speech features. Dealing with reverberation in speech processing has been addressed through techniques like dereverberation, which aims to mitigate the adverse effects of reverberation on speaker recognition systems [11]. Methods, such as adaptive filtering and beamforming, have been employed to enhance the quality of reverberant speech [12].
Furthermore, studies have explored the modeling of reverberation using comb filtering, which is utilized to simulate room acoustics and evaluate the performance of speech processing algorithms in reverberation conditions [10]. Traditional speaker identification systems rely on extracting features from speech signals and matching them with reference models [4]. MFCCs have been a common choice for feature extraction due to their effectiveness in clean speech conditions. However, their robustness in the presence of reverberation and noise is a subject of ongoing investigation [3, 6].
Cancelable biometrics has emerged as a promising approach to enhance security in biometric systems. The concept of cancelable biometrics involves the deliberate distortion of biometric features to generate cancelable templates, ensuring that the original biometric data remains protected [10]. Research in this domain has explored various methods for generating cancelable templates, including the introduction of controlled noise, feature-level transformations, and comb filtering. Cancelable biometrics offers potential solutions to privacy concerns and security threats in biometric authentication systems.
Artificial Neural Networks (ANNs) have demonstrated remarkable capabilities in extracting intricate patterns from speech features, enabling high-accuracy speaker recognition systems [13]. The utilization of deep learning architectures, such as Convolutional Neural Networks (CNNs), has further improved the performance of speaker identification models [13]. These developments highlight the potential for ANNs to play a pivotal role in cancelable speaker identification systems. Challenges posed by reverberation environments have been addressed in the literature, with researchers proposing various strategies to enhance speaker identification performance in such conditions. These strategies include the adaptation of feature extraction methods to account for reverberation effects, the utilization of multi-microphone arrays for source separation and dereverberation, and the incorporation of robust feature selection techniques [14, 15].
The authors of [16] developed a semi-sequential two-stage system that combines generative Gaussian Mixture Model (GMM) and discriminative Support Vector Machine (SVM) classifiers with prosodic and short-term spectral features for concurrent gender and identity classification. It operates in a two-stage, semi-sequential manner. The first classifier employs prosodic features to ascertain the speaker's gender, which is then integrated with short-term spectral features as inputs into the second classifier that is used for speaker identification. This second classifier depends on two types of short-term spectral features, specifically MFCCs and Gammatone Frequency Cepstral Coefficients (GFCCs), in addition to gender information, resulting in the creation of distinct classifiers. The outputs of the different second-stage classifiers, namely the GMM-MFCC Maximum Likelihood Classifier (MLC), the GMM-GFCC MLC, and the GMM-GFCC supervector SVM, are amalgamated at the score level through the weighted Borda count approach. However, none of these prior works explored the use of discrete transforms for feature extraction in the context of speaker identification and cancelable speaker identification systems. Therefore, in this study, we address this gap by investigating the incorporation of discrete transforms into the feature extraction process. Additionally, this paper introduces a novel contribution by applying comb filtering to introduce distortion to MFCCs at the feature level. This distortion is integrated into a cancelable biometric framework, enhancing the system's ability to conceal speaker identities and bolstering its resistance to potential intruders.
3. Speaker identification process
The term "feature extraction" is often synonymous with the initial phase of speaker identification. This process plays a pivotal role in both the training and testing phases, as depicted in Fig 1. Serving as the cornerstone, feature extraction captures the paramount information for Automatic Speaker Identification (ASI). It effectively eliminates redundancy, while transforming the speech signal into a suitable format compatible with the classification model. This is achieved by discerning a series of attributes within the speaker’s utterance, referred to as features, which encapsulate the distinctive traits of each utterance. These features harbor discriminative properties tailored to individual utterances, encapsulating their intrinsic characteristics. Regarded as a data reduction step, feature extraction condenses lengthy utterances into compact data that encapsulates the core attributes of the speaker [1].
In summary, feature extraction is unequivocally the linchpin driving the success of the ASI system. Various factors can influence this process, including human-related aspects like inaccuracies in prompted phrase reading, and environmental variables like disparities in recording channels indicating the use of distinct microphones for training and testing as well as recordings conducted in noisy surroundings. Additionally, classification stands out as a pivotal phase within any speaker identification system [14–17].
The classification procedure comprises two distinct phases: training and testing. During the training phase, the extraction of distinctive features from speech samples belonging to registered speakers is imperative. This culminates in the creation of a unique pattern for each speaker that is subsequently archived in a database for later deployment in the matching process. Subsequently, in the testing or matching stage, upon the entry of an unidentified speaker into the system, features are extracted from his speech signal, and correlation is estimated between the models stored in the database and the model derived from the unknown speaker’s utterance. Based on the resulting matching score, a decision is rendered, gauging the similarity between the unknown speaker’s model and the database models. Ultimately, the model that best aligns with the unknown speaker’s model is designated as the speaker’s representative model.
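As an illustrative sketch only, the matching stage described above can be pictured as a nearest-model search over the stored database; the cosine-similarity scoring and the speaker labels here are assumptions for illustration (the systems in this paper use an ANN classifier, described later):

```python
import numpy as np

def identify(unknown, database):
    """Match an unknown speaker's feature vector against stored models
    by a correlation-style score (cosine similarity); return the
    identity of the best-matching model."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = {sid: cosine(unknown, model) for sid, model in database.items()}
    return max(scores, key=scores.get)

# Toy database: one (hypothetical) reference model per enrolled speaker
db = {"spk1": np.array([1.0, 0.0, 0.0]),
      "spk2": np.array([0.0, 1.0, 0.0])}
best = identify(np.array([0.9, 0.1, 0.0]), db)
```

The unknown vector is closest to `spk1`'s model, so that identity is returned as the decision.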
4. Feature extraction stages
Feature extraction comprises several stages intended to provide a representation consistent with the human auditory system. Several transformations are used to extract the most important information, as shown in Fig 2.
4.1 Utilization of discrete transforms
In the realm of speaker identification systems, discrete transform domains can yield more representative MFCCs. This section explores three pivotal discrete transforms: the Discrete Cosine Transform (DCT), the Discrete Sine Transform (DST), and the Discrete Wavelet Transform (DWT) [18–21], all of which hold potential for robust MFCC extraction. The following sub-sections introduce these transformation techniques and elucidate their outcomes within the scope of the ASI system.
4.1.1 Discrete Cosine Transform (DCT).
The DCT, akin to a Fourier-related transform, exclusively operates with real numbers. Its computation mirrors that of the Discrete Fourier Transform (DFT) conducted on a dataset nearly twice its length. This transform specifically suits real-valued data with even symmetry and exhibits an intriguing energy compaction trait. The significance of this property lies in the potential concentration of speech signal energy into few coefficients. In scenarios where the bulk of energy is channeled into a limited number of coefficients, a succinct set of features would aptly capture the distinct attributes of speakers [18, 19].
X(k) = α(k) Σ_{n=0}^{N−1} x(n) cos[π(2n + 1)k / (2N)],  0 ≤ k ≤ N−1  (1)

where N is the number of signal samples, 0 ≤ n ≤ N−1, and

α(k) = √(1/N) for k = 0;  α(k) = √(2/N) for 1 ≤ k ≤ N−1  (2)
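As a hedged illustration, the DCT described in this sub-section can be computed directly with NumPy; the orthonormal scaling below follows the common DCT-II convention, which is an assumption about the exact variant used:

```python
import numpy as np

def dct2(x):
    """Naive DCT-II with orthonormal scaling factors alpha(k)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    X = np.empty(N)
    for k in range(N):
        alpha = np.sqrt(1.0 / N) if k == 0 else np.sqrt(2.0 / N)
        X[k] = alpha * np.sum(x * np.cos(np.pi * (2 * n + 1) * k / (2 * N)))
    return X

x = np.array([1.0, 2.0, 3.0, 4.0])
X = dct2(x)
# Energy compaction: the low-order coefficients carry most of the energy,
# which is exactly why a short DCT-based feature set can represent a speaker.
```

With orthonormal scaling the transform is energy-preserving, so truncating to the first few coefficients discards only a small fraction of the signal energy.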
4.1.2 Discrete Sine Transform (DST).
The DST similarly aligns with the Fourier-related transform category. Corresponding to the imaginary component of the DFT conducted on a dataset nearly twice its length, the DST operates on real data, and it is distinguished by odd symmetry. This choice stems from the principle that the Fourier transform of a real and odd function results in an imaginary and odd function. Variants of the DST might also involve shifting input and/or output data by half a sample. Mathematically, for a given sequence x(n), the DST is defined as [20]:
X(k) = Σ_{n=0}^{N−1} x(n) sin[π(n + 1)(k + 1) / (N + 1)],  0 ≤ k ≤ N−1  (3)
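A minimal sketch of a DST computation follows; the DST-I variant is assumed here for illustration, since the paper does not pin down the exact type:

```python
import numpy as np

def dst1(x):
    """Naive DST-I (odd-symmetry Fourier-related transform)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.sin(np.pi * (n + 1) * (k + 1) / (N + 1)))
                     for k in range(N)])

x = np.array([1.0, 0.0, 0.0, 0.0])
X = dst1(x)
Y = dst1(X)   # DST-I is self-inverse up to the scale factor (N + 1) / 2
```

The self-inverse property (applying the transform twice returns (N+1)/2 times the input) is a convenient correctness check for any hand-rolled implementation.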
4.1.3 Discrete Wavelet Transform (DWT).
Wavelet transform, as a mathematical procedure, facilitates the partitioning of an audio signal into different sub-bands of varying scales, enabling the independent study of each scale. The DWT is built on the principle of segregating a signal into two key components of low-frequency (approximation) and high-frequency (details) natures, respectively. This involves subjecting the speech signal to a low-pass filter yielding the approximation signal, and a high-pass filter producing the detail signal. Both of these resulting signals hold potential for modeling the characteristics of the speech signal. A graphical depiction of the wavelet transform is given in Fig 3 [21, 22].
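A one-level DWT decomposition can be sketched with the Haar wavelet, the simplest choice of low-pass/high-pass filter pair; the actual mother wavelet used in an ASI system may differ:

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar DWT: returns the low-frequency approximation
    and the high-frequency detail sub-bands."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # low-pass branch
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # high-pass branch
    return approx, detail

x = np.array([4.0, 6.0, 10.0, 12.0])
cA, cD = haar_dwt(x)
# Both sub-bands together preserve the signal energy, so either (or both)
# can be fed to the MFCC stage as a model of the speech characteristics.
```

Repeating the decomposition on the approximation band yields the multi-scale analysis depicted in Fig 3.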
4.2 MFCCs
Human speech encapsulates a plethora of speaker-specific attributes, highly valued as discriminative attributes that can be exploited in the recognition process. Among the most prominent low-level features, MFCCs stand out. The generation of speech is characterized by a filter model that represents the vocal tract through its impulse response h(n) and an input source e(n). This process is illustrated in Eq (6),
s(n) = e(n) * h(n)  (6)
where s(n) signifies the speech signal formed by convolving e(n) and h(n) within the temporal domain [23].
In the process of speech production, a substantial volume of data is generated. While a portion of this data embodies crucial speaker-specific attributes, a significant portion is deemed superfluous. The fundamental objective of feature extraction revolves around minimizing data size, while preserving solely the speaker-discriminative information. Within this context, the vocal tract is responsible for the spectral envelope, governing low spectral variations, whereas the excitation source governs spectral nuances, entailing high spectral variations [24]. In an ASI, the spectral envelope has a paramount significance over the details, as it holds the most distinguishing features. Consequently, the isolation of the spectral envelope from the details has a pivotal importance. This separation between the vocal tract and the excitation source is effectively accomplished through cepstrum evaluation [24].
Taking the FFT of Eq (6), the convolution in time becomes a multiplication in frequency:

S(ω) = E(ω) · H(ω)  (8)
The logarithm maps the multiplication into addition as follows [24]:
log|S(ω)| = log|E(ω)| + log|H(ω)|  (9)
By translating multiplication into addition, a seamless separation of E(ω) from H(ω) is facilitated, especially post IFFT application, where the operation is executed on individual terms. This action yields what is known as the cepstrum domain. In this domain, frequency maps to quefrency. E(ω), the excitation spectrum, corresponds to high spectral variations (details) predominantly found in high quefrency, while H(ω), the vocal tract, accounts for low spectral variations (envelope) present at low quefrency. Evidently, research has validated the information-rich nature of the speech spectrum envelope compared to its details [25].
Within this context, MFCCs emerge as the preferred choice due to their superior alignment with the human auditory system response [25]. This alignment is achieved through the Mel-scale, which takes into consideration the frequency bands of the auditory system. Human auditory system does not perceive frequencies beyond 1 kHz linearly; instead, it adheres to a logarithmic scale above this threshold while maintaining linearity below. To bridge this, the MFCCs method employs two kinds of filters: linear-spaced filters below 1 kHz and logarithmic-spaced filters above 1 kHz [26–28]. Computation of MFCCs centers on short-term analysis, following a standardized procedure. It entails the initial framing and windowing of speech signals, followed by FFT computation. The resultant spectrum is then transformed into the Mel scale [27]. Subsequent steps involve applying the logarithm to the scaled spectrum and performing the DCT, as outlined in Fig 4.
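The MFCC procedure outlined above (framing, windowing, FFT, Mel-scale filterbank, logarithm, DCT) can be sketched as follows. The sampling rate, frame length, filter count, and non-overlapping framing are simplifying assumptions; real systems use overlapping frames:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs=8000, frame_len=256, n_filters=20, n_ceps=13):
    """Minimal MFCC sketch: frame -> window -> |FFT| -> Mel filterbank -> log -> DCT."""
    # 1. Framing (no overlap, for brevity) and Hamming windowing
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    frames = frames * np.hamming(frame_len)
    # 2. Magnitude spectrum
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 3. Triangular filterbank spaced uniformly on the Mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((frame_len + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, spec.shape[1]))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    energies = np.log(spec @ fbank.T + 1e-10)
    # 4. DCT of the log filterbank energies -> cepstral coefficients
    k = np.arange(n_filters)
    dct_mat = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return energies @ dct_mat.T

fs = 8000
t = np.arange(fs) / fs
sig = np.sin(2 * np.pi * 440 * t)   # synthetic 1 s tone as a stand-in for speech
feats = mfcc(sig, fs)               # one 13-coefficient vector per frame
```

Each row of `feats` is a compact per-frame description of the spectral envelope, which is what the classifier consumes.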
4.3 Polynomial coefficients
The attained MFCCs, in themselves, prove insufficient for comprehensive information extraction. Thus, the integration of polynomial coefficients with them serves to bolster the system resilience against discrepancies encountered during the matching process. It is through these polynomial coefficients–encompassing attributes like curvature, mean, and slope–that the core insights are gleaned from the cepstral coefficients. Remarkably, the temporal profiles of specific cepstral coefficient sets consistently demonstrate analogous behaviors in both training and testing, despite variations in coefficient amplitudes across these stages. This underscores the constancy in the temporal forms of selected cepstral coefficients from training to testing [29].
Extending the cepstral coefficients’ scope involves employing orthogonal polynomial-based time waveform modeling, which, in turn, enables the calculation of polynomial coefficients. The embodiment of these orthogonal polynomials assumes the following mathematical expressions:
P₁(t) = t  (10)

P₂(t) = t² − M(M + 1)/3  (11)

where the window spans t ∈ {−M, …, M}, with M = 4 for the nine-element window used here.
The modeling of MFCCs is formed using a nine-element window for each MFCC. The polynomial coefficients are given by:
a_j(t) = [Σ_{k=−4}^{4} k · c_j(t + k)] / [Σ_{k=−4}^{4} k²]  (12)

b_j(t) = [Σ_{k=−4}^{4} P₂(k) · c_j(t + k)] / [Σ_{k=−4}^{4} P₂(k)²]  (13)
Here, aj(t) pertains to the slope, while bj(t) represents the curvature within the MFCCs time functions. The resultant feature vector encompasses aj(t), bj(t), and cj(t) representing the MFCCs.
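A hedged sketch of the slope coefficients a_j(t) computed over the nine-element window described above; this is the standard least-squares line fit to each cepstral trajectory, and the exact normalization and edge handling in the paper may differ:

```python
import numpy as np

def slope_coeffs(ceps, M=4):
    """First-order (slope) polynomial coefficients over a 2M+1 window:
    a least-squares line fit to each MFCC time trajectory.
    ceps has shape (frames, coefficients)."""
    T, D = ceps.shape
    k = np.arange(-M, M + 1)
    denom = np.sum(k ** 2)                         # = 60 for M = 4
    padded = np.pad(ceps, ((M, M), (0, 0)), mode='edge')
    a = np.zeros_like(ceps)
    for t in range(T):
        window = padded[t:t + 2 * M + 1]           # shape (9, D)
        a[t] = k @ window / denom
    return a

# Linearly rising trajectories: the fitted slope should equal the frame step
ceps = np.linspace(0, 1, 20)[:, None] * np.ones((1, 13))
a = slope_coeffs(ceps)
```

The curvature coefficients b_j(t) follow the same pattern with the second-order orthogonal polynomial replacing k in the numerator.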
Consequently, the extraction of features involves seven distinct methodologies encompassing:
- Features sourced from the speech signals.
- Features derived from the DWTs of the speech signals.
- Features obtained from both the speech signals and their associated DWTs.
- Features derived from the DCTs of the speech signals.
- Features derived from both the speech signals and their associated DCTs.
- Features originating from the DSTs of the speech signals.
- Features obtained from both the speech signals and their associated DSTs.
This technique is embraced during the testing phase to emulate the performance of the human auditory system when handling degraded speech. The evaluation of the ASI system performance is gauged through recognition rates stemming from different signal transforms. The recognition rate is expressed as follows:
Recognition rate (%) = (Number of correctly identified speakers / Total number of test utterances) × 100  (14)
Speaker-specific information contained within speech signals can be categorized into two distinct types: low-level information, delineated by the anatomical structure of the vocal tract; and high-level information, defined by learned behavioral habits and styles. Remarkably, the human brain possesses the capacity to distinguish individuals based on these high-level attributes, encompassing prosody, linguistic nuances, phonetic distinctions, emotional cues, language preferences, dialect, and lexical choices. When encountering an unfamiliar voice, a human can often identify the speaker by analyzing these attributes.
In contrast, the ASI system, a machine learning entity, processes speech information using low-level features rooted in physical traits like the larynx and vocal tract. These features represent distinct speech and speaker-dependent vocal tract configurations. Given that variations in the shape and size of the vocal tract and laryngeal tract result in speaker-specific information embedded in the speech signals, constructing a speaker identification system founded solely on behavioral traits becomes unfeasible. Hence, an ASI system founded upon low-level features stands as a more practical tool.
5. Classification process
The process of identification unfolds in a two-fold manner: encompassing speaker training (modeling) and speaker matching stages. During the training or modeling phase, an individual model is constructed for each speaker based on features extracted from his spoken utterances, and subsequently stored within a database. In the subsequent matching stage, when an unidentified speaker provides utterances, akin features to those garnered during training are extracted from the provided speech segment. Subsequently, the generated model is juxtaposed against models housed within the database, facilitating the identification of the best-matched model for the unknown speaker, thereby informing the ultimate decision.
Different classifiers can be used in this identification process, including Gaussian Mixture Models (GMMs), Hidden Markov Models (HMMs), Vector Quantization (VQ), Support Vector Machines (SVMs), and Artificial Neural Networks (ANNs). Within this context, the employment of ANNs is prominent [30, 31].
5.1 Artificial Neural Network (ANN) classifier
ANNs serve as simulation models for the human brain functions, emulating the brain capacity to perform complex tasks by processing data in a manner akin to human cognition [29, 30]. Structured with an assembly of numerous simple processing units known as neurons, ANNs are interlinked through connections denoted as weights. This arrangement follows an organizational framework comprising an input layer, potentially multiple hidden layers, and an output layer. Each layer is composed of cells, with these cells interconnected by weights that facilitate the flow of information from input through hidden layers to the output layer.
Training ANNs hinges on weight adjustments between neurons. The learning process can take the form of supervised learning, in which the network is presented with an input and the corresponding desired output. Alternatively, unsupervised learning, also termed self-organized learning, necessitates input alone, prompting the network to independently adapt based on the input data. Reinforcement learning is yet another approach where the network fine-tunes its weights in response to input data until the accurate output is achieved.
5.2 ANN computations
Upon introducing the input pattern to the input neurons, the activations of all neurons are computed. The learning process involves adjusting the weight strengths until the network effectively learns to compute a specific function mapping input to output, or autonomously classify input data. This unidirectional flow from input to output is known as feed-forward propagation, with the network devoid of feedback. Conversely, in feedback propagation networks, output-to-input feedback is present.
Each neuron update follows a two-step process: first, computation of the net input for the neuron is executed; subsequently, the activation output is calculated based on this net input. If we denote an m-element vector as x = [x1, x2, x3, …., xm], it serves as the input to the neuron. Through multiplication by weights w11, w12, w13, ….., w1m, the net input to the activation function v is generated, as depicted in Fig 5 [31]:

v_j = Σ_{i=1}^{m} w_ji · x_i + b_k  (15)

y = f(v_j)  (16)

Here, xi denotes input data, bk represents the bias, and wji signifies the weight originating from unit i to j. Subsequently, the net input is employed as the argument for the activation function. Upon computing the net input, the activation output is determined through a function dependent on vj. Additionally, within this context, f denotes the activation function, y stands for the neuron output, and b serves as the bias contributing to the refined transformation of the output vj.
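The two-step neuron update can be sketched as follows; the tanh activation is an assumption for illustration, as the paper does not specify one:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    """Single-neuron computation: net input v = w . x + b, output y = f(v)."""
    v = np.dot(w, x) + b    # step 1: net input
    return f(v)             # step 2: activation output

x = np.array([0.5, -1.0, 2.0])   # input pattern
w = np.array([0.1, 0.4, 0.3])    # weights from the input units
b = 0.05                          # bias
y = neuron(x, w, b)
```

A feed-forward layer is just this computation applied with a weight matrix instead of a vector; training adjusts `w` and `b` until the network maps inputs to the desired outputs.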
6. Speech quality measurements
The clarity of speech hinges on the quality of both hearing and comprehending the spoken words, encompassing the accurate perception of verbal content. In numerous speech processing contexts, enhancing speech quality involves gauging the improvement in a specific portion of speech. This assessment is facilitated through speech quality metrics that fall into two primary categories: subjective and objective evaluations.
Subjective quality metrics are rooted in the perspective of listeners, who engage in a comparison between the original speech and the processed version. Consequently, speech quality is ascertained based on listeners’ perception, and a comprehensive evaluation emerges from the aggregation of results across multiple listeners. Contrarily, objective speech quality metrics depend on quantifiable measurements.
Objective metrics for speech quality are deduced from both the unaltered and impaired speech signals, employing mathematical formulations. These metrics offer efficiency and expedience, given their independence of listener involvement. Noteworthy objective speech quality metrics encompass Signal-to-Noise Ratio (SNR) and segmental Signal-to-Noise Ratio (SNRseg) [32].
6.1 Signal-to-Noise Ratio
The SNR, which stands as the oldest and extensively employed objective metric, is characterized by the following equation:
SNR = 10 · log₁₀ [Σ_i x²(i) / Σ_i (x(i) − y(i))²]  (18)
In this equation, x(i) denotes the original speech, y(i) represents the impaired speech, and i corresponds to the sample index. Calculating the SNR involves straightforward mathematical steps, yet it necessitates access to both pristine and corrupted speech samples.
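A direct implementation of this SNR definition:

```python
import numpy as np

def snr_db(x, y):
    """SNR per Eq (18): original-signal power over the power of the
    difference between original and impaired signals."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - y) ** 2))

x = np.array([1.0, 1.0, 1.0, 1.0])
y = x + 0.1              # distortion amplitude one tenth of the signal
snr = snr_db(x, y)       # ≈ 20 dB, since the power ratio is 100
```

Note that the measure requires sample-aligned access to both the pristine and the corrupted signal, as stated above.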
6.2 Segmental SNR
The SNRseg computes the SNR over short frames and then averages the results:

SNRseg = (10/M) Σ_{m=0}^{M−1} log₁₀ [Σ_{i=Nm}^{Nm+N−1} x²(i) / Σ_{i=Nm}^{Nm+N−1} (x(i) − y(i))²]  (19)

In this context, N signifies the frame length, typically falling within the range of 15 to 20 ms, M denotes the count of frames within the speech signal, x(i) pertains to the original speech, and y(i) stands for the altered speech [33].
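A sketch of segmental SNR, assuming a 20 ms frame at 8 kHz (160 samples); the small constants guard against division by zero in silent frames:

```python
import numpy as np

def snrseg_db(x, y, frame_len=160):
    """Segmental SNR: the mean of per-frame SNRs over M full frames."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    M = len(x) // frame_len
    vals = []
    for m in range(M):
        s = slice(m * frame_len, (m + 1) * frame_len)
        num = np.sum(x[s] ** 2)
        den = np.sum((x[s] - y[s]) ** 2)
        vals.append(10.0 * np.log10(num / (den + 1e-12) + 1e-12))
    return float(np.mean(vals))

x = np.tile([1.0, -1.0], 400)     # 800-sample alternating test signal
y = x + 0.1                       # constant additive distortion
seg = snrseg_db(x, y)             # every frame has a power ratio of 100
```

Because each frame is weighted equally, SNRseg penalizes localized distortions that a global SNR would average away.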
7. Proposed systems
Three systems are presented in this section: a reverberation-affected speaker identification system, a system depending on cancelable features obtained through comb filtering, and a novel cancelable speaker identification system within reverberation environments.
7.1 Conventional speaker identification
In this sub-section, the conventional speaker identification system is presented as a benchmark, in which the following steps are performed, as shown in Fig 6.
- Feature extraction from the voice signals for training. Then, the model created using the neural network is saved in the database (Training mode).
- Feature extraction from the unknown speaker voice signal. Then, matching with all speaker models in the database is performed for identification (Testing mode).
7.2 Proposed speaker identification system in the presence of reverberation
In this sub-section, we present a speaker identification system in the presence of reverberation, in which the following steps are performed, as shown in Fig 7.
- Feature extraction from the voice signals for training. Then, the models created using the neural network are saved in the database (Training mode).
- Feature extraction from the reverberant speech signals (unknown speaker voices passed through the comb filter). Then, matching is performed with all speaker models in the database, and the identification decision is made (Testing mode).
7.2.1 Reverberation modeling.
The reverberation can be modeled with a comb filter applied to the original speech signal. It is, in fact, a multi-band filter represented as [8]:

H(z) = 1 + z^(−L)    (20)
The discrete-time representation of this equation is given by:

y(n) = x(n) + x(n − L)    (21)
where L is the filter length, which is proportional to the reverberation time. Both magnitude and phase responses of the comb filter of order 8 are given in Fig 8.
The reverberant signal is obtained by convolving the speech with the filter impulse response:

y(n) = x(n) * h(n) = Σ_{k=0}^{L} h(k) x(n − k)    (22)
where x(n) refers to the input speech signal, h(n) denotes the impulse response of the comb filter shown in Fig 9, and y(n) is the reverberant output.
7.3 Proposed cancelable speaker identification system
In this sub-section, we present a speaker identification system using cancelable features with a comb filter as a distortion tool. In this case, both training and testing are performed with the comb filter effect as a tool for inducing distortion, as shown in Fig 10.
7.4 Proposed cancelable speaker identification system on the feature level in the presence of reverberation
In this sub-section, we present a cancelable speaker identification system on the feature level in the presence of reverberation. The intended degradation is induced with a comb filter model on the feature level in both training and testing modes as shown in Fig 11.
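A minimal sketch of the feature-level distortion is given below, assuming the comb filter acts along the temporal trajectory of each MFCC; the delay L and gain g are illustrative parameters, not the paper's settings:

```python
import numpy as np

def distort_mfcc(mfcc, L=3, g=0.5):
    """Comb-filter each coefficient trajectory: c'(n) = c(n) + g * c(n - L).

    mfcc: array of shape (n_frames, n_coeffs). The first L frames pass
    through unchanged because no delayed frame exists yet.
    """
    mfcc = np.asarray(mfcc, dtype=float)
    out = mfcc.copy()
    out[L:] += g * mfcc[:-L]
    return out
```

Because the same deterministic distortion is applied in both training and testing, matching still succeeds for legitimate users, while the stored templates no longer reveal the original MFCCs to an intruder.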
8. Simulation results and discussion
8.1 Speech database
Initially, a database was assembled, comprising recordings for 15 distinct speakers. Each speaker was tasked with repeating a specific Arabic sentence a total of 10 times. During the training phase, a total of 150 speech samples were employed to derive Mel-Frequency Cepstral Coefficients (MFCCs) and polynomial coefficients, which were subsequently utilized to construct the feature vectors for the database.
In the testing phase, each of the aforementioned speakers was prompted to recite the designated sentence once more, after which their speech signals underwent a degradation process. From these degraded speech signals, comparable features to those utilized during training were extracted. These features were then employed for the matching process.
The features consist of 13 MFCCs and 26 polynomial coefficients, collectively composing feature vectors comprising 39 coefficients for every frame within the speech signal. The speech signals have a sampling frequency of 18,000 samples per second. The speech database is summarized in Table 1.
This paper delves into the analysis of speech signals in environments marked by indoor noise, such as home noise. The noise arises from interference from another speaker or from the surrounding environment, and it is modeled as Additive White Gaussian Noise (AWGN). In this work, when the speech signal is corrupted with noise, it is processed using the considered transforms: the DCT, DST, and DWT.
Various simulation experiments have been executed to rigorously test the proposed systems for speaker identification and cancelable speaker identification. The assessment encompassed diverse feature extraction schemes, including:
- Features derived directly from speech signals.
- Features extracted from the DWTs of speech signals.
- Features obtained from both speech signals and their corresponding DWTs.
- Features derived from the DCTs of speech signals.
- Features obtained from both speech signals and their corresponding DCTs.
- Features originating from the DSTs of speech signals.
- Features derived from both speech signals and their corresponding DSTs.
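To make the transform-domain schemes concrete, the DCT and a single-level Haar DWT can be written in a few NumPy-only lines (these are illustrative reference implementations; the DST and deeper wavelet decompositions follow the same pattern):

```python
import numpy as np

def dct_ii(x):
    """Unnormalized DCT-II: X(k) = sum_n x(n) * cos(pi*k*(2n+1) / (2N))."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    k = n[:, None]
    return (x[None, :] * np.cos(np.pi * k * (2 * n + 1) / (2 * N))).sum(axis=1)

def haar_dwt(x):
    """Single-level Haar DWT: approximation and detail sub-bands."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail
```

In the schemes above, the MFCC/polynomial features would then be extracted from the transformed signal (or its sub-bands), either alone or concatenated with the features of the time-domain signal.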
Table 2 presents the number of epochs required for training the neural networks for the different feature extraction schemes. The representation of recognition rate versus SNR is visually depicted in Figs 12 to 15 and substantiated with data presented in Tables 3 to 6.
Figs 12 and 13 illustrate how the recognition rate of the speaker identification system changes with SNR for various feature extraction techniques, excluding and including the impact of reverberation, respectively. The obtained results are compared with the results presented in [16]. Two different approaches are used for comparison, based on GMM-GFCC MLC and GMM-GFCC supervector SVM.
According to the obtained results, it is evident that the performance of all schemes improves as the SNR increases. Furthermore, the scheme based on the wavelet domain consistently delivers the most robust performance. This superiority can be attributed to the innate ability of the wavelet transform to decompose signals into sub-bands, enhancing the system's ability to capture essential features. It is also clear that the proposed method outperforms the other approaches [16], especially at low SNRs.
Figs 14 and 15 present the variation of the recognition rate of the cancelable speaker identification system with SNR for different feature extraction techniques, without and with the reverberation effect, respectively.
Conversely, within the realm of cancelable speaker identification systems presented in Figs 14 and 15, our findings underscore that DCT-based features outshine others in terms of performance. This can be attributed to the remarkable resilience of few selected DCT coefficients to the distortions introduced by the comb filter. This resilience is a result of the energy compaction property intrinsic to DCT. It is evident that the proposed method consistently outperforms the other approaches [16], particularly under low SNR conditions.
As indicated by Table 1, the results were obtained considering a reverberation time (TR) of 0.5 s. The effects of changing the reverberation time can be described as follows: longer reverberation times can degrade speech quality and make it more challenging to recognize speakers accurately. The increased presence of reflections and echoes introduces additional acoustic variability, leading to a decrease in Recognition Rate (RR). Shorter reverberation times, on the other hand, indicate less reflection and echo in the environment. This leads to cleaner speech signals, making it easier for speaker recognition systems to operate with higher accuracy and, thus, potentially resulting in improved recognition rates.
Tables 3–6 summarize the results presented in Figs 12–15, respectively. The results highlight that all systems exhibit improved performance as the SNR increases. Wavelet-domain features consistently outperform other features in speaker identification systems, regardless of the presence of reverberation, owing to their sub-band decomposition capability (Tables 3 and 4). In contrast, in cancelable speaker identification systems (Tables 5 and 6), DCT-based features enhance performance due to the exceptional resilience of specific DCT coefficients to distortions induced by the comb filter, a trait attributed to the DCT's inherent energy compaction property.
9. Conclusion
This paper has shed valuable light on the performance dynamics of various speaker identification systems, notably in the presence of challenging acoustic factors such as reverberation and noise. It is evident that the SNR plays a pivotal role in influencing the performance of these systems, with higher SNR levels consistently yielding enhanced results. Specifically, our analysis reveals that, within the realm of speaker identification systems, both in the absence and presence of reverberation effects, wavelet-domain features emerge as the top-performing choice. This superiority can be attributed to the inherent sub-band decomposition capabilities offered by the wavelet transform. The decomposition into different frequency scales enables a more robust representation of speech features, making it particularly resilient in challenging acoustic environments. In contrast, for the cancelable speaker identification system, our findings demonstrate that features based on DCT deliver the most favorable performance. This can be attributed to the remarkable ability of a select few DCT coefficients to withstand the distortions introduced by the comb filter, thanks to the energy compaction property inherent to the DCT.
10. Future work
Future work can focus on further refinement of cancelable speaker identification techniques, potentially exploring advanced signal processing methods and expanding the scope to address emerging challenges in biometric security. The effects of outdoor noise, such as car and street noise, can also be studied. Additionally, investigating the adaptability of these systems to real-world scenarios holds promise for continued advancements in the field. Furthermore, exploring the integration of cutting-edge technologies, particularly deep learning, offers a promising avenue for further advancements. Deep learning models, with their capacity for feature extraction and pattern recognition, can potentially revolutionize speaker identification systems, enhancing accuracy and robustness.
Moreover, as the landscape of biometric security evolves, future work should address emerging challenges, such as adversarial attacks and multimodal authentication, to ensure comprehensive protection against evolving threats. Collaborative research efforts and interdisciplinary approaches could also unlock novel avenues, encompassing fields like acoustic forensics and human-computer interaction.
Acknowledgments
The authors extend their appreciation to the Deputyship for Research & Innovation, Ministry of Education in Saudi Arabia for funding this research work through the project number ISP23-56.
References
- 1. Pentapati H. K. and S. K, "Dilated Convolution and MelSpectrum for Speaker Identification using Simple Deep Network," 2022 8th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 2022, pp. 1169–1173,
- 2. Loina L., "Speaker Identification Using Small Artificial Neural Network on Small Dataset," 2022 International Conference on Smart Systems and Technologies (SST), Osijek, Croatia, 2022, pp. 141–145,
- 3. Mu X. and Min C. -H, "MFCC as Features for Speaker Classification using Machine Learning," 2023 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 2023, pp. 0566–0570,
- 4. Bader M., Shahin I., Ahmed A. and Werghi N., "Hybrid CNN-LSTM Speaker Identification Framework for Evaluating the Impact of Face Masks," 2022 International Conference on Electrical and Computing Technologies and Applications (ICECTA), Ras Al Khaimah, United Arab Emirates, 2022, pp. 118–121,
- 5. Das A., Roy L. P. and Kumar Das S., "Effectiveness of Feature Collaboration in Speaker Identification for Voice Biometrics," 2023 International Conference on Computer, Electrical & Communication Engineering (ICCECE), Kolkata, India, 2023, pp. 1–4,
- 6. Hasan Abdulqader A., AbdulRahman Al-Haddad S., Abdo S., Abdulghani A. and Natarajan S., "Hybrid Feature Extraction MFCC and Feature Selection CNN for Speaker Identification Using CNN: A Comparative Study," 2022 2nd International Conference on Emerging Smart Technologies and Applications (eSmarTA), Ibb, Yemen, 2022, pp. 1–6,
- 7. Prachi N. N., Nahiyan F. M., Habibullah M. and Khan R., "Deep Learning Based Speaker Recognition System with CNN and LSTM Techniques," 2022 Interdisciplinary Research in Technology and Management (IRTM), Kolkata, India, 2022, pp. 1–6,
- 8. Chen C., Sun W., Harwath D. and Grauman K., "Learning Audio-Visual Dereverberation," ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1–5,
- 9. Reddy Gade V. S. and Sumathi M., "Hybrid Deep Convolutional Neural Network based Speaker Recognition for Noisy Speech Environments," 2023 2nd International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, 2023, pp. 920–926,
- 10. Kareem M., Saleeb A., El-Dolil S. M., El-Fishawy A., Abd El-Samie F. E. and Dessouky M. I., "Efficient Comb-based Filter for Cancelable Speaker Identification System," 2021 International Conference on Electronic Engineering (ICEEM), Menouf, Egypt, 2021, pp. 1–7,
- 11. Sawada H., Ikeshita R., Kinoshita K. and Nakatani T., "Multi-frame Full-rank Spatial Covariance Analysis for Underdetermined Blind Source Separation and Dereverberation," in IEEE/ACM Transactions on Audio, Speech, and Language Processing, https://doi.org/10.1109/TASLP.2023.3313446
- 12. Cuji D. A., Li Z. and Stojanovic M., "Joint Beamforming and Tracking for Multi-user Acoustic Communications," OCEANS 2023—Limerick, Limerick, Ireland, 2023, pp. 1–4,
- 13. Neyazi B., Mahmoud A. Z., Seddeq H. S., Abd El-Samie F. I., et al., "Text-dependent and text-independent speaker recognition of reverberant speech based on CNN," International Journal of Speech Technology, pp. 1–15, 2021.
- 14. Yan J., Li Q. and Duan S., "A Simplified Current Feature Extraction and Deployment Method for DC Series Arc Fault Detection," in IEEE Transactions on Industrial Electronics, vol. 71, no. 1, pp. 625–634, Jan. 2024,
- 15. Pan L., He C. and Chang T., "External-Attentive Statistics Pooling for Text-Independent Speaker Verification," 2023 IEEE 3rd International Conference on Computer Communication and Artificial Intelligence (CCAI), Taiyuan, China, 2023, pp. 301–305,
- 16. Al-Qaderi M, Lahamer E, Rad A. A Two-Level Speaker Identification System via Fusion of Heterogeneous Classifiers and Complementary Feature Cooperation. Sensors (Basel). 2021 Jul 28;21(15):5097. pmid:34372334; PMCID: PMC8347650.
- 17. Zeidan D. E. B., Noun A., Nassereddine M., Charara J. and Chkeir A., "Feature Extraction And Machine Learning Classifiers For Elderly Speech Recognition In Comprehensive Geriatric Assessment Cga Questionnaires," 2023 5th International Conference on Bio-engineering for Smart Technologies (BioSMART), Paris, France, 2023, pp. 1–4,
- 18. Ito I., "Convolution Using Discrete Cosine Transforms for Improving Performance of Convolutional Neural Networks," 2022 30th European Signal Processing Conference (EUSIPCO), Belgrade, Serbia, 2022, pp. 1462–1466,
- 19. Popov O. B., Chernysheva T. V., Orlov K. V. and Sapronov P. S., "Algorithm for the Complex Discrete Cosine Transform," 2022 Intelligent Technologies and Electronic Devices in Vehicle and Road Transport Complex (TIRVED), Moscow, Russian Federation, 2022, pp. 1–7,
- 20. Kober V., "Fast Hopping Discrete Sine Transform," in IEEE Access, vol. 9, pp. 94293–94298, 2021,
- 21. Rana M. S., Hasan M. M. and Sinha Shuva S. K., "Digital Watermarking Image Using Discrete Wavelet Transform and Discrete Cosine Transform with Noise Identification," 2022 2nd International Conference on Intelligent Technologies (CONIT), Hubli, India, 2022, pp. 1–4,
- 22. Odarchenko R., Lavrynenko O., Bakhtiiarov D., Dorozhynskyi S. and Zharova V. A. O., "Empirical Wavelet Transform in Speech Signal Compression Problems," 2021 IEEE 8th International Conference on Problems of Infocommunications, Science and Technology (PIC S&T), Kharkiv, Ukraine, 2021, pp. 599–602,
- 23. Pal R., "Speech Compression with Wavelet Transform and Huffman Coding," 2021 International Conference on Communication information and Computing Technology (ICCICT), Mumbai, India, 2021, pp. 1–4,
- 24. Patil H. A., "Combining Evidences from Variable Teager Energy Source and Mel Cepstral Features for Classification of Normal vs. Pathological Voices," 2019 27th European Signal Processing Conference (EUSIPCO), A Coruna, Spain, 2019, pp. 1–5,
- 25. Mokgonyane T. B., Sefara T. J., Manamela M. J. and Modipa T. I., "The Effects of Data Size on Text-Independent Automatic Speaker Identification System," 2019 International Conference on Advances in Big Data, Computing and Data Communication Systems (icABCD), Winterton, South Africa, 2019, pp. 1–6,
- 26. Mokgonyane T. B., Sefara T. J., Modipa T. I., Mogale M. M., Manamela M. J. and Manamela P. J., "Automatic Speaker Recognition System based on Machine Learning Algorithms," 2019 Southern African Universities Power Engineering Conference/Robotics and Mechatronics/Pattern Recognition Association of South Africa (SAUPEC/RobMech/PRASA), Bloemfontein, South Africa, 2019, pp. 141–146,
- 27. Winursito A., Hidayat R. and Bejo A., "Improvement of MFCC feature extraction accuracy using PCA in Indonesian speech recognition," 2018 International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 2018, pp. 379–383,
- 28. Firmansyah M. R., Hidayat R. and Bejo A., "Comparison of Windowing Function on Feature Extraction Using MFCC for Speaker Identification," 2021 International Conference on Intelligent Cybernetics Technology & Applications (ICICyTA), Bandung, Indonesia, 2021, pp. 1–5,
- 29. Choudhary H., Sadhya D. and Patel V., "Automatic Speaker Verification using Gammatone Frequency Cepstral Coefficients," 2021 8th International Conference on Signal Processing and Integrated Networks (SPIN), Noida, India, 2021, pp. 424–428,
- 30. Zhou Z., Zhang Y. and Duan Z., "Joint Speaker Diarization and Recognition Using Convolutional and Recurrent Neural Networks," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 2018, pp. 2496–2500,
- 31. Huang C. L., "Exploring Effective Data Augmentation with TDNN-LSTM Neural Network Embedding for Speaker Recognition," 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 2019, pp. 291–295,
- 32. Soliman N.F., Mostfa Z., El-Samie F.E.A. et al. Performance enhancement of speaker identification systems using speech encryption and cancelable features. Int J Speech Technol 20, 977–1004 (2017). https://doi.org/10.1007/s10772-017-9435-z.
- 33. Farhati A., Aicha A. B. and Bouallegue R., "On the strengthening of the speech encryption schemes for communication systems based on blind source separation approach," 2018 14th International Wireless Communications & Mobile Computing Conference (IWCMC), Limassol, Cyprus, 2018, pp. 108–111,