Abstract
In speaker verification, Softmax can be used as a back-end for multi-class training, but traditional Softmax methods have limitations that restrict performance. During training, Softmax performs multi-class classification, whereas the verification stage is a binary classification task, creating a mismatch between the two phases. A second issue is the disparity between the numbers of positive and negative samples drawn for the binary classification problem: the imbalance lets the negative-sample gradients dominate training, which degrades the performance of the speaker verification system. Third, when computing similarities between positive and negative samples, their score distributions may overlap; if the overlap is too large, the discriminability between positive and negative samples is reduced, weakening the system's ability to tell them apart. Finally, a relatively compact distribution of the positive and negative sample spaces benefits system performance, and focusing more on the learning of hard samples improves the network's convergence and generalization. This paper therefore introduces an adaptive objective function, SphereSpeaker, capable of addressing these issues. SphereSpeaker adds several types of hyperparameters to Softmax, making it better suited to speaker verification, and introduces three different angular margins to update the network, further enhancing the stability and generalization ability of the network model.
Meanwhile, considering the gradient vanishing, gradient explosion, and model degradation that can occur in deep neural networks, this paper also introduces a deep neural network named Residual Network PReLU (ResNet-P). The experimental results indicate that, compared with other deep neural network methods, this method achieves the lowest equal error rate, significantly improving the performance of the speaker verification system.
Citation: Chaoqun J, Wei C, Peng Y, Zhou W, Shuhang Z (2025) Target sample mining with modified activation residual network for speaker verification. PLoS ONE 20(4): e0320256. https://doi.org/10.1371/journal.pone.0320256
Editor: Qian Zhang, Jiangsu Open University, CHINA
Received: August 28, 2024; Accepted: February 14, 2025; Published: April 16, 2025
Copyright: © 2025 Chaoqun et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data underlying the results presented in the study are available from https://www.robots.ox.ac.uk/~vgg/data/voxceleb/.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
With the rapid development of deep learning, deep neural networks have achieved strong performance in many fields. In speaker verification, the Deep Vector (D-Vector) method [1] brought deep learning theory to the forefront of the field: it feeds the speaker's voice frames into a Deep Neural Network (DNN) and obtains frame-level embedding features. The X-Vector method [2] extracts speaker features through Time-Delay Neural Networks (TDNN) [3]; owing to the network's time-delay characteristics, it can learn more frame-level correlations during training, thereby increasing the robustness of the recognition system. With the performance improvement brought by the X-Vector method, more network architectures have been applied to speaker verification tasks. The Visual Geometry Group-Middle (VGG-M) network [4], initially applied in image processing, consists of 5 convolutional layers and 3 fully connected layers; its outstanding performance in image processing attracted attention from various fields, and it has been applied in the feature extraction stage of speaker verification [5]. Deep Residual Networks (ResNet) [6] can pass shallow data directly to deep layers, which benefits gradient optimization and speeds up network training.
The objective function is crucial to building deep learning models. The most common objective functions were initially aimed at classification [7]. Research on this type of objective function proceeds from two perspectives based on the Softmax loss [8]. One is to enhance its discriminative ability by increasing the distance between the decision boundaries of different classes, including variants such as Additive Margin Softmax (AM-Softmax) loss [9], Dynamic Additive Margin Softmax [10], and Additive Angular Margin Softmax (AAM-Softmax) loss [11], among others. The second is to enhance the discriminability of the Softmax loss through regularization, which typically connects the regularizer and the Softmax loss in a weighted form; the regularizers used are usually loss functions that can stand alone, such as Center loss [12] and Ring loss [13]. Common objective functions aimed at metric learning include binary cross-entropy loss [14], contrastive loss [15], triplet loss [16], quadruplet loss [17], and objective functions based on Mutual Information Adaptive Estimation (MIAD) [18], among others. With the research and development of sampling techniques, methods that optimize solely for metric learning can also achieve ideal performance, with effects similar to those of combining classification with metric learning [19].
Objective function based on sample mining
The main focus of this work is an objective function based on sample mining, an improvement on the Softmax loss function for multi-classification. Research has found that the Softmax method has some restrictive drawbacks [20]. Softmax is used within a multi-class framework during training, whereas the speaker verification phase is a classic binary classification problem, which leads to a discrepancy between the training phase and the verification phase. In addition, multi-class training increases the difficulty of data collection, since the category of every sample must be determined. To address these issues, this section introduces a binary classification objective function, which effectively bridges the gap between the training phase and the validation phase and enhances system performance.
Binary classification objective function
To address the differences between the training phase and the validation phase, K binary classification tasks are constructed, with K being the number of speakers in the training set. In the i-th binary classification, positive samples come from the i-th speaker's audio, while negative samples come from other speakers' audio. A basic binary classification objective function is shown as follows:
$$L = \frac{1}{K}\sum_{i=1}^{K}\log\Big(1 + e^{-y_i\,(w_i^{\top}x + b_i)}\Big) \tag{1}$$
where $w_i$ is the weight of the i-th binary classifier, x is the feature of the speaker, y is the corresponding label, and $b_i$ is the bias. To transform the problem into an unconstrained space for classification, the binary classifier weights $w_i$ and the speaker features x are normalized. The transformed objective function is shown as follows:
$$L = \frac{1}{K}\sum_{i=1}^{K}\log\Big(1 + e^{-y_i\cos\theta_i}\Big) \tag{2}$$
where $\theta_i$ is the angle between the weights $w_i$ of the i-th binary classifier and the speaker features x. Speaker verification is an open-set task, whereas the bias $b_i$ in equation (1) pertains to closed-set learning, so the bias is usually removed. The purpose of this binary classification objective function is to minimize the angle $\theta_y$ between a speaker's features and the corresponding classifier weights, thereby reducing the loss. However, binary classification objectives raise certain issues in speaker verification tasks, which the following sections analyze one by one.
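The K binary classifiers described above can be sketched in a few lines. This is a minimal illustration: the function name, the logistic log(1 + exp(·)) form over cosine similarities, and the scale s are our assumptions, not the paper's exact implementation.

```python
import math

def binary_classification_loss(cos_thetas, target, s=1.0):
    """Average of K binary logistic losses over cosine similarities.

    cos_thetas[i] is cos(theta_i), the cosine between the (normalized)
    speaker feature x and the weight w_i of the i-th binary classifier.
    target is the index of the true speaker, so y_i = +1 for the target
    classifier and -1 for all others. The bias term is dropped, since
    speaker verification is an open-set task.
    """
    total = 0.0
    for i, c in enumerate(cos_thetas):
        y = 1.0 if i == target else -1.0
        total += math.log(1.0 + math.exp(-y * s * c))
    return total / len(cos_thetas)
```

The loss shrinks as the target cosine approaches 1 and the non-target cosines approach -1, matching the minimize-the-target-angle reading above.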
Balancing positive and negative samples problem
In the previous section, K was defined as the number of speakers in the training set. Multi-speaker methods such as triplet loss, contrastive loss, and Softmax loss must all address the balance of positive and negative samples: an uneven quantity of the two produces highly imbalanced gradients, which has a negative impact on the network.
During training, each batch selects K speakers, resulting in only 1 positive sample against K-1 negative samples. This disparity causes the gradient of the negative samples to dominate, reducing the performance of the model. A weighting factor λ is therefore introduced to balance the gradients of positive and negative samples, with the following updated expression:
$$L = \frac{1}{K}\Big[\lambda\log\big(1 + e^{-\cos\theta_y}\big) + \sum_{i\neq y}\log\big(1 + e^{\cos\theta_i}\big)\Big] \tag{3}$$
where $\lambda$ is a hyperparameter used to balance the gradients of positive and negative samples. $\lambda$ can be defined according to the number K of speakers selected for each batch, for example $\lambda = K - 1$.
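The imbalance is easy to see numerically: with one positive and K-1 negatives per batch, the negative terms dominate the summed loss. A small sketch, with λ as an assumed multiplier on the positive term:

```python
import math

def balanced_loss(cos_thetas, target, lam):
    """Binary-classification loss with the positive term up-weighted by lam.

    With one positive and K-1 negatives per batch, choosing lam on the
    order of K-1 keeps the positive and negative contributions comparable.
    """
    pos = neg = 0.0
    for i, c in enumerate(cos_thetas):
        if i == target:
            pos += math.log(1.0 + math.exp(-c))
        else:
            neg += math.log(1.0 + math.exp(c))
    return lam * pos + neg

K = 100
cos = [0.0] * K  # all similarities equal, so every term is log(2)
unbalanced = balanced_loss(cos, 0, lam=1.0)
balanced = balanced_loss(cos, 0, lam=K - 1)
```

With K = 100 identical scores, the single positive term is 1/100 of the unbalanced loss but exactly half of the balanced one, so its gradient is no longer drowned out.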
Hard sample mining problem
Hard sample mining has always been one of the core issues in deep learning research. In the speaker verification task, hard samples are same-speaker pairs with low scores and different-speaker pairs with high scores; conversely, easy samples are same-speaker pairs with high scores and different-speaker pairs with low scores. Mining easy and hard samples has a positive impact on neural networks: in the objective function, the loss of hard samples is higher than that of easy samples. It is therefore worth optimizing the objective function so the model pays more attention to hard samples, which improves the network's convergence and generalization.
To study hard and easy sample mining for the Softmax loss, define the normalized Softmax [21] loss as
$$L = -\log\frac{e^{s\cos\theta_y}}{\sum_{i=1}^{K} e^{s\cos\theta_i}}$$
where s is an adjustable parameter and y is the label. Fixing the non-target terms $\cos\theta_i\ (i \neq y)$, samples with $\cos\theta_y < 0$ are defined as hard samples and samples with $\cos\theta_y > 0$ as easy samples. Numerical calculation shows that as the adjustable parameter s increases, the loss of hard samples becomes higher and more sensitive than that of easy samples. Introducing the parameter s thus lets the neural network focus on optimizing hard samples, enhancing its convergence and generalization. Applying this idea to the objective function proposed in this chapter, a hyperparameter r is introduced to adjust the focus on hard samples. The improved objective function is represented as:
$$L = \frac{1}{rK}\Big[\lambda\log\big(1 + e^{-r\cos\theta_y}\big) + \sum_{i\neq y}\log\big(1 + e^{r\cos\theta_i}\big)\Big] \tag{4}$$
where larger values of the hyperparameter r give more attention to hard samples. Setting the parameter $\lambda$ in equation (4) to 1, defining samples with $\cos\theta > 0$ as easy samples and samples with $\cos\theta < 0$ as hard samples, and adjusting $\cos\theta$ as well as the hyperparameter r, numerical calculation shows that as r increases the loss of easy samples approaches zero, while the loss of hard samples remains essentially unchanged.
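That claim can be checked numerically. In this sketch the logit is scaled by r and the loss normalized by 1/r (an assumed form; the paper's exact parameterization may differ), with cos θ = 0.8 standing for an easy sample and cos θ = -0.8 for a hard one:

```python
import math

def scaled_loss(cos_theta, r):
    """(1/r) * log(1 + exp(-r * cos_theta)): a softplus sharpened by r."""
    return math.log(1.0 + math.exp(-r * cos_theta)) / r

easy = [scaled_loss(0.8, r) for r in (1, 5, 20)]   # shrinks toward 0
hard = [scaled_loss(-0.8, r) for r in (1, 5, 20)]  # settles near 0.8
```

As r grows, the easy-sample loss collapses toward zero while the hard-sample loss settles near |cos θ| = 0.8, so the gradient effort concentrates on the hard samples.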
Margin adjustment issue
In the speaker verification task, the number of same-speaker samples in each batch is much smaller than the number of different-speaker samples, so the space occupied by different-speaker samples is larger. During training it is necessary to allocate more space to the different-speaker samples, which benefits the stability and flexibility of the model. Therefore, a margin m is introduced, and the objective function becomes:
$$L = \frac{1}{rK}\Big[\lambda\log\big(1 + e^{-r(\cos\theta_y - m) - b}\big) + \sum_{i\neq y}\log\big(1 + e^{r(\cos\theta_i + m) + b}\big)\Big] \tag{5}$$
where b is a learnable bias parameter whose purpose is to improve the stability of training. After this modification, the positive-sample boundary becomes $\cos\theta = m$ and the negative-sample boundary becomes $\cos\theta = -m$.
Similarity adjustment
During training, it was discovered that the cosine-similarity distributions of positive and negative sample pairs differ: the negative samples are more concentrated, while the positive samples are more dispersed. This difference can cause the similarity scores of positive and negative pairs to overlap, which blurs the distinction between them and is detrimental to training. Considering this, a monotone function of the angle (decreasing in θ, like the cosine itself) is introduced to replace the cosine. Its expression is as follows:
$$g(\cos\theta) = 2\left(\frac{1 + \cos\theta}{2}\right)^{t} - 1, \qquad t \ge 1 \tag{6}$$
where t is a parameter that controls the overlap of positive and negative sample pairs; as t increases, the overlap decreases. It is noteworthy that when $t = 1$, $g(\cos\theta) = \cos\theta$. The modified similarity still takes values in $[-1, 1]$, but it resolves the overlap between positive and negative sample pairs, increasing their discriminability and facilitating model learning. The objective function after this final modification can be represented as:
$$L = \frac{1}{rK}\Big[\lambda\log\big(1 + e^{-r(g(\cos\theta_y) - m) - b}\big) + \sum_{i\neq y}\log\big(1 + e^{r(g(\cos\theta_i) + m) + b}\big)\Big] \tag{7}$$
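The monotone transform just described can be sketched as follows, using the form from the SphereFace2 family [20], g(z) = 2((1+z)/2)^t - 1; that this is the exact function used here is an assumption:

```python
def g(cos_theta, t=3.0):
    """Similarity adjustment: maps [-1, 1] back into [-1, 1].

    t = 1 recovers plain cosine similarity; t > 1 pushes mid-range
    scores down, shrinking the overlap between the positive and
    negative score distributions.
    """
    return 2.0 * ((1.0 + cos_theta) / 2.0) ** t - 1.0
```

For example, g(0.5, t=3) = -0.15625 while g(0.9, t=3) ≈ 0.715: a middling score is pushed well below a strong positive-pair score, widening the gap between the two distributions.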
Adaptive objective function with different angular margins
Margin adjustment for same-speaker and different-speaker samples is added to increase training stability. Comparing with the margin forms of other methods, the final objective function is shown as follows:
In equation (8), the use of the learnable parameter b to adjust the margin for same-speaker and different-speaker samples can be considered an additive margin. The same approach can also incorporate another type of additive margin [9] and integrate a gradient separation method, making the training process relatively stable. The objective function of this method can be expressed as:
where $\mathrm{sg}(\cdot)$ is the gradient separation (stop-gradient) function, which lets some network parameters opt out of the parameter update, reducing the impact of the branch network on the main network. Similarly, multiplicative margins can also be introduced into the objective function [22,23]; the objective function with multiplicative margins can be represented as:
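The pieces above (balance weight λ, hard-sample scale r, margin m, similarity exponent t) can be collected into one SphereFace2-style sketch. How SphereSpeaker exactly composes g(·), the margin, and the scale is our assumption, so this is illustrative only:

```python
import math

def sphere_speaker_loss(cos_thetas, target, lam=1.0, r=30.0, m=0.2, t=3.0):
    """Sketch of a SphereFace2-style binary-classification objective.

    lam: positive/negative balance weight; r: hard-sample scale;
    m: additive margin; t: similarity-adjustment exponent.
    """
    def g(z):  # similarity adjustment; t = 1 gives plain cosine
        return 2.0 * ((1.0 + z) / 2.0) ** t - 1.0

    loss = 0.0
    for i, c in enumerate(cos_thetas):
        if i == target:  # positive pair: push g(cos) above +m
            loss += lam * math.log(1.0 + math.exp(-r * (g(c) - m))) / r
        else:            # negative pair: push g(cos) below -m
            loss += math.log(1.0 + math.exp(r * (g(c) + m))) / r
    return loss

well_separated = sphere_speaker_loss([0.9, -0.9, -0.9], 0)
overlapping = sphere_speaker_loss([0.2, 0.4, 0.3], 0)
```

A trial list with well-separated scores incurs nearly zero loss, while overlapping positive/negative scores are penalized heavily, which is the behavior the margin and similarity adjustments are meant to produce.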
ResNet-P
A distinctive feature of residual networks is that they have many “residual units”, which can be represented as follows:
$$y_l = h(x_l) + F(x_l, W_l) \tag{11}$$
where $x_l$ is the input speaker feature of the l-th residual unit, $W_l = \{W_{l,k} \mid 1 \le k \le K\}$ are the parameters of the l-th residual unit, K is the number of layers in the residual unit, and $F$ is the residual function. $h(x_l)$ can be considered a direct mapping, that is, $h(x_l) = x_l$. Define the input of the residual unit at layer $l+1$ as the speaker feature $x_{l+1}$, which can be represented as:
$$x_{l+1} = f(y_l) \tag{12}$$
where $f$ is the activation function. The original method used the Rectified Linear Unit (ReLU) [24], whose purpose is to add non-linearity between the neurons of the network. This chapter introduces the Parametric Rectified Linear Unit (PReLU) [25] on this basis. PReLU adds a learnable parameter to ReLU, allowing the activation function to adapt to the true state of the features, which enhances the network's ability to represent features and accelerates its convergence. PReLU is computed as:
$$f(x_l) = \max(0, x_l) + a_l \min(0, x_l) \tag{13}$$
where $x_l$ is the input at the residual unit of layer l and $a_l$ is a learnable parameter that controls the slope of the negative part. $a_l$ uses a momentum update:
$$\Delta a_l := \mu\,\Delta a_l + \eta\,\frac{\partial \sigma}{\partial a_l} \tag{14}$$
where the momentum $\mu$ is set to 0.9, the learning rate $\eta$ to 0.005, and $\sigma$ is the objective function; the initial value of $a_l$ is 0.25. Note that when the learnable parameter $a_l = 0$, equation (13) reduces to ReLU. The PReLU and ReLU curves are drawn in Fig 1. The slope $a_l$ in the third quadrant of the PReLU plot is not a constant but a learnable parameter; for ease of observation it is drawn as a constant in the figure.
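PReLU and its slope update can be sketched in scalar form; the sign convention of the momentum update below is our assumption:

```python
def prelu(x, a):
    """PReLU: identity for x > 0, learnable slope a for x <= 0.

    a == 0 recovers plain ReLU; the text initializes a to 0.25.
    """
    return x if x > 0 else a * x

def update_slope(a, grad, velocity, mu=0.9, eta=0.005):
    """One momentum step for the PReLU slope (momentum 0.9, lr 0.005,
    matching the hyperparameters stated in the text)."""
    velocity = mu * velocity - eta * grad
    return a + velocity, velocity
```

For an input of -2.0 with the initial slope 0.25, PReLU outputs -0.5 where ReLU would output 0, keeping a small gradient alive on the negative side.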
Assuming $f$ is an identity mapping, equation (12) becomes $x_{l+1} = y_l$; substituting this into equation (11) yields:
$$x_{l+1} = x_l + F(x_l, W_l) \tag{15}$$
Applying this recursively finally gives:
$$x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \tag{16}$$
where $x_L$ is a relatively deep speaker feature, $x_l$ is a relatively shallow speaker feature, and $\sum_{i=l}^{L-1} F(x_i, W_i)$ is the stack of residual units between them. It follows that any deep feature can be represented as the sum of any shallow feature and the stacked residual units. We name the residual network with a trainable activation function ResNet-P and apply it to the field of speaker verification.
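The claim that any deep feature equals a shallow feature plus a sum of residuals can be checked with a toy stack of identity-mapped units (scalar features and arbitrary residual functions, purely illustrative):

```python
def residual_stack(x, residual_fns):
    """Apply identity-mapped residual units: x_{l+1} = x_l + F_l(x_l)."""
    outputs = [x]
    for F in residual_fns:
        x = x + F(x)
        outputs.append(x)
    return outputs

# three toy residual functions standing in for F(x_l, W_l)
fns = [lambda v: 0.1 * v, lambda v: -0.5, lambda v: v * v]
outs = residual_stack(2.0, fns)
shallow, deep = outs[0], outs[-1]
residual_sum = sum(F(outs[i]) for i, F in enumerate(fns))
```

Here `deep == shallow + residual_sum` holds regardless of what the residual functions compute, which is exactly the identity behind the recursion above.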
Experimental analysis
Dataset.
The experiments use the large-scale speaker recognition database VoxCeleb1 [5], which has varying speech quality. The audio is extracted from YouTube videos recorded in complex environments and contains various types of noise. The development set contains 148,642 speech segments provided by 1,211 speakers (690 male, 561 female). The evaluation set includes 40 speakers disjoint from the development-set categories, totaling 4,874 speech samples. Testing uses the official trial list, with 37,720 trials in total and a 1:1 ratio of non-target to target trials.
Procedure.
Next, the data are processed. Since the data may contain silent segments, speech is first detected and the silent parts removed. Feature extraction follows, with the front end using 13-dimensional MFCC features. The speech signal is preprocessed with pre-emphasis, framing, and windowing: the pre-emphasis coefficient is 0.97, the window length is 25 ms, the frame shift is 10 ms, and the FFT size is 512. These operations yield the spectral features of the speech. The residual network that introduces PReLU is denoted ResNet-P. The last fully connected layer of ResNet-P has dimension 256, so the embedding features are also 256-dimensional. ResNet-P is optimized with the Stochastic Gradient Descent (SGD) algorithm, with the learning rate decayed from an initial to a final value.
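The pre-emphasis and framing step described above (coefficient 0.97, 25 ms window, 10 ms shift) can be sketched as follows; the 16 kHz sample rate and the toy silent signal are illustrative assumptions:

```python
def frame_signal(signal, sample_rate=16000, win_ms=25, shift_ms=10,
                 preemph=0.97):
    """Pre-emphasize a waveform, then slice it into overlapping frames."""
    # pre-emphasis: s'[n] = s[n] - 0.97 * s[n-1]
    emphasized = [signal[0]] + [
        signal[n] - preemph * signal[n - 1] for n in range(1, len(signal))
    ]
    win = int(sample_rate * win_ms / 1000)      # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)  # 160 samples at 16 kHz
    return [emphasized[i:i + win]
            for i in range(0, len(emphasized) - win + 1, shift)]

frames = frame_signal([0.0] * 16000)  # 1 s of (silent) audio
```

One second of 16 kHz audio yields 98 overlapping 400-sample frames; each frame would then be windowed and passed to a 512-point FFT for the MFCC computation.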
Performance comparison and analysis
Compare the performance of the ResNet-P network combined with the three proposed objective functions under different parameters. All three use Cosine Distance Scoring (CDS) for speaker matching and are abbreviated ResNet-P+SphereSpeaker-C, ResNet-P+SphereSpeaker-A, and ResNet-P+SphereSpeaker-M, respectively. The comparison methods cover both traditional statistical models and deep neural network models for speaker verification. The statistical methods include GMM-UBM and I-vector+PLDA; the front-end acoustic features of GMM-UBM are, respectively, Mel-frequency cepstral coefficients (MFCC) [26], modified power-normalized cepstral coefficients (MPNCC) [27], and features based on affine transformation and feature switching (ATFS) [27]. The deep neural network models include four speaker recognition systems, each using ResNet34 [6] as the network architecture with a different objective function: contrastive loss, triplet loss, AM-Softmax loss, and MIAD loss [18], abbreviated ResNet34+Contrastive, ResNet34+Triplet, ResNet34+AM-Softmax, and ResNet34+MIAD, respectively. The comparison also includes a CNN-based method (AutoSpeech) [28], VGG-based networks [29], SincNet [30], SimCLR+NN [31], BPCSR [32], and BGLCC [33]. Performance is evaluated with the Equal Error Rate (EER) and the minimum Detection Cost Function (minDCF), with minDCF parameters set to the official values; lower values of both indicate better performance. Based on this experimental setup, the performance comparison of the different methods is shown in Table 1.
As can be seen from Table 1:
- (1) The ResNet-P+SphereSpeaker-A method performs best among the compared methods under its optimal parameter setting: the equal error rate (EER) reaches a minimum of 6.17%, and the minimum detection cost function (minDCF) reaches a minimum of 0.48.
- (2) Under different objective functions, as t increases, both EER and minDCF decrease when the parameter λ is held fixed, showing that the parameter t enhances the neural network's ability to separate the similarity scores of positive and negative samples.
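For reference, the EER used throughout these comparisons is the operating point where the false-acceptance rate (FAR) and false-rejection rate (FRR) coincide. A minimal threshold-sweep computation over a toy score list (not the official VoxCeleb scoring tool):

```python
def equal_error_rate(scores, labels):
    """EER over (score, label) pairs, where label 1 marks a target trial.

    Sweeps thresholds at the observed scores and returns (FAR + FRR) / 2
    at the threshold where the two rates are closest.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(scores)):
        far = sum(s >= thr for s in neg) / len(neg)  # false acceptances
        frr = sum(s < thr for s in pos) / len(pos)   # false rejections
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

eer = equal_error_rate([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1],
                       [1, 1, 1, 0, 1, 0, 0, 0])  # 0.25
```

Here one of the four target trials scores below one non-target trial, so FAR = FRR = 1/4 at the crossing point and the EER is 25%.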
Parameter selection and analysis
Compare the performance of the ResNet-P network with the three proposed objective functions under different parameters. All three objective functions have two adjustable parameters, λ and t: λ balances positive and negative samples, and t adjusts the similarity of positive and negative samples. The experiments in this section use the control-variable method: all parameters other than λ and t are fixed, and λ and t are varied over preset ranges to verify the performance differences of the system. EER and minDCF serve as performance evaluation metrics. The experimental results for all combinations of the two parameters are plotted in Fig 2.
As can be seen from Fig 2:
- (1) Keeping the parameter λ fixed and adjusting the value of t, the EER and minDCF of the three methods at larger t are reduced to varying degrees compared with their values at smaller t. The decline is most significant for ResNet-P+SphereSpeaker-A, which reduces EER by 1.43% and minDCF by 0.09, while ResNet-P+SphereSpeaker-M decreases EER by 1.03% and minDCF by 0.05. This shows that the parameter t in the proposed objective function separates the overlapping parts of positive- and negative-sample similarities under different angular margins, improving the performance of the speaker system.
- (2) Keeping the parameter t fixed, the ResNet-P+SphereSpeaker-C and ResNet-P+SphereSpeaker-A methods show a visible decrease in EER and minDCF for certain settings of λ, and as λ decreases, the minDCF of ResNet-P+SphereSpeaker-M gradually decreases over part of the parameter range. This demonstrates that the parameter λ of the three objective functions effectively addresses the imbalance between positive and negative samples during training, enhancing the accuracy of the speaker verification task.
Convergence comparison and analysis
A comparative convergence analysis is conducted for the three proposed methods, ResNet-P+SphereSpeaker-C, ResNet-P+SphereSpeaker-A, and ResNet-P+SphereSpeaker-M, with the same experimental data settings as in the previous sections. The convergence curves use EER and minDCF as performance evaluation metrics, and all experiments in this paper run for 45 iterations. The convergence curves for the three methods under different parameters are shown in Fig 3.
As can be seen from Fig 3:
- (1) As the number of epochs increases, the EER and minDCF of all three proposed methods show a decreasing trend under different parameters. Among them, ResNet-P+SphereSpeaker-C and ResNet-P+SphereSpeaker-A have lower EER, and ResNet-P+SphereSpeaker-A has the lowest minDCF value.
- (2) The proposed ResNet-P+SphereSpeaker-A method achieves the lowest minDCF value of 0.48 under its best (λ, t) setting, further demonstrating the positive effect of the proposed method on the speaker verification task.
- (3) The proposed ResNet-P+SphereSpeaker-C method achieves the lowest EER value of 6.17% under one (λ, t) setting, and ResNet-P+SphereSpeaker-A achieves the same lowest EER of 6.17% under another. Comparing the same method under different parameters demonstrates the effectiveness of the parameters λ and t in addressing the gradient imbalance between positive and negative samples and the overlap of positive- and negative-sample similarities during training for speaker verification tasks.
Conclusion
This article proposes a speaker verification method based on target sample mining with a modified-activation residual network; the method uses an adaptive objective function under three different angular margins. While preserving the advantages of deep networks in expressing speaker characteristics, it addresses the gradient vanishing, gradient explosion, and model degradation caused by increasing the number of neural network layers. Introducing the adaptive objective function under three different angular margins to update the network further enhances the stability and generalization ability of the model. The experimental results indicate that all three proposed methods improve the accuracy of speaker verification tasks, improve the stability and accuracy of deep networks in expressing speaker characteristics, and effectively enhance the network's representational ability under the supervision of this objective function. From the analyses of overall performance, parameter selection, and convergence, the method proposed in this paper performs well in speaker verification tasks.
References
- 1. Variani E, Lei X, McDermott E, Lopez-Moreno I, Gonzalez-Dominguez J. Deep neural networks for small footprint text-dependent speaker verification. IEEE International Conference on Acoustics, Speech and Signal Processing, Florence, Italy. 2014:4052–6. Available from: https://ieeexplore.ieee.org/document/6854363
- 2. Hinton G, Deng L, Yu D, Dahl G, Mohamed A, Jaitly N, et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process Mag. 2012;29(6):82–97.
- 3. Emad SH, Badawi N, Seddeq HS, Adel ZM, Ahmed SO, Atef E. Enhancing speaker identification through reverberation modeling and cancelable techniques using ANNs. PLOS ONE. 2024.
- 4. Ruqia B, Zahid M, Asmaa M, Rehan MY, Syed SA. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval. PLOS ONE. 2022.
- 5. Nagrani A, Chung JS, Zisserman A. VoxCeleb: A Large-Scale Speaker Identification Dataset. Proceeding of the Annual Conference of the International Speech Communication Association. Stockholm, Sweden; 2017:2610-2620. https://www.semanticscholar.org/paper/VoxCeleb%3A-A-Large-Scale-Speaker-Identification-Nagrani-Chung/8a26431833b0ea8659ef1d24bff3ac9e56dcfcd0.
- 6. Chung J, Nagrani A, Zisserman A. Voxceleb2: Deep speaker recognition. Proceedings of the Annual Conference of the International Speech Communication Association, Hyderabad, India. 2018:1086–90.
- 7. Bai Z, Zhang X-L. Speaker recognition based on deep learning: An overview. Neural Netw. 2021;140:65–99. pmid:33744714
- 8. Megha R, Mukul R, Karan A, Elena L, Mary A, Nersisson R. Linguistic based emotion analysis using Softmax over time attention mechanism. PLOS ONE. 2024.
- 9. Yu Y, Fan L, Li WJ. Ensemble additive margin softmax for speaker verification. Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Brighton, England. 2019:6046–50. Available from: https://ieeexplore.ieee.org/document/8683649
- 10. Zhou D, Wang L, Lee KA. Dynamic margin softmax loss for speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Shanghai, China. 2020:3800–4.
- 11. Zhong Q, Dai R, Zhang H, Zhu Y, Zhou G. Text-independent speaker recognition based on adaptive course learning loss and deep residual network. EURASIP J Adv Signal Process. 2021;2021(1).
- 12. Li N, Tuo D, Su D. Deep discriminative embeddings for duration robust speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Hyderabad, India. 2018:2262–6.
- 13. Liu Y, He L, Liu J. Large margin softmax loss for speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Graz, Austria. 2019:2873–7. Available from: https://www.researchgate.net/publication/335829780_Large_Margin_Softmax_Loss_for_Speaker_Verification
- 14. Zhang Y, Yu M, Li N, Yu C, Cui J, Yu D. Seq2Seq Attentional Siamese Neural Networks for Text-dependent Speaker Verification. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, England. 2019:6131–5.
- 15. Bhattacharya G, Alam MJ, Gupta V. Deeply fused speaker embeddings for text-independent speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Hyderabad, India. 2018:3588–92.
- 16. Zhang C, Koishida K, Hansen JHL. Text-independent speaker verification based on triplet convolutional neural network embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018;26(9):1633–44.
- 17. Bai Z, Zhang X-L, Chen J. Partial AUC Optimization Based Deep Speaker Embeddings with Class-Center Learning for Text-Independent Speaker Verification. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain. 2020:6819–23.
- 18. Chen C, Ji C, Li W, Chen D, Wang L, Yang H. Mutual information adaptive estimation for speaker verification. Journal of University of Electronic Science and Technology of China. 2023;52(1):125–31.
- 19. Kye S, Jung Y, Lee HB, Sung JH, Hoirin K. Meta-learning for short utterance speaker recognition with imbalance length pairs. Proceedings of the Annual Conference of the International Speech Communication Association, Shanghai, China. 2020:2982–6. Available from: https://www.semanticscholar.org/paper/Meta-Learning-for-Short-Utterance-Speaker-with-Kye-Jung/106d307ac586c499c058d12da3fff5861052fa53
- 20. Wen YD, Liu WY, Adrian W, Raj B, Rita S. SphereFace2: Binary classification is all you need for deep face recognition. Proceedings of the International Conference on Learning Representations. Kigali, Rwanda. 2021:1–16. https://www.semanticscholar.org/reader/fe67ba856f8610af3dce291c6bd5f65295caa99b
- 21. Huang Z, Wang S, Yu K. Angular softmax for short-duration text-independent speaker verification. Proceedings of the Annual Conference of the International Speech Communication Association, Hyderabad, India 2018:3623–7. Available from: https://ieeexplore.ieee.org/document/8759958
- 22. Liu W, Wen Y, Raj B, Rita S, Adrian W. SphereFace revived: unifying hyperspherical face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;44(1):1–17.
- 23. Liu W, Wen Y, Yu Z, Ming L, Bhiksha R, Le S. Sphereface: Deep Hypersphere Embedding for Face Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Hawaii,USA. 2017:212–20.
- 24. Li L. ReLU-FCM trained by quasi-oppositional bare bone imperialist competition algorithm for predicting employment rate. PLOS ONE. 2022;17(8):e0272624.
- 25. He K, Zhang X, Ren S, Jian S. Delving deep into rectifiers: Surpassing human-level performance on Imagenet classification. IEEE International Conference on Computer Vision, Santiago, Chile. 2015:1026–34. Available from: https://ieeexplore.ieee.org/document/7410480
- 26. Davis S, Mermelstein P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust, Speech, Signal Process. 1980;28(4):357–66.
- 27. Athulya MS, Sathidevi PS. Speaker Verification from Codec-Distorted Speech Through Combination of Affine Transform and Feature Switching. Circuits Syst Signal Process. 2021;40(12):6016–34.
- 28. Ding S, Chen T, Gong X, Zha W, Wang Z. AutoSpeech: Neural architecture search for speaker recognition. Proceedings of the Annual Conference of the International Speech Communication Association, Shanghai,China. 2020:916–20.
- 29. Shon S, Tang H, Glass J. Frame-Level Speaker Embeddings for Text-Independent Speaker Recognition and Analysis of End-to-End Model. 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece. 2018:1007–13.
- 30. Ravanelli M, Bengio Y. Speaker recognition from raw waveform with sincnet. IEEE Spoken Language Technology Workshop (SLT), Athens, Greece. 2018:1021–8.
- 31. Liu Y, Wei L-F, Zhang C-F, Zhang T-H, Chen S-L, Yin X-C. Self-supervised contrastive speaker verification with nearest neighbor positive instances. Pattern Recognition Letters. 2023;173:17–22.
- 32. Tsai T-H, Chiang M-J. A High-Performance Neural Network SoC for End-to-End Speaker Verification. IEEE Access. 2024;12:165482–96.
- 33. Zi Y, Xiong S. Short-Duration Speaker Verification by Joint Filter Superposition-Based Multi-Dimensional Central Difference Feature Extraction and Res2Block-Based Bidirectional Sampling. IEEE Trans Consumer Electron. 2024;70(3):5128–41.