
Optimized technique for speaker changes detection in multispeaker audio recording using pyknogram and efficient distance metric

Correction

25 Mar 2025: The PLOS One Staff (2025) Correction: Optimized technique for speaker changes detection in multispeaker audio recording using pyknogram and efficient distance metric. PLOS ONE 20(3): e0320922. https://doi.org/10.1371/journal.pone.0320922 View correction

Abstract

Segmentation is widely used in speech recognition, word counting, speaker indexing, and speaker diarization. This paper describes a speaker segmentation system that detects speaker change points in a multi-speaker audio recording using feature extraction and a proposed distance metric algorithm. In this new approach, pre-processing of the audio stream includes noise reduction, speech compression using the discrete wavelet transform (Daubechies wavelet 'db40' at level 2), and framing. It is followed by two feature extraction algorithms: the pyknogram and the nonlinear energy operator (NEO). Finally, the extracted features of each frame are used to detect speaker change points by applying dissimilarity measures to compute the distance between two frames. To realize this, a sliding window is moved across the whole data stream to find the highest peak, which corresponds to a speaker change point. The distance metrics incorporated to detect the speaker boundaries are the standard Bayesian Information Criterion (BIC), Kullback-Leibler Divergence (KLD), T-test, and the proposed algorithm. Finally, a threshold is applied and the results are evaluated with recall, precision, and F-measure. The best result of 99.34% is achieved by the proposed distance metric with the pyknogram, compared to the BIC, KLD, and T-test algorithms.

1. Introduction

Nowadays, segmentation plays a significant role in various areas of speech processing, image processing [1], and multimedia content analysis [2]. It focuses on partitioning input signals into different parts according to their attributes. Preprocessing an audio stream using segmentation separates noise, silence, and speech, and can be applied to audio transcription, word counting, speaker diarization [3,4], speaker recognition, clustering, and indexing [5]. Moreover, it segments the audio stream into silence, speech, speaker, noise, music, and other acoustic signals by detecting their boundaries [6]. A review article [7] published in 1998 provides a succinct survey of speech research, covering its past, present, and future. The identification of speech-signal features with low linguistic information for nonspeech/speech detection is illustrated in [8]. Content analysis of audio for segmentation and classification, in which an audio stream is segmented according to audio type or speaker identity, is described in [9]. A novel technique, DIS_T2_BIC, for audio speaker segmentation when no prior knowledge of the speakers is assumed, is presented in [10]. A new online method based on the Bayesian information criterion (BIC) and the normalized cross-likelihood ratio (NCLR) is illustrated in [11]. Research trends in deep learning and audio segmentation are described in [12] via source-wise analysis. The three main categories of audio segmentation are metric-based, model-based, and hybrid methods [13]. Metric-based segmentation does not require training data and calculates the distance between two segments containing speech. For the same speaker the distance is close to zero, while the highest values indicate different speakers; the point at which the distance is highest corresponds to a speaker change point. BIC [14,15], KLD [8,16], and the generalized log-likelihood ratio (GLR) [17,18] are the most commonly used distance metric algorithms.
Model-based audio segmentation requires training data to train speaker classes and form a set of models for classification. The combination of both techniques is known as a hybrid algorithm, in which pre-segmentation uses a metric-based algorithm followed by model-based algorithms, improving the segmentation results [19]. Most algorithms perform well for speech segments longer than 25 milliseconds but degrade for short durations. A comprehensive overview of deep learning-based diarization and its challenges is given in [20].

1.1 Objectives

To overcome the limitations and problems of speaker segmentation system based on existing distance metrics, this study aims to develop a novel speaker change point detection system by accomplishing the following tasks:

  1. Obtain standard multi-speaker audio recordings for the development and testing of the system;
  2. Develop an algorithm that efficiently separates speech, non-speech, and speech overlapping with music, and further enhances speech-specific features;
  3. Propose a new distance metric algorithm based on NCLR that successfully detects the speaker change points;
  4. Evaluate the proposed model's performance using recall, precision, and F-measure.

1.2 Contribution

A problem arises for jail authorities when a culprit talks to family members on a phone call in coded form and for very short durations, and the authorities cannot recognize the speech of a third person in their recordings. The ability to estimate the number of speakers and the speech information present in a multi-speaker audio recording over time is valuable for detecting culprits in forensic science. The main contribution of this paper is the development of a novel system based on the pyknogram, the discrete energy operator, the DWT, and the NCLR algorithm. The pyknogram and discrete energy operator extract the unique features of each individual's speech; these features are then used for speaker change point detection by applying the proposed NCLR-based distance metric.

1.3 Structure of the paper

Firstly, this paper reviews the working and limitations of the speaker segmentation process and then defines the objectives and contribution of the paper. The second section describes the processing steps of the speaker segmentation system, which incorporate preprocessing, feature extraction, speaker change point detection, and the evaluation criterion. Experimental results and the system performance evaluation are discussed in the third section. The final section concludes the research work.

2. Proposed speaker segmentation system

The aim of speaker segmentation is to segregate an audio stream into acoustically homogeneous segments using efficient techniques. In this research work, without using voice activity detection, the proposed system is realized by the following basic steps, shown in Fig 1:

  • Preprocessing and framing;
  • Feature extraction using pyknogram;
  • Speaker Change Point (SCP) detection based on the proposed distance metric, NCLR;
  • System performance evaluation and comparison with other distance metrics.

2.1 Preprocessing using DWT and framing

The first step of the speaker segmentation system is preprocessing of the multi-speaker audio stream. The recording carries the speech of various speakers, noise, silence, clapping, and music. The unwanted signals (silence, noise, clapping, and music) must be removed and the strength of the speech signal enhanced. To do this, an efficient technique known as the wavelet transform is implemented. A wavelet is a small wave which starts from zero, oscillates, and then diminishes to zero. The wavelet transform extends the Fourier transform by representing both time and frequency in a single representation. Due to its high frequency and time resolution, the DWT is applied to investigate a signal simultaneously in the time-frequency domain and solves many technical issues in science, mathematics, and engineering. It can be used for signal compression, pattern recognition, image or speech de-noising, and scaling of weak signals [21]. This paper implements the DWT (Daubechies wavelet db40 at level 2) to decompose the speech signal into consecutive stages of high- and low-frequency coefficients [22]. The high-frequency coefficients, known as details, carry the noisy components of the recording, while the low-frequency coefficients, known as approximations, carry about 99% of the speech information. The transform is expressed by Eq (1) [21]:

DWT(m,n) = Σ_k x(k) ψ_{m,n}(k)    (1)

where the mother wavelet is

ψ_{m,n}(k) = 2^(−m/2) ψ(2^(−m) k − n)    (2)

and m and n are the scale and shift parameters, respectively. The original audio stream and its scaled, compressed form are depicted in Fig 2, which has two subplots: the upper waveform shows the original speech signal with 6×10^6 samples, and the lower subplot is its compressed form (the approximation) with 15.5×10^5 samples.
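The level-2 approximation-only compression described above can be sketched in a few lines. This is a minimal numpy illustration, not the authors' implementation: it uses the short db2 analysis filter instead of db40 (a wavelet library such as PyWavelets would normally supply the long filters), but the structure of the decomposition, low-pass filtering followed by downsampling at each level, is the same.

```python
import numpy as np

# Analysis low-pass filter for the Daubechies db2 wavelet (4 taps).
# The paper uses db40 at level 2; db2 keeps this sketch short.
_DB2_LO = np.array([1 + np.sqrt(3), 3 + np.sqrt(3),
                    3 - np.sqrt(3), 1 - np.sqrt(3)]) / (4 * np.sqrt(2))

def dwt_approximation(x, level=2):
    # Each level: low-pass filter, then downsample by 2. Keeping only
    # the approximation coefficients compresses the signal roughly 4:1
    # at level 2 while retaining most of the speech energy.
    a = np.asarray(x, dtype=float)
    for _ in range(level):
        a = np.convolve(a, _DB2_LO, mode="full")[1::2]
    return a
```

Discarding the detail coefficients at both levels is what yields the 4:1 sample-count reduction visible in Fig 2.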

Fig 2. The first subplot shows the waveform of the input signal; the second shows its compressed form, at a ratio of 4:1, obtained using the DWT.

https://doi.org/10.1371/journal.pone.0314073.g002

As the waveform is too long for further processing, it is converted into overlapping frames by applying a Hamming window. Each frame has a duration of 0.03 seconds (i.e., 1323 samples) with a frame shift of 0.01 seconds (i.e., 441 samples), as shown in the upper subplot of Fig 3 [16].
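The framing step above can be sketched as follows; this is an illustrative numpy version using the frame and shift durations stated in the text (0.03 s and 0.01 s at 44100 Hz), not the authors' MATLAB code.

```python
import numpy as np

def frame_signal(x, fs=44100, frame_dur=0.03, shift_dur=0.01):
    # Split the signal into overlapping Hamming-windowed frames:
    # 0.03 s frames (1323 samples at 44100 Hz) with a 0.01 s shift
    # (441 samples), as described in the text.
    frame_len = round(fs * frame_dur)   # 1323
    shift = round(fs * shift_dur)       # 441
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // shift
    return np.stack([x[i * shift:i * shift + frame_len] * win
                     for i in range(n_frames)])
```

One second of audio at 44100 Hz therefore yields 98 overlapping frames.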

Fig 3. Upper subplot depicts frames of audio-stream and lower subplot represents extracted features of each frame.

https://doi.org/10.1371/journal.pone.0314073.g003

2.2 Feature extraction using pyknogram

The pyknogram is an enhanced form of the spectrogram of a speech signal. A spectrogram is a three-dimensional plot illustrating signal amplitude or energy over time at various frequencies: the vertical axis corresponds to frequency, the horizontal axis to time, and color illustrates the amplitude or loudness of the signal, with dark blue corresponding to low amplitude and brighter colors up through red corresponding to progressively stronger amplitudes. The spectrogram was first enhanced to track the formant frequencies of an audio signal and named the pyknogram [23]; it is also used to detect overlapping frames in speech data [24]. The basis of the pyknogram is the nonlinear energy operator (NEO), given by Eq (3) [25]:

Ψ[x(n)] = x(n)² − x(n−1)·x(n+1)    (3)

where n is the sample number of the digitized speech.
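The NEO of Eq (3) is straightforward to vectorize; the sketch below is illustrative (the endpoint handling, copying the neighboring value, is an assumption). For a pure sinusoid A·sin(Ωn) the operator returns the constant A²·sin²(Ω), which is why it tracks the product of amplitude and frequency so cheaply.

```python
import numpy as np

def teager_kaiser(x):
    # Nonlinear (Teager-Kaiser) energy operator, Eq (3):
    # psi[x(n)] = x(n)^2 - x(n-1)*x(n+1)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # assumed edge handling
    return psi
```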

In this research work, the pyknogram is used to enhance the formant frequencies; the instantaneous frequency is estimated by Eq (4) and the amplitude of the speech signal by Eq (5):

f(n) = (1/2) arccos(1 − Ψ[y(n)] / (2 Ψ[x(n)]))    (4)

|a(n)| = 2 Ψ[x(n)] / √(Ψ[y(n)])    (5)

where y(n) = x(n+1) − x(n−1), f is the instantaneous frequency, and |a| is the amplitude of the signal.

The dominant frequency of each frame of duration 0.12 seconds is computed using Eqs (4) and (5) as follows:

F_w(t) = Σ_n a²(n) f(n) / Σ_n a²(n)    (6)

where a(n) and f(n) in Eq (6) are the amplitude and instantaneous frequency, calculated for each sample in the t-th frame over the frame length (T samples per frame). The logarithm of F_w(t) is then taken, as expressed in Eq (7):

G_w(t,f) = log F_w(t)    (7)

where G_w(t,f) is the logarithm of the weighted average of the instantaneous frequency components, which enhances the output of the pyknogram. The final time-frequency representation of G_w(t,f) is depicted in the second subplot of Fig 3.
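The per-frame pipeline of Eqs (3)-(7) can be sketched end-to-end. This is a hedged illustration: it assumes a DESA-style frequency/amplitude estimator built on the NEO, and the exact estimator variant, edge handling, and constants used by the authors may differ.

```python
import numpy as np

def neo(x):
    # Nonlinear energy operator, Eq (3), for interior samples.
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def pyknogram_feature(frame, eps=1e-12):
    # One pyknogram value per frame: NEO-based estimates of
    # instantaneous frequency (Eq 4) and amplitude (Eq 5), combined
    # as the log of the amplitude-squared-weighted mean frequency
    # (Eqs 6-7).
    psi_x = neo(frame)                       # Psi[x(n)]
    y = frame[2:] - frame[:-2]               # y(n) = x(n+1) - x(n-1)
    psi_y = neo(y)                           # Psi[y(n)]
    m = min(len(psi_x), len(psi_y))
    psi_x, psi_y = psi_x[:m], psi_y[:m]
    ratio = np.clip(1 - psi_y / (2 * psi_x + eps), -1.0, 1.0)
    omega = 0.5 * np.arccos(ratio)           # frequency, rad/sample
    a2 = (2 * psi_x) ** 2 / (psi_y + eps)    # |a(n)|^2
    fw = np.sum(a2 * omega) / (np.sum(a2) + eps)   # Eq (6)
    return np.log(fw + eps)                  # Eq (7)
```

For a unit sinusoid at Ω rad/sample, the estimator recovers log Ω, which is the per-frame value plotted in the lower subplot of Fig 3.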

2.3 Distance metric for similarity measures

The third step of the speaker segmentation system is speaker change point detection, which is accomplished by feature matching. As discussed earlier, the audio recording was preprocessed and converted into frames of duration 0.12 seconds carrying 1323 samples each. The pyknogram was then applied to each frame to extract the features of the speech it contains. To find the boundaries of speaker changes in the multi-speaker audio recording, four similarity measures were applied: the Bayesian Information Criterion [26], Kullback-Leibler Divergence [27], T-test [28], and the proposed distance metric.

2.3.1 Bayesian information criteria.

To facilitate segmentation or speaker change point detection, the most common and efficient method is the BIC, or Schwarz information criterion [13,29]. Its main function is to find the distance between two frames, represented by ΔBIC and given by Eq (8):

ΔBIC = (N₁/2) log|∑₁| + (N₂/2) log|∑₂| − (N/2) log|∑| + (λ/2)(d + d(d+1)/2) log N    (8)

where N₁, N₂, and N are the numbers of samples in frame 1, frame 2, and the merged frame (frame 1 + frame 2), respectively; ∑₁, ∑₂, and ∑ are the covariance matrices of frame 1, frame 2, and the merged frame; λ = 10 is a penalty weight; and d is the dimension of the feature space.

Since the multi-speaker audio recording contains many frames, a sliding window is moved across all the frames to compute ΔBIC and find the speaker change boundaries. If ΔBIC is greater than zero for two frames, the frames are similar; otherwise, the frame contains a speaker change point and belongs to a different speaker. The main disadvantage of the BIC is that it does not perform well on short data frames.
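A ΔBIC score between two windows of feature vectors can be sketched as below. This follows the sign convention stated in the text (positive means same speaker, negative means a change); the covariance regularization term is an assumption added for numerical stability.

```python
import numpy as np

def delta_bic(F1, F2, lam=10.0):
    # Delta-BIC between two windows (rows = frames, columns = features).
    # Positive: windows likely share a speaker; negative: speaker change.
    N1, d = F1.shape
    N2 = F2.shape[0]
    N = N1 + N2
    def logdet(F):
        # Log-determinant of the (regularized) sample covariance.
        return np.linalg.slogdet(np.cov(F, rowvar=False)
                                 + 1e-9 * np.eye(d))[1]
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N1 * logdet(F1) + 0.5 * N2 * logdet(F2)
            - 0.5 * N * logdet(np.vstack([F1, F2])) + penalty)
```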

2.3.2 Kullback-leibler divergence (KLD).

The KLD is a statistical method for computing the distance between two populations, segments, or frames [13]. In this paper, it is applied in speaker segmentation to measure the similarity between two frames. If N(μ₁,∑₁) and N(μ₂,∑₂) are the multivariate Gaussian distributions of two audio frames, their similarity can be evaluated by the symmetric KLD distance metric expressed in Eq (9):

KLD(1,2) = (1/2) tr[(∑₁ − ∑₂)(∑₂⁻¹ − ∑₁⁻¹)] + (1/2)(μ₁ − μ₂)ᵀ(∑₁⁻¹ + ∑₂⁻¹)(μ₁ − μ₂)    (9)

The highest value in a segment shows dissimilarity and can be considered a speaker change point. This method gives better results for segment lengths greater than 5 seconds.

2.3.3 T-Test.

Speaker change point detection using Student's t-test is an efficient and extensively used technique for similarity measurement in speech processing [17] and object-based classification [30]. To check whether two speech frames belong to the same speaker or to different speakers, a competent algorithm is required to find the distance between them. In this paper, the t-test is applied to adjacent frames to detect the boundaries of speaker changes. Mathematically, it is represented by Eq (10):

T_d = |m₁ − m₂| / √(σ₁²/n₁ + σ₂²/n₂)    (10)

where S₁(X) and S₂(X) denote two frames with respective means m₁ and m₂, standard deviations σ₁ and σ₂, and sizes n₁ and n₂. The distance between frames S₁ and S₂ is computed using Eq (10); a small value of T_d signifies that the two frames belong to the same speaker, while a large value indicates that they belong to different speakers.
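The t-distance of Eq (10) for two scalar feature sequences reduces to a Welch-style two-sample statistic:

```python
import numpy as np

def t_distance(f1, f2):
    # Two-sample t statistic between scalar feature sequences of two
    # frames: |m1 - m2| / sqrt(s1^2/n1 + s2^2/n2), as in Eq (10).
    m1, m2 = f1.mean(), f2.mean()
    v1, v2 = f1.var(ddof=1), f2.var(ddof=1)
    return abs(m1 - m2) / np.sqrt(v1 / len(f1) + v2 / len(f2))
```

Sequences drawn from the same speaker model give values near zero, while a shift in the feature mean between frames drives the statistic up sharply.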

2.3.4 Proposed distance metric algorithm.

In this research work, to measure the dissimilarity between two speakers, an efficient distance metric based on the normalized cross-likelihood ratio (NCLR) is proposed; the NCLR was earlier used in speaker diarization systems [11,27,31]. If the features of two audio segments A and B have lengths n and m respectively, and the size of the feature space is p, then ∑A, ∑B, and ∑AB denote the determinants of the covariance matrices of segments A, B, and the fused segment AB, respectively. The proposed distance metric is defined in Eq (11), where λ = 10 and Q is an equalizing factor whose value is given by Eq (12).

The two audio segments are similar for more positive distance values, while values closer to zero indicate speaker change points. This algorithm gives better results for SCP detection in speech segments of 3-5 seconds.
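For orientation, a standard NCLR-style dissimilarity under Gaussian segment models can be sketched as below (larger values mean more dissimilar segments). This is not the proposed metric of Eqs (11)-(12), whose λ·Q penalty term and reversed sign convention are specific to this paper; it only shows the covariance-determinant structure that the proposed metric builds on.

```python
import numpy as np

def nclr_distance(FA, FB):
    # NCLR-style dissimilarity: per-segment-normalized log likelihood
    # ratios of segment models vs. the fused model, under Gaussian
    # assumptions. Larger value = more dissimilar segments.
    d = FA.shape[1]
    def logdet(F):
        return np.linalg.slogdet(np.cov(F, rowvar=False)
                                 + 1e-9 * np.eye(d))[1]
    l_ab = logdet(np.vstack([FA, FB]))   # fused segment AB
    return 0.5 * (l_ab - logdet(FA)) + 0.5 * (l_ab - logdet(FB))
```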

2.4 Performance evaluation

The performance of speaker change point detection can be evaluated with a confusion matrix, which is used to compute precision, recall, and F-measure [32,33]. For checking whether a Speaker Change Point (SCP) belongs to a given frame, the evaluation process has four possible outcomes, as shown in Table 1: hit (an SCP is present and predicted as "present"), miss (an SCP is present but predicted as "absent"), false alarm (no SCP is present but one is predicted), and correct rejection (no SCP is present and none is predicted). Two types of errors arise from these four outcomes: missed detections and false alarms.

  • Missed detection (Error 1): Speaker is not attributed when the SCP exists in the frame.
  • False alarms (Error 2): Speaker is attributed when there is no SCP in the frame.
Table 1. Confusion matrix: Describes the two errors based on ground truth and predicted values of the SCP in the frame.

https://doi.org/10.1371/journal.pone.0314073.t001

where TP: true positive, TN: true negative, FP: false positive, FN: false negative.

This table is used in computing recall, precision, and F-measure to investigate the performance of the speaker segmentation system.

Precision is the ratio of TP to (TP + FP), as expressed in Eq (13):

Precision = TP / (TP + FP)    (13)

Recall is the ratio of TP points to all the speaker change points in the reference file, as defined in Eq (14):

Recall = TP / (TP + FN)    (14)

The F-measure, the harmonic mean of precision and recall, is computed using Eq (15):

F-measure = 2 × Precision × Recall / (Precision + Recall)    (15)

These values range from 0 to 1; a higher F-measure indicates better system performance.
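Eqs (13)-(15) translate directly from the confusion-matrix counts:

```python
def evaluate_scp(tp, fp, fn):
    # Precision, recall and F-measure from SCP detection counts,
    # Eqs (13)-(15).
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, 8 correctly detected change points with 2 false alarms and 3 misses give a precision of 0.80, a recall of about 0.73, and an F-measure of about 0.76.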

3. Results

3.1 Database used

The development database consists of recordings of utterances of 11 speakers, each 15-20 seconds long, taken from the Personal Digital Assistant (PDA) speech database in .wav format [34]. These recordings were concatenated into a single recording and used as development data for the speaker segmentation system. For testing the proposed system, two recordings, of TV news and of a TV show, each 2-3 minutes long at a sampling frequency of 44100 Hz, were used. The TV news recording covers the last ten hours of Dr. A. P. J. Abdul Kalam and includes sad background music, silence, and speakers' voices; the second recording, of the famous TV show "Dr. Subhash Chandra Show", carries background music, clapping, silence, and speakers' voices of short and long durations. Since these recordings are in MP3 format, they were converted to .wav format for use in MATLAB.

3.2 Experimental results

In this section, experiments based on the feature extraction and speaker change point detection methods were performed on two data sources: the development database and the test database. They elaborate the effect of the traditional distance metrics (BIC, KLD, and T-test) and the proposed algorithm in detecting the boundaries where speaker changes occur. The development data source was first processed and analyzed by the speaker segmentation system, which was then tested on the test data source.

As discussed in Section 2, the features of the preprocessed speech signal were extracted by the pyknogram, i.e., the logarithm of the weighted average of the instantaneous frequency components given in Eq (7). To find the SCPs, the BIC was first applied to match the features of the frames. To accomplish this, a sliding window of 30 samples was moved across the whole audio stream to find the distance between two frames. BIC values that are positive, or greater than the threshold, imply that the frames belong to the same speaker, and vice versa, as shown in Fig 4.
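The sliding-window scan described above can be sketched generically; this is a hypothetical illustration in which `metric` stands for any of the frame-distance functions (ΔBIC, KLD, t-distance, or the proposed metric), and the thresholding direction is shown for a "larger score = change" convention (it is reversed for the paper's ΔBIC).

```python
import numpy as np

def detect_changes(features, metric, win=30, threshold=0.0):
    # Slide two adjacent windows of `win` frames across the feature
    # stream, score each candidate boundary with `metric`, and keep
    # the boundaries whose score exceeds the threshold.
    scores = np.array([metric(features[t - win:t], features[t:t + win])
                       for t in range(win, len(features) - win + 1)])
    change_points = np.flatnonzero(scores > threshold) + win
    return change_points, scores
```

A peak-picking pass over `scores` (keeping only local maxima) would then isolate the single highest peak per boundary, as described in the abstract.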

At a threshold level below zero, the BIC detected five change points, of which two were correct. When the KLD distance metric was applied at threshold level 4, only nine boundaries were detected, of which only three were correct; at threshold value 3.8, nine change points were detected, which increases the false alarm rate and reduces performance.

Thirdly, the T-test distance metric was implemented; its results are graphically represented in Fig 5, which shows that smaller output distance values indicate the same speaker. At threshold levels between zero and 70, ten change points were detected. Finally, the proposed distance metric was applied at threshold value 2 and detected eight speaker change points, corresponding to frame numbers 30, 860, 890, 910, 940, 1400, and 1630.

Finally, performance was evaluated with recall, precision, and F-measure using Eqs (13)-(15). These measures require the ground truth of speaker segmentation, which was therefore determined manually using the signal processing tool (SPTOOL) in MATLAB and is graphically represented in Fig 6. The figure contains two overlapping plots: the manually detected speaker change points (in blue) and the frames of the audio recording.

Fig 6. Manually detected 11 speaker change points and framed signal of development database.

https://doi.org/10.1371/journal.pone.0314073.g006

The frames of the manually detected speaker change points were taken as reference points and compared with the hypothesized speaker change points to compute recall, precision, and F-measure. The results of the three existing distance metrics (BIC, KLD, and T-test) and the proposed distance metric are depicted in Table 2.

Table 2. Experimental results of the speaker segmentation system for the three existing distance metrics and the proposed distance metric, applied to the development and test datasets.

https://doi.org/10.1371/journal.pone.0314073.t002

Furthermore, the performance of the speaker segmentation system was also evaluated on the test database, following the same steps as discussed above. When the BIC was used, 10 change points were detected at threshold value 3.8. The KLD distance metric gave less favorable results than the BIC because it created more false alarms. The T-test and the proposed distance metric again showed comparable results at threshold values 70 and 2, respectively.

3.3 Discussion

The performance results of all the algorithms are tabulated in Table 2. The proposed distance metric with the pyknogram at threshold value 2 gives improved results of 99.34%, 94.12%, and 93.90% for the development database, Database 1, and Database 2 respectively, compared to the existing distance metrics BIC, KLD, and T-test. The evaluation shows that the proposed method lucidly separates the speakers and scores the highest F-measure, which is very close to that of the T-test distance metric. The loud clapping at the beginning and end of the speech is also clearly detected. Compared with the manually segmented frames, speech segments and clapping of short duration (less than 5 seconds) were not correctly detected; this can be improved by increasing the sliding window length and frame size.

Fig 7 graphically represents the results obtained after analyzing and evaluating the proposed distance metric and the existing distance metrics BIC, KLD, and T-test on the three databases. When the speech length is less than 5 seconds, as in Database 1 and Database 2, some segments are not detected properly, resulting in a lower F-measure. In the development database, however, the speech lengths are greater than 5 seconds and most speaker change points are detected, resulting in a higher F-measure. Overall, the proposed system shows a better F-measure than the traditional methods [35,36].

Fig 7. Performance of proposed method for speaker change point detection and traditional distance metrics with various databases.

https://doi.org/10.1371/journal.pone.0314073.g007

3.4 Findings and their significance

The aim of this research is to propose an efficient speaker change point detection model that uses a discrete-energy-operator-based pyknogram and a proposed NCLR distance metric algorithm. This method successfully enhances weak speech signals, suppresses noise, and measures the variability of the spectrum over time for different speakers. The proposed technique yields better F-measure results in speaker segmentation than the existing distance metrics BIC, KLD, and T-test, as shown in Table 2. The proposed method is applicable to forensic speaker recognition, where a speaker's voice is first detected in a multi-speaker audio recording and then segmented to extract the information it carries. The method works well for speech longer than 5 seconds, but extracting speech information from utterances shorter than 3 seconds remains very challenging for speaker diarization.

4. Conclusions

In speech processing, segmentation and clustering algorithms play a vital role in speaker recognition, speaker diarization, word counting, and audio transcription. In this research paper, a novel distance metric algorithm has been proposed to find the boundaries at which the speaker changes in recordings of audio conferences. Two databases were used: development data and test data. Initially, the audio recording of the development data was compressed and denoised, at threshold value 0.03, using the discrete wavelet transform, and then partitioned into frames. The features of each frame were extracted by the pyknogram, in which the logarithm of the weighted average of the instantaneous frequency is calculated. Distances between frames were then obtained by applying the distance metrics BIC, KLD, T-test, and the proposed algorithm, with the help of a sliding window, to detect speaker boundaries. Furthermore, following the same segmentation steps, the test datasets were processed and their results evaluated using the F-measure. It is concluded that the proposed distance metric with the pyknogram at threshold value 2 gives improved results of 99.34%, 94.12%, and 93.90% for the development dataset and the two test datasets respectively, compared to the other distance metrics. In the future, the proposed technique could be extended to handle short utterances (< 3 seconds) in speech processing.

References

  1. Ramola A., Shakya A. K., and Van Pham D., "Study of statistical methods for texture analysis and their modern evolutions," Eng. Reports, vol. 2, no. 4, pp. 1–24, 2020.
  2. Shakya A. K., Ramola A., and Vidyarthi A., "Conversion of Landsat 8 multispectral data through modified private content-based image retrieval technique for secure transmission and privacy," Eng. Reports, vol. 2, no. 12, pp. 1–31, 2020.
  3. Miro X. A., Bozonnet S., Evans N., Fredouille C., Friedland G., and Vinyals O., "Speaker diarization: A review of recent research," IEEE Trans. Audio, Speech Lang. Process., vol. 20, no. 2, pp. 356–370, 2012.
  4. Moattar M. H. and Homayounpour M. M., "A review on speaker diarization systems and approaches," Speech Commun., vol. 54, no. 10, pp. 1065–1103, 2012.
  5. Anusuya M. and Katti S., "Speech recognition by machine: A review," Int. J. Comput. Sci. Inf. Secur., vol. 6, no. 3, pp. 181–205, 2009.
  6. Ajmera J., McCowan I., and Bourlard H., "Robust speaker change detection," IEEE Signal Process. Lett., vol. 11, no. 8, pp. 649–651, 2004.
  7. Juang B. H. and Chen T., "The past, present and future of speech processing," IEEE Signal Process. Mag., vol. 15, pp. 24–48, May 1998.
  8. Parthasarathi S. H. K., Gatica-Perez D., Bourlard H., and Magimai-Doss M., "Privacy-sensitive audio features for speech/nonspeech detection," IEEE Trans. Audio, Speech Lang. Process., vol. 19, no. 8, pp. 2538–2551, 2011.
  9. Lu L., Zhang H. J., and Jiang H., "Content analysis for audio classification and segmentation," IEEE Trans. Speech Audio Process., vol. 10, no. 7, pp. 504–516, 2002.
  10. "Hybrid approach for unsupervised audio speaker segmentation," EUSIPCO, 2006.
  11. Grašič M., Kos M., and Kačič Z., "Online speaker segmentation and clustering using cross-likelihood ratio calculation with reference criterion selection," IET Signal Process., vol. 4, no. 6, p. 673, 2010.
  12. Aggarwal S. et al., "Audio segmentation techniques and applications based on deep learning," Sci. Program., vol. 2022, pp. 1–9, 2022.
  13. Kotti M., Moschou V., and Kotropoulos C., "Speaker segmentation and clustering," Signal Processing, vol. 88, no. 5, pp. 1091–1124, 2008.
  14. Wang D., Vogt R., and Sridharan S., "Eigenvoice modelling for cross likelihood ratio-based speaker clustering: A Bayesian approach," Comput. Speech Lang., vol. 27, no. 4, pp. 1011–1027, 2013.
  15. Kotti M., Benetos E., and Kotropoulos C., "Computationally efficient and robust BIC-based speaker segmentation," IEEE Trans. Audio, Speech Lang. Process., vol. 16, no. 5, pp. 920–933, 2008.
  16. Yella S. H. et al., "Overlapping speech detection using long-term conversational features for speaker diarization in meeting room conversations," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1688–1700, December 2014.
  17. Nguyen T. H., Chng E. S., and Li H., "T-test distance and clustering criterion for speaker diarization," pp. 2–5.
  18. Gangadharaiah R., Narayanaswamy B., and Balakrishnan N., "A novel method for two-speaker segmentation," 8th Int. Conf. Spoken Lang. Process., pp. 2337–2340, 2004.
  19. Reynolds D., Kenny P., and Castaldo F., "A study of new approaches to speaker diarization," Interspeech, pp. 1047–1050, 2009.
  20. Park T. J., Kanda N., Dimitriadis D., Han K. J., Watanabe S., and Narayanan S., "A review of speaker diarization: Recent advances with deep learning," Comput. Speech Lang., vol. 72, 2022.
  21. Benchikh S. et al., "Efficiency evaluation of different wavelets for image compression," 11th Int. Conf. on Information Sciences, Signal Processing and their Applications: Special Sessions, pp. 1420–1421, 2012.
  22. Kaur S. and Sohal J. S., "Speaker change detection using Teager Kaiser energy operator and wavelet transform," SSRG Int. J. Comput. Sci. Eng., vol. 4, no. 7, July 2017.
  23. Kaur S. and Sohal J. S., "Optimized speaker diarization system using discrete wavelet transform and pyknogram," Int. J. on Future Revolution in Comput. Sci. & Commun. Eng., vol. 3, no. 9, pp. 52–58, 2017.
  24. Maragos P., Kaiser J. F., and Quatieri T. F., "Energy separation in signal modulations with application to speech analysis," IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024–3051, 1993.
  25. Shokouhi N., Ziaei A., Sangwan A., and Hansen J. H. L., "Robust overlapped speech detection and its application in word-count estimation for Prof-Life-Log data," IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 4724–4728, 2015.
  26. Sholokhov A., Pekhovsky T., Kudashev O., Shulipa A., and Kinnunen T., "Bayesian analysis of similarity matrices for speaker diarization," pp. 106–110, 2014.
  27. Le V. B., Mella O., and Fohr D., "Speaker diarization using normalized cross likelihood ratio," pp. 1869–1872, 2007.
  28. Stetten G., Horvath S., Galeotti J., Shukla G., Wang B., and Chapman B., "Image segmentation using the Student's t-test and the divergence of direction on spherical regions," Med. Imaging 2010: Image Process., vol. 7623, p. 76233I, 2010.
  29. Gopalakrishnan P. S., "Clustering via the Bayesian information criterion with applications in speech recognition," Proc. IEEE ICASSP, pp. 645–648, 1998.
  30. Gibson E., Hu Y., Huisman H. J., and Barratt D. C., "Designing image segmentation studies: Statistical power, sample size and reference standard quality," Med. Image Anal., vol. 42, pp. 44–59, 2017. pmid:28772163
  31. Le V. B., Mella O., and Fohr D., "Speaker diarization using normalized cross likelihood ratio," Proc. Annu. Conf. Int. Speech Commun. Assoc. INTERSPEECH, vol. 2, pp. 873–876, 2007.
  32. TREC, "Tutorial: Common evaluation measures," NIST, 2006.
  33. Alnuaim A. A. et al., "Speaker gender recognition based on deep neural networks and ResNet50," Wirel. Commun. Mob. Comput., vol. 2022, 2022.
  34. Obuchi Y., "The CMU Audio Databases," March 2003, http://www.speech.cs.cmu.edu/databases/pda
  35. Verma K. et al., "Latest tools for data mining and machine learning."
  36. Singh G., Mantri A., Sharma O., and Kaur R., "Virtual reality learning environment for enhancing electronics engineering laboratory experience," Comput. Appl. Eng. Educ., vol. 29, no. 1, pp. 229–243, 2021.