Fig 1.
Block diagram of the proposed speaker identification system.
Fig 2.
Time-frequency representation of the speech signal.
(A) A typical speech waveform taken from the YOHO database, from which the spectrogram and neurogram are produced; (B) the corresponding spectrogram; and (C) the corresponding neurogram.
Fig 3.
Illustration of the effects of noise on the neural responses.
Neurogram responses are shown for a typical speech signal taken from the YOHO database. The neurogram for the clean speech signal is shown in panel A, and the neurograms in response to the same signal distorted by two levels of white Gaussian noise are shown in panels B (10 dB SNR) and C (0 dB SNR).
Fig 4.
Illustration of the effects of noise on the neural responses.
Neural responses were simulated to five clean speech signals taken from the YOHO database and to the corresponding noisy signals at SNRs of 0 and 15 dB with three types of noise: white Gaussian, pink, and street noise. Correlation coefficients were calculated between the responses to the clean and the corresponding noisy signals at each CF, and the results are shown for 25 CFs (up to the phase-locking range of ~4 kHz). Panel A shows the mean and standard deviation of the correlation coefficients for white Gaussian noise; the corresponding results for pink and street noise are shown in panels B and C, respectively.
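The per-CF correlation analysis described above can be sketched as follows. This is a minimal illustration with synthetic arrays, not the paper's implementation: the function name `per_cf_correlation` and the toy neurogram data are assumptions, and real neurograms would come from an auditory-nerve model.

```python
import numpy as np

def per_cf_correlation(clean_neurogram, noisy_neurogram):
    """Pearson correlation between clean and noisy neurogram rows.

    Each neurogram is a 2-D array of shape (n_cf, n_time_bins):
    one time-varying neural response per characteristic frequency
    (CF). Returns one correlation coefficient per CF.
    """
    coeffs = []
    for clean_row, noisy_row in zip(clean_neurogram, noisy_neurogram):
        # np.corrcoef returns the 2x2 correlation matrix; the
        # off-diagonal entry is the coefficient between the two rows.
        coeffs.append(np.corrcoef(clean_row, noisy_row)[0, 1])
    return np.array(coeffs)

# Toy example: 25 CFs (phase-locking range up to ~4 kHz), 200 time bins.
rng = np.random.default_rng(0)
clean = rng.standard_normal((25, 200))
noisy = clean + 0.5 * rng.standard_normal((25, 200))  # simulated additive noise
r = per_cf_correlation(clean, noisy)
print(r.shape)  # (25,) -- one coefficient per CF
```

Averaging such per-CF coefficients over several signals, as in Fig 4, then gives the mean and standard deviation plotted for each noise type.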
Fig 5.
Speaker identification performance of the proposed and existing methods using the YOHO database.
Two frequency ranges are considered. Left panels: narrowband, in which features corresponding to frequencies below ~1 kHz are used for SI evaluation; right panels: wideband, in which features from the full range of frequencies (up to ~4 kHz) are considered. Results are shown as a function of SNR for three types of noise (A: white Gaussian noise, B: pink noise, and C: street noise). Speech samples from 137 speakers were used to evaluate and compare the performance of the different methods.
Fig 6.
Speaker identification performance of the proposed and existing methods using the TIMIT database.
Left panels: performance is shown for features corresponding to frequencies below ~1 kHz; right panels: features from the full range of frequencies (up to ~4 kHz) are considered for SI evaluation. Results are shown as a function of SNR for three types of noise (A: white Gaussian noise, B: pink noise, and C: street noise). Speech samples from 100 speakers were used to evaluate and compare the performance of the different methods.
Fig 7.
Speaker identification performance of the proposed and existing methods using the TIDIGIT database.
Left panels show SI performance using features extracted from the lower frequencies (narrowband: <1 kHz), and right panels show performance using features from the wideband frequencies. Results are shown as a function of SNR for three types of noise (A: white Gaussian noise, B: pink noise, and C: street noise). Speech samples from 40 speakers were used to evaluate and compare the performance of the different methods.
Fig 8.
Text-dependent speaker identification performance of the proposed and existing methods using the UM database.
Two frequency bands were considered: left panels show the performance of the SI systems using features from the narrowband frequencies (<1 kHz), and right panels show performance for the wideband frequencies. Results are shown as a function of SNR for three types of noise (A: white Gaussian noise, B: pink noise, and C: street noise). Speech samples from 39 speakers were used to evaluate and compare the performance of the different methods.
Table 1.
The effect of window resolution on the performance of the proposed SI system in quiet and under noisy conditions.
The experiment was conducted with YOHO speech materials taken from the first 32 speakers.