Abstract
Purpose
Talker identification is a crucial auditory skill that underpins human social communication and forensic applications. However, real-world conditions pose several challenges, such as environmental noise, channel variability, language familiarity, and talker familiarity, that can undermine the accuracy of auditory identification. In light of the limitations and insights from previous studies, the present study employed auditory experiments to systematically examine the impact of these four adverse factors on talker identification.
Methods
The study aimed to address two questions: (1) whether the independent and interactive effects among these factors are significant, and (2) whether lab-training can enhance talker identification accuracy. Using a voice line-up paradigm, this study conducted a perception experiment in which speech stimuli were presented under four primary conditions: noise (No Noise vs. Noise), channel (High-quality vs. High-quality; Landline vs. Landline; High-quality vs. Landline), language (Mandarin, Reversed Mandarin, English, Reversed English), and speaker familiarity (assessed through listening tests and lab-training). Auditory responses to the stimuli under these adverse conditions were collected from 53 listeners.
Results
The findings indicate that environmental noise and channel variability have significantly negative effects on talker identification, while intelligible speech yields superior performance under adverse conditions compared to unintelligible reversed speech. Furthermore, the study found that lab-training (i.e., increasing talker familiarity) could enhance talker identification accuracy under adverse conditions, although it did not improve accuracy under the no-noise and high-quality channel conditions.
Conclusion
This paper systematically examines the interactive effects of multiple adverse factors on talker identification, thereby advancing our understanding of the auditory mechanisms underlying human social speech communication and providing important theoretical support for auditory examination techniques in forensic speaker identification.
Citation: Fan N, Geng P, Li Z, Guo H (2026) Talker identification under adverse auditory conditions-The impacts of noise, channel, language, and familiarity. PLoS One 21(2): e0339396. https://doi.org/10.1371/journal.pone.0339396
Editor: Gauri Mankekar, LSU Health Shreveport, UNITED STATES OF AMERICA
Received: August 6, 2025; Accepted: December 7, 2025; Published: February 23, 2026
Copyright: © 2026 Fan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The data analyzed during the current study has been uploaded as supplemental material.
Funding: This research was supported by grants from the Ministry of Finance of the People’s Republic of China (GY2024G-5 to P.G.) and the Shanghai Education Science Research Project “Shanghai Universities Philosophy and Social Sciences Research Special Project” (2025ZSS007 to N.F.). No additional external funding was received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Talker identification is a crucial auditory skill underpinning human social communication, from the early recognition of caregivers in infancy to the complex interactions of adulthood [1,2], and extending to its significant role in forensic applications [3]. In forensic contexts, reliable voice identification is essential for judicial proceedings, drawing on both the naturalistic judgments of ear witnesses and the systematic evaluations conducted by forensic experts [3,4]. However, real-world conditions introduce several challenges, such as environmental noise, channel variability, language familiarity, and talker familiarity, that can undermine the accuracy of auditory identification. Consequently, examining the impact of these adverse auditory conditions is critical not only for refining cognitive models of speech processing but also for advancing effective and robust applications in both communicative and forensic settings.
1.1 Effect of noise on talker identification
Research has confirmed that environmental noise significantly disrupts critical acoustic cues [5,6], thereby potentially impairing talker identification accuracy. However, the number of such studies is limited, and the existing research yields inconsistent results regarding the impact of noise on talker identification. It has been reported that, in three distinct noise environments (i.e., speech-shaped noise, multi-talker babble, and a single, unfamiliar competing talker), identification accuracy declined as the signal-to-noise ratio (SNR) decreased across all noise conditions, with the most pronounced reduction occurring under multi-talker babble [7]. Similarly, Mamun et al. [8] found that both cochlear implant users and healthy controls experienced significant declines in talker identification accuracy when exposed to speech-shaped noise.
Other studies have highlighted a more complex influence of noise: while aged and hearing-impaired female listeners did not show significant changes under noise or competing-talker conditions, hearing-impaired male listeners were significantly affected [9,10]. Furthermore, Kanber et al. [11] found that in a four-talker babble environment, there was no significant difference in identification accuracy between familiar and unfamiliar talkers, with both conditions averaging around 80% accuracy. A review of previous studies also indicates that, regardless of whether the listener is normal-hearing or hearing-impaired, and regardless of talker familiarity, identification accuracy rarely exceeds 90% under various noisy conditions [8–11].
1.2 Effect of channel variability on talker identification
Channel variability is another common factor influencing daily speech communication and forensic talker identification (e.g., recordings from landline phones and high-definition mobile phones). One fundamental impact of landline phone use, for instance, is its limited frequency range of 400–3400 Hz [12]; this restriction can affect the transmission of crucial acoustic cues, such as F0 and formant frequencies below 400 Hz and above 3400 Hz [3,13], thereby potentially compromising accurate talker identification. However, only a few studies have examined this factor to date. One study found that channel (i.e., landline vs. mobile phones) significantly affected talker identification accuracy (approximately 74%), with its negative impact surpassing that of language and dialect (81%–86%) [3].
Moreover, the authors pointed out that research on multi-factor interactions in talker identification (e.g., various languages mixed with channel variability) remains extremely scarce. A more recent study further revealed that consonant-based talker identification is not affected by channel variability (i.e., full-band, telephone-band, and non-telephone-band recordings), whereas vowel-based talker identification is significantly influenced by the channel [14].
1.3 Effect of language familiarity on talker identification
The language familiarity effect is one of the most popular and controversial topics in talker identification research, and it is a key focus of the current study. The central debate in the extant literature concerns whether language intelligibility exerts an influence on talker identification. Specifically, researchers have questioned whether talker identification necessitates language comprehension [15] or whether it can be accomplished without an understanding of the language [16].
The argument in favor of language-independent talker identification originally emerged from early neuropathological research. For instance, patients with receptive aphasia, characterized by impaired language comprehension, can still recognize speakers, whereas patients with phonagnosia lose the ability to identify talkers despite intact language comprehension [17]. Fleming et al. [16] further substantiated this perspective through a perceptual experiment employing backward Chinese and English sentences. Their findings indicated that, although the reversed sentences were largely unintelligible, native English speakers did not exhibit significant cross-language differences in talker identification accuracy; in other words, enhanced familiarity with English phonology did not translate into improved identification performance. Another study reported a similar finding: no significant difference was observed in a talker similarity rating task based on forward and backward speech [18].
Conversely, using a paradigm similar to that of Fleming et al. [16], Perrachione et al. [19] reported results that strongly suggest talker identification is contingent upon language comprehension. In support of this view, several studies involving infants, individuals with dyslexia, and second-language learners have demonstrated that auditory talker identification is facilitated by language comprehension; that is, listeners are generally more adept at discriminating between speakers when the linguistic context is familiar [15,20–22]. Mary Zarate et al. [23] extended this line of inquiry by examining talker identification among native English speakers using a range of stimuli, including non-linguistic sounds, Chinese, German, pseudo-English, and English. Their results revealed a progressive improvement in identification accuracy correlating with increased language familiarity (non-linguistic < Chinese < German < pseudo-English < English). Similarly, other studies demonstrated that talker identification accuracy was significantly higher for rhyming word pairs (e.g., “day-bay”) compared to unrelated word combinations (e.g., “day-bee”), thereby underscoring the role of phonological familiarity [24,25].
1.4 Effect of speaker-familiarity/training on talker identification
Another factor influencing talker identification is speaker familiarity. In recent years, researchers have examined the impact of familiarity by comparing the performance of listeners with familiar versus unfamiliar speakers and by employing lab-training paradigms. The majority of studies report that listeners demonstrate significantly higher accuracy when identifying familiar voices compared to unfamiliar ones [26–30]. Nevertheless, even though familiar talkers are identified more accurately, listener performance is not invariably flawless [27,28]. One plausible explanation for the speaker familiarity effect is that listeners are able to extract distinctive acoustic features or leverage prior knowledge associated with familiar speakers [31].
Furthermore, the potential of lab-training to enhance talker identification accuracy has attracted attention only over the past two decades. Several investigations have demonstrated that perceptual training can lead to improvements in talker identification accuracy [32–34]. Kanber et al. [11] compared recognition accuracy among personally familiar voices, lab-trained voices, and unfamiliar voices, and found that brief training (i.e., 5–10 minutes) was sufficient to enhance identification performance. In contrast, another study reported that training does not consistently yield improvements in talker identification; specifically, training benefits observed with foreign-language talkers were restricted to the trained speaker set and did not generalize to novel foreign-language voices [35]. Similarly, McLaughlin et al. [36] found no significant enhancement in talker identification accuracy following training in conditions involving an unfamiliar language.
1.5 The present study
In summary, existing research on talker identification under adverse conditions (i.e., environmental noise, channel variability, language familiarity, and talker familiarity) remains limited in quantity, and the findings continue to be contentious. While the effects of these four adverse factors have been investigated individually, to the best of our knowledge their combined influence on talker identification has yet to be examined. Consequently, in light of the insights and gaps in the current literature, the present study aims to address two primary questions:
- (1). What are the individual and interactive effects of noise, channel variability, and language familiarity on talker identification?
- (2). Can lab-training designed to enhance talker familiarity improve talker identification accuracy under adverse conditions?
2. Method
The research was approved by the Committee for the Protection of Human Subjects (CPHS) at the Academy of Forensic Science (Shanghai, China) [No. 2023−15]. All participants were informed about the study’s purpose, provided written informed consent, and received financial compensation after completing the experiment. Participants were informed that they could withdraw from the experiment at any time. All participants were recruited between March and April 2025.
2.1 Participants
A preliminary power analysis was conducted via the pwr package in R [37,38]. It indicated that a sample size greater than 21.10 was needed to detect a large effect size (Cohen’s f = 0.4; [39]) with a significance level of 0.05 and statistical power of 0.80. Consequently, a total of 53 native Mandarin speakers (33 females, 20 males) participated in this study. All participants were undergraduate or graduate students recruited from universities in China. All participants used English as their second language and had passed the CET-4 (College English Test), indicating an intermediate level of English proficiency. Additionally, none of the participants had received professional auditory training (e.g., musical training) that might bias their auditory perception. The female participants had a mean age of 26.21 years (SD = 3.36), and the male participants had a mean age of 27.36 years (SD = 6.74). None of the participants reported a history of speech or hearing impairments. Upon completion of the study, participants were provided with appropriate financial compensation.
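The reported threshold can be reproduced from first principles. As a minimal sketch, assuming a one-way design with three groups (a hypothetical choice; the paper does not report the design passed to pwr), the noncentral-F computation underlying R's pwr.anova.test can be written as:

```python
# Sketch: required per-group n for a one-way ANOVA power analysis,
# mirroring the noncentral-F logic behind R's pwr::pwr.anova.test.
# ASSUMPTION: k = 3 groups is hypothetical; the paper does not state the design.
from scipy import stats

def anova_power(n_per_group, k=3, f=0.4, alpha=0.05):
    """Power of a one-way ANOVA via the noncentral F distribution."""
    N = n_per_group * k
    df1, df2 = k - 1, N - k
    nc = f**2 * N                              # noncentrality, lambda = f^2 * N
    f_crit = stats.f.ppf(1 - alpha, df1, df2)  # critical F at the alpha level
    return 1 - stats.ncf.cdf(f_crit, df1, df2, nc)

# Smallest integer n per group reaching 80% power
n = 2
while anova_power(n) < 0.80:
    n += 1
print(n, round(anova_power(n), 3))
```

Under this assumption, the continuous solution falls near the reported 21.10 per group, which the study's 53 participants comfortably exceed.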
2.2 Stimuli
Four female native speakers aged 31–38 years (SD = 3.16) were recruited to record the speech stimuli for this study. All speakers were fluent in standard Mandarin, used English as their second language, and had passed the CET-4, indicating intermediate English proficiency. Additionally, none of the speakers had a history of speech or hearing impairments.
As shown in Table 1, eight target sentences were constructed in both Chinese and English versions, with each sentence comprising 4–11 words. All speech stimuli were recorded in a sound-attenuated room using a high-quality digital recorder (i.e., SONY PCM-D100). Additionally, during a telephone call initiated from an iPhone 14 Pro Max, simultaneous recordings were acquired using a landline telephone (i.e., Motorola C7501RC). The digital recorder and the iPhone 14 Pro Max were positioned 30 cm from the speakers’ mouths. Prior to recording, the speakers were given ample opportunity to familiarize themselves with the materials and practice as needed. They were instructed to articulate each target sentence in their habitual neutral voice twice. Considering the natural variability in a speaker’s acoustic features even when uttering identical content [40], and to maintain ecological validity with daily communication and forensic contexts, different rounds of utterances were used if two sequentially presented speech stimuli originated from the same speaker. All recordings were saved in WAV format at a 44.1 kHz sampling rate and 16-bit resolution. In total, 4 (speakers) × 8 (target sentences) × 2 (languages: Chinese and English) × 2 (repetitions) × 2 (channels: digital recorder and landline phone) = 256 recordings were collected.
The speech stimuli were first normalized to 70 dB and subsequently reversed using Praat software [41]. Consequently, four categories of speech stimuli (i.e., Mandarin, Mandarin-reverse, English, English-reverse) were created to examine the influence of language familiarity on talker identification. These stimuli were then divided into two groups to assess the impact of channel variability. Specifically, those recorded using the digital recorder (including both forward and reversed versions) were designated as High-quality (H), while the recordings obtained via the landline telephone were labeled as Landline (L).
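These two preprocessing steps can be sketched in NumPy as follows. The study itself used Praat's built-in functions; the synthetic tone below is only a stand-in signal, and the 2e-5 Pa reference is the convention Praat uses for its dB scale.

```python
# Sketch of the two preprocessing steps: intensity normalization to 70 dB
# and time reversal. The study used Praat; this NumPy version is illustrative.
import numpy as np

REF = 2e-5  # reference pressure (Pa) underlying the dB scale

def scale_intensity(x, target_db=70.0):
    """Scale a waveform so its RMS intensity equals target_db."""
    rms = np.sqrt(np.mean(x**2))
    target_rms = REF * 10 ** (target_db / 20)
    return x * (target_rms / rms)

def reverse(x):
    """Time-reverse a waveform (Praat: Reverse)."""
    return x[::-1]

# Example on a synthetic 1 s, 44.1 kHz tone
t = np.arange(44100) / 44100
tone = 0.1 * np.sin(2 * np.pi * 220 * t)
norm = scale_intensity(tone)
db = 20 * np.log10(np.sqrt(np.mean(norm**2)) / REF)
print(round(db, 2))  # 70.0
```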
To further investigate the effects of noise on talker identification, high-quality speech stimuli across all four categories were synthesized with a mixed noise component, following a methodology analogous to that employed in previous speech-in-noise perception tasks (e.g., [42,43]). Previous studies have frequently employed sine waves and broadband noise (e.g., white noise) in the investigation of speech-in-noise perception, revealing that both exert a masking effect on the transmission of speech information [44–48]. To emulate as closely as possible the impact of noise on speech perception in realistic interference scenarios, the present study generated a composite noise signal by combining sine waves and white noise, employing the default formula integrated within Praat (i.e., 1/2 * sin (2π × 377 × x) + randomGauss (0, 0.1)) at a sampling rate of 44.1 kHz. For all speech materials under noise conditions, the signal-to-noise ratio (SNR) was maintained at 0 dB.
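The noise synthesis and mixing can be sketched as below. The sine-plus-Gaussian expression follows the Praat formula quoted above; the power-based scaling used to fix the mix at 0 dB SNR is one standard approach and is our assumption, not the paper's actual script.

```python
# Sketch of the composite noise (sine + white noise) and 0 dB SNR mixing.
# The scaling step is a standard SNR-fixing method, assumed for illustration.
import numpy as np

FS = 44100
rng = np.random.default_rng(0)

def composite_noise(n_samples, fs=FS):
    """1/2 * sin(2*pi*377*x) + randomGauss(0, 0.1), with x in seconds."""
    x = np.arange(n_samples) / fs
    return 0.5 * np.sin(2 * np.pi * 377 * x) + rng.normal(0, 0.1, n_samples)

def mix_at_snr(speech, noise, snr_db=0.0):
    """Scale noise so 10*log10(P_speech / P_noise) == snr_db, then add."""
    p_speech = np.mean(speech**2)
    p_noise = np.mean(noise**2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

speech = 0.05 * np.sin(2 * np.pi * 150 * np.arange(FS) / FS)  # stand-in signal
noisy = mix_at_snr(speech, composite_noise(FS), snr_db=0.0)
```

At 0 dB SNR the speech and noise carry equal power, so the scaled noise power in the mix matches the speech power exactly.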
To examine the influence of speaker familiarity on talker identification, a lab-training paradigm was employed in the auditory perceptual experiment. All speech stimuli from the speakers were divided into two sessions (1–4 target sentences for the listening test; 5–8 target sentences for the lab-training test). For the listening test, stimuli representing the four categories (i.e., Mandarin, Mandarin-reverse, English, English-reverse) under adverse noise (i.e., No Noise vs. Noise) and channel conditions (i.e., High-quality vs. High-quality; Landline vs. Landline; High-quality vs. Landline) were utilized. In the lab-training test, listeners were first exposed to the four categories of speech stimuli in adverse noise and channel conditions from a single speaker twice, after which they completed a perceptual talker identification task for that speaker. This procedure was conducted sequentially for all four speakers. The speech stimuli of the two sessions (i.e., listening test and lab-training test) for the talker identification experiment are shown in Table 2.
It is important to note that, to limit experimental sessions to approximately 40 minutes and maintain participant engagement and attention, the stimulus set of the current study has several limitations. For instance, it included only four female talkers, used a relatively high signal-to-noise ratio (SNR = 0 dB), and featured just two training rounds. These limitations necessitate caution when interpreting the study’s results, as they may constrain the generalizability of the findings. Nevertheless, the study systematically examines interactions among multiple adverse factors in talker identification, offers key insights into auditory talker identification patterns under complex adverse conditions, and lays groundwork for understanding how listeners process talker information amid combined auditory challenges. Future research can build on these findings by conducting more targeted, comprehensive investigations to address these constraints.
2.3 Procedure
The perceptual experiment was conducted in a sound-attenuated room. Each participant sat in front of a laptop monitor and adjusted the screen to a position that allowed clear visibility. Professional high-quality headphones (Sennheiser HD650 and Audio-Technica ATH-M70x) were used. The experiment was run in PsychoPy [49].
The procedure for the perceptual talker identification experiment is illustrated in Fig 1. For the listening test session, each trial began with a 500-millisecond red fixation cross. Subsequently, two stimuli (i.e., either from the same speaker or from different speakers) were presented in a voice line-up paradigm, separated by a 400-millisecond silent interval. Participants were then required to select one of three response options (i.e., Same, Different, or Unclear) based on the stimuli they heard. For the lab-training test session, the procedure commenced with two rounds of auditory training using the speech stimuli from a single speaker (as shown in Table 2). Following the training phase, participants completed a talker identification task that followed the same procedure as the listening test session. The entire experiment lasted approximately 30–45 minutes. To mitigate auditory fatigue, participants were permitted to take breaks at any time during the session. Before formal data collection, participants were given instructions for the perceptual experiment and completed two practice trials to familiarize themselves with the procedure. Subsequently, perceptual data for each stimulus were collected from all 53 participants.
2.4 Data analysis
Two generalized logistic regression analyses were conducted for the listening test session using the afex package [50] in R software [38] to investigate the impact of adverse auditory conditions on talker identification accuracy. In these models, each stimulus’s perceptual judgment was re-coded as a binary outcome (0 for an incorrect response, 1 for a correct response; “unclear” responses were excluded from the analysis) and served as the dependent variable. For one model, the independent variables were noise (No Noise vs. Noise) and language (Mandarin, Mandarin-reverse, English, English-reverse); for the other, they were channel (High-quality vs. High-quality; Landline vs. Landline; High-quality vs. Landline) and language. The models were constructed using the following formulas: Answer ~ Noise * Language + (1 | Speaker) + (1 | Listener); Answer ~ Channel * Language + (1 | Speaker) + (1 | Listener).
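The recoding step can be sketched as follows. The trial records and field names are hypothetical, and the actual models were fit in R (afex); the sketch only shows how "Unclear" responses are excluded and the remaining judgments scored as a binary outcome.

```python
# Sketch of the response recoding: "Unclear" responses are excluded and the
# remaining judgments are scored 0/1 against the ground truth. The toy data
# and field names are hypothetical; the models themselves were fit in R.
trials = [
    {"truth": "Same",      "response": "Same"},
    {"truth": "Different", "response": "Same"},
    {"truth": "Same",      "response": "Unclear"},
    {"truth": "Different", "response": "Different"},
]

scored = [
    {**t, "answer": int(t["response"] == t["truth"])}
    for t in trials
    if t["response"] != "Unclear"          # exclude "Unclear" responses
]

print([t["answer"] for t in scored])  # [1, 0, 1]
```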
Furthermore, two additional generalized logistic regression analyses were performed to assess the effect of lab-training on talker identification under adverse auditory conditions. In these analyses, perceptual accuracy, coded as 0 or 1, was the dependent variable. For one model, the independent variables were familiarity (Listening test vs. Lab-training test), noise, and language; for the other, they were familiarity, channel, and language. These models were specified as follows: Answer ~ Train * Noise * Language + (1 | Speaker) + (1 | Listener); Answer ~ Train * Channel * Language + (1 | Speaker) + (1 | Listener).
Random intercepts for speaker and listener, as well as by-listener random slopes for noise, channel, and language, were initially included in all models to support a maximal random-effects structure [51]. The likelihood ratio test was used to assess the contribution of the random slopes, which indicated that the slopes were not significant in any of the model fits. Consequently, to maintain model simplicity, the random slopes were removed from all models. Tukey’s HSD post hoc tests were subsequently performed for pairwise comparisons [52], and odds ratios were reported as the measure of effect size.
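The likelihood ratio comparison between the models with and without random slopes can be sketched generically. The log-likelihood values and the degrees-of-freedom difference below are hypothetical, chosen only to show the mechanics of the test.

```python
# Sketch of a likelihood ratio test comparing nested mixed models (with vs.
# without random slopes). The log-likelihoods here are hypothetical values.
from scipy import stats

def lr_test(ll_reduced, ll_full, df_diff):
    """LR statistic 2*(ll_full - ll_reduced) against chi-square(df_diff)."""
    lr = 2 * (ll_full - ll_reduced)
    p = stats.chi2.sf(lr, df_diff)
    return lr, p

# A small improvement in fit that does not reach significance, which would
# justify dropping the slope term (as was done in the present analyses).
lr, p = lr_test(ll_reduced=-1520.6, ll_full=-1519.8, df_diff=2)
print(round(lr, 2), round(p, 3))  # 1.6 0.449
```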
3. Results
The average accuracies for the talker identification task under noise (No Noise vs. Noise), channel (High-quality vs. High-quality [HH]; Landline vs. Landline [LL]; High-quality vs. Landline [HL]), and language conditions (Mandarin [M], Mandarin-reverse [M-reverse], English [E], English-reverse [E-reverse]) across the two sessions (Listening Test vs. Lab-training) are presented in Figs 2 and 3. As shown in Table 3, a descriptive comparison indicated that noise, poorer signal transmission (i.e., Landline vs. Landline), and channel discrepancies (i.e., High-quality vs. Landline) all resulted in reduced talker identification accuracy. Although forward speech yielded significantly superior talker identification performance compared to backward speech, no language familiarity effect was observed (i.e., talker identification accuracies were comparable for Mandarin and English). Additionally, lab-training (i.e., higher speaker familiarity) moderately improved talker identification accuracy. In terms of response times, longer identification times were observed under poorer signal transmission and channel difference conditions. Following lab-training, response times decreased across all conditions.
To further illustrate the impact of these adverse auditory conditions on talker identification, two generalized logistic regression models were conducted. The results of these models (as shown in S1 Appendix) revealed significant main effects of “Noise”, “Channel” and “Language”, as well as significant two-way interaction effects of “Noise × Language” and “Channel × Language” (p < 0.05) on talker identification accuracy. In instances where a higher-order interaction effect was significant, the corresponding main effects and lower-order interaction effects were not interpreted.
As shown in S1 Appendix, the results of Tukey-HSD post hoc analysis for the two-way interaction effect of “Noise × Language” demonstrated that, (1) for reversed speech, talker identification accuracies were significantly higher in the no noise condition than in the noise condition, while for forward speech, there was no significant difference in talker identification accuracies between the no noise and noise conditions; (2) forward speech yielded higher identification accuracies than reversed speech under both the no noise and noise conditions. Additionally, significantly lower talker identification accuracy was observed for reversed Mandarin speech under the noise condition compared to reversed English.
The results of the Tukey HSD post hoc analysis examining the two-way interaction effect of “Channel × Language” are presented in S1 Appendix. Overall, talker identification accuracy was highest for the High-quality vs. High-quality condition. In addition, identical channel conditions (i.e., High-quality vs. High-quality and Landline vs. Landline) generally yielded higher accuracy than mismatched channel conditions (i.e., High-quality vs. Landline) across all language conditions, with two exceptions: no significant difference was observed for reversed English between the High-quality vs. High-quality and Landline vs. Landline conditions, and for reversed Mandarin between the High-quality vs. Landline and Landline vs. Landline conditions. Furthermore, the post hoc analyses revealed that, (1) for High-quality vs. High-quality condition, forward speech exhibited significantly higher identification accuracy than reversed speech; (2) for Landline vs. Landline condition, reversed Mandarin speech showed lower accuracy than the other three language conditions; and (3) for High-quality vs. Landline condition, English speech showed higher accuracy than the other three language conditions.
Considering the impact of speaker familiarity on talker identification, the results from two generalized logistic regression models are shown in S1 Appendix. These models identified significant main effects of “Familiarity”, “Noise”, “Channel” and “Language”, significant two-way interaction effects of “Familiarity × Language” and “Channel × Language”, and significant three-way interaction effects of “Familiarity × Noise × Language” and “Familiarity × Channel × Language” (p < 0.05) on talker identification accuracy.
The Tukey HSD post hoc analysis for the three-way interaction of “Familiarity × Noise × Language” revealed that talker identification accuracy was significantly higher in the Lab-training session compared to the Listening test session only for reversed English in the no noise condition {β = 0.40, SE = 0.18, t = 2.22, p = 0.03, OR = 1.45 (95% CI: [1.03, 2.03])} and reversed Mandarin in the noise condition {β = 0.90, SE = 0.16, t = 5.60, p < 0.001, OR = 2.30 (95% CI: [1.70, 3.12])}.
Furthermore, as shown in S1 Appendix, the Tukey HSD analysis for the three-way interaction of “Familiarity × Channel × Language” indicated that the improvement in identification accuracy following lab-training (i.e., higher speaker familiarity) was observed exclusively in the poor signal transmission condition (i.e., Landline vs. Landline), with the exception of English speech. Additionally, a positive impact of speaker familiarity was found for Mandarin speech in the High-quality vs. Landline condition and for reversed English speech in the High-quality vs. High-quality condition.
4. Discussion
The current study aims to investigate talker identification under adverse auditory conditions and whether lab-training can enhance listeners’ performance. Through perceptual experiments of talker identification conducted in two sessions (i.e., Listening Test and Lab-training Test), the study found that adverse auditory conditions, specifically environmental noise, channel variability, and speaker familiarity, have a significant impact on talker identification. Moreover, although language familiarity did not have a significant effect on talker identification, forward speech yielded significantly higher identification accuracy compared to reversed (i.e., unintelligible) speech.
4.1 Complex interactive effects of adverse conditions on talker identification
In experiments involving speech mixed with noise, this study found that noise exerted a significant adverse effect only on reversed speech, with no such effect observed for forward speech. This finding contrasts with studies on forward speech, which have reported a decline in identification accuracy due to noise [7,8], but it further supports the view that noise exerts a complex influence on talker identification [10,11]. In light of evidence suggesting that reversed speech does not facilitate the formation of short-term memory representations of the speaker for the listener [53–55], we propose that talker identification under the reversed speech condition may be more susceptible to noise. Conversely, as listeners are better able to establish short-term memory of the speaker through forward speech, they may experience less interference from noise during talker identification. This hypothesis, however, requires further systematic auditory and neuroimaging investigations, potentially employing 1-back or n-back experimental paradigms to assess the difficulty of recalling short-term memory for talker identification under conditions of varying speech intelligibility.
The results regarding channel variability support previous fragmented findings [3], showing that talker identification accuracy declines significantly under poor signal transmission (i.e., Landline vs. Landline) and across different channels (i.e., High-quality vs. Landline), following a descending order of High-quality vs. High-quality > Landline vs. Landline > High-quality vs. Landline. Moreover, consistent with previous suggestions that language and channel may exhibit complex interactive effects [3,14], the current study found that in the High-quality vs. High-quality condition, the accuracy for forward speech was superior to that of reversed speech, whereas in the Landline vs. Landline and High-quality vs. Landline conditions, reversed Mandarin and English speech stimuli displayed higher accuracies, respectively. Both the current study and Wang et al.’s work [14] confirm that adverse conditions interact in complex ways rather than through a simple linear summation. This finding underscores the need for future research to build upon these fragmented observations and to conduct more systematic, in-depth investigations into the interactive effects between language and channel.
Surprisingly, this study did not find a significant effect of language familiarity on talker identification. Although language familiarity remains one of the most controversial topics in the literature, most studies have reported significant effects [19,21,33]. The current research revealed that, regardless of the presence or absence of noise, forward speech yielded higher talker identification rates than reversed speech; additionally, forward speech outperformed reversed speech in the High-quality vs. High-quality condition, and the four language categories (i.e., Mandarin, English, reversed Mandarin, and reversed English) exhibited a complex pattern under poor signal transmission and cross-channel conditions. Based on these results, the intelligibility of speech may play a more critical role in talker identification than phonological familiarity [36]. However, the effect of language familiarity on talker identification under varying channel conditions remains complex and warrants further investigation.
4.2 Modest improvements of lab-training on talker identification
Consistent with previous studies [32–34], the current study found that increased speaker familiarity (after lab-training) led to modest improvements in talker identification accuracy under adverse auditory conditions (e.g., reversed speech, noise, poor signal transmission, and different channels), with overall gains of approximately 3–4%. Kanber et al. [11] argued that 5–10 minutes of training was sufficient to raise lab-trained voice identification performance above 80%. By contrast, the lab-training in the current study yielded only limited improvements in talker identification (see Fig 2), potentially because the training comprised only two rounds. Future research could investigate more systematically how different training durations influence talker identification.
Notably, the current study also found interactive effects between speaker familiarity and noise, channel, and language, as evidenced by the inconsistency of the training effect across conditions. While improvements occurred under adverse conditions, listeners’ accuracy for intelligible speech (no noise, high-quality channels) did not improve with training (see Table 3). This inconsistency confirms that the training benefits were small and selective, limited to adverse auditory scenarios rather than generalizing across conditions.
4.3 Implications for forensic speaker identification
In forensic practice, speech is often recorded under varying conditions of noise, channel, and language [56–58]. Auditory examination constitutes a critical component of the acoustic-phonetic paradigm used in forensic speaker identification [4,59]. Therefore, the findings of the current study offer tentative implications for such examinations. First, talker identification is significantly impaired under adverse auditory conditions, which necessitates careful attention to judicial examination procedures. Potential interventions may include speech denoising and signal simulation techniques designed to present speech for identity judgment in conditions that are as optimal as possible [58,60]. Furthermore, when forensic experts encounter unintelligible speech or unfavorable signal conditions, repeated perceptual training to enhance familiarity with the target speaker could yield small improvements in identification accuracy. However, it is important to note that these benefits are not universally observed.
Several limitations of this study warrant discussion. First, this research examined the identification of speech from only four female talkers. Previous studies have reported potential gender differences in talker identification (e.g., male listeners showing higher identification accuracy for male talkers; [61]). Future studies should include male talkers to test whether these effects generalize. Second, given the significant application of talker identification in forensic contexts, it remains an interesting question whether forensic experts differ from untrained listeners. Lastly, to deepen our understanding of the mechanisms underlying talker identification, further research employing neuroscience and brain-imaging techniques is necessary to corroborate the present findings.
5. Conclusion
This study examined the effects of adverse auditory conditions (i.e., environmental noise, channel variability, language familiarity, and speaker familiarity) on talker identification. The findings indicate that both environmental noise and channel variability negatively impact talker identification. In particular, when the channel transmits poor signals or varies in nature, the accuracy of talker identification is significantly reduced. Furthermore, intelligible language demonstrates superior recognition performance under adverse conditions compared to unintelligible language, and this effect appears to be independent of phonological familiarity. Finally, lab-training designed to enhance speaker familiarity moderately improves talker identification accuracy under adverse auditory conditions, while it has no effect on accuracy under no-noise and high-quality conditions. This study systematically examined the interactive effects of multiple factors on talker identification, thereby enriching our understanding of the underlying auditory mechanisms under various auditory conditions and providing important theoretical support for auditory examination techniques in forensic speaker identification.
Supporting information
S1 Appendix. The results of the statistical analysis.
https://doi.org/10.1371/journal.pone.0339396.s001
(DOCX)
S1 File. The datasets analyzed in the current study.
https://doi.org/10.1371/journal.pone.0339396.s002
(CSV)
References
- 1. Cooper A, Paquette-Smith M, Bordignon C, Johnson EK. The influence of accent distance on perceptual adaptation in toddlers and adults. Language Learning and Development. 2022;19(1):74–94.
- 2. Drozdova P, van Hout R, Scharenborg O. Talker-familiarity benefit in non-native recognition memory and word identification: The role of listening conditions and proficiency. Atten Percept Psychophys. 2019;81(5):1675–97.
- 3. Betancourt KS, Bahr RH. The influence of signal complexity on speaker identification. The International Journal of Speech, Language and the Law. 2011;17(2):179–200.
- 4. Morrison GS, Enzinger E. Introduction to forensic voice comparison. The Routledge Handbook of Phonetics. Routledge. 2019. p. 599–634.
- 5. Lecumberri MLG, Cooke M, Cutler A. Non-native speech perception in adverse conditions: A review. Speech Communication. 2010;52(11–12):864–86.
- 6. Leibold LJ. Speech perception in complex acoustic environments: developmental effects. J Speech Lang Hear Res. 2017;60(10):3001–8. pmid:29049600
- 7. Razak A, Thurston EJ, Gustainis LE, Kidd G, Swaminathan J, Perrachione TK. Talker identification in three types of background noise. J Acoust Soc Am. 2017;141:4039.
- 8. Mamun N, Ghosh R, Hansen JHL. Familiar and unfamiliar speaker recognition assessment and system emulation for cochlear implant users. J Acoust Soc Am. 2023;153(2):1293. pmid:36859118
- 9. Best V, Ahlstrom JB, Mason CR, Roverud E, Perrachione TK, Kidd G Jr, et al. Talker identification: Effects of masking, hearing loss, and age. J Acoust Soc Am. 2018;143(2):1085. pmid:29495693
- 10. Best V, Ahlstrom JB, Mason CR, Perrachione TK, Kidd G, Dubno JR. Effects of age and hearing loss on talker identification and talker change detection. J Acoust Soc Am. 2023;153:A285–A285.
- 11. Kanber E, Lavan N, McGettigan C. Highly accurate and robust identity perception from personally familiar voices. J Exp Psychol Gen. 2022;151(4):897–911. pmid:34672658
- 12. Künzel HJ. Beware of the ‘telephone effect’: the influence of telephone transmission on the measurement of formant frequencies. Forensic Linguist. 2001;8:80–99.
- 13. Hillenbrand J, Getty LA, Clark MJ, Wheeler K. Acoustic characteristics of American English vowels. J Acoust Soc Am. 1995;97(5 Pt 1):3099–111. pmid:7759650
- 14. Wang X, Ge J, Meller L, Yang Y, Zeng F-G. Speech intelligibility and talker identification with non-telephone frequencies. JASA Express Lett. 2024;4(7):075202. pmid:39046893
- 15. Perrachione TK, Del Tufo SN, Gabrieli JDE. Human voice recognition depends on language ability. Science. 2011;333(6042):595. pmid:21798942
- 16. Fleming D, Giordano BL, Caldara R, Belin P. A language-familiarity effect for speaker discrimination without comprehension. Proc Natl Acad Sci U S A. 2014;111(38):13795–8. pmid:25201950
- 17. Garrido L, Eisner F, McGettigan C, Stewart L, Sauter D, Hanley JR, et al. Developmental phonagnosia: a selective deficit of vocal identity recognition. Neuropsychologia. 2009;47(1):123–31. pmid:18765243
- 18. Furbeck K, Thurston EJ, Tin J, Perrachione TK. Perceptual similarity judgments of voices: Effects of talker and listener language, vocal source acoustics, and time-reversal. J Acoust Soc Am. 2018;143:1923.
- 19. Perrachione TK, Furbeck KT, Thurston EJ. Acoustic and linguistic factors affecting perceptual dissimilarity judgments of voices. J Acoust Soc Am. 2019;146(5):3384. pmid:31795676
- 20. Fecher N, Johnson EK. The native-language benefit for talker identification is robust in 7.5-month-old infants. J Exp Psychol Learn Mem Cogn. 2018;44(12):1911–20. pmid:29698034
- 21. Fecher N, Johnson EK. Developmental improvements in talker recognition are specific to the native language. J Exp Child Psychol. 2021;202:104991. pmid:33096370
- 22. Johnson EK, Bruggeman L, Cutler A. Abstraction and the (Misnamed) language familiarity effect. Cogn Sci. 2018;42(2):633–45. pmid:28744902
- 23. Zarate JM, Tian X, Woods KJP, Poeppel D. Multiple levels of linguistic and paralinguistic features contribute to voice recognition. Sci Rep. 2015;5:11475. pmid:26088739
- 24. Narayan CR, Mak L, Bialystok E. Words get in the way: linguistic effects on talker discrimination. Cogn Sci. 2017;41(5):1361–76. pmid:27445079
- 25. Quinto A, Abu El Adas S, Levi SV. Re‐examining the effect of top‐down linguistic information on speaker‐voice discrimination. Cognitive Science. 2020;44(10).
- 26. Johnson J, McGettigan C, Lavan N. Comparing unfamiliar voice and face identity perception using identity sorting tasks. Q J Exp Psychol (Hove). 2020;73(10):1537–45. pmid:32530364
- 27. Lavan N, Burston LFK, Garrido L. How many voices did you hear? Natural variability disrupts identity perception from unfamiliar voices. Br J Psychol. 2019;110(3):576–93. pmid:30221374
- 28. Lavan N, Burston LF, Ladwa P, Merriman SE, Knight S, McGettigan C. Breaking voice identity perception: Expressive voices are more confusable for listeners. Q J Exp Psychol (Hove). 2019;72(9):2240–8. pmid:30808271
- 29. Lavan N, Kreitewolf J, Obleser J, McGettigan C. Familiarity and task context shape the use of acoustic information in voice identity perception. Cognition. 2021;215:104780. pmid:34298232
- 30. Stevenage SV, Symons AE, Fletcher A, Coen C. Sorting through the impact of familiarity when processing vocal identity: Results from a voice sorting task. Q J Exp Psychol (Hove). 2020;73(4):519–36. pmid:31658884
- 31. Njie S, Lavan N, McGettigan C. Talker and accent familiarity yield advantages for voice identity perception: A voice sorting study. Mem Cognit. 2023;51(1):175–87. pmid:35274221
- 32. Hollien H, Didla G, Harnsberger JD, Hollien KA. The case for aural perceptual speaker identification. Forensic Sci Int. 2016;269:8–20. pmid:27855301
- 33. Lloy A, Johnson K, Babel M. Examining the roles of language familiarity and bilingualism in talker recognition. The 13th International symposium on Bilingualism. 2021. 87–140. https://www.khiajohnson.com/pdfs/lloy-johnson-babel-isb13-abstract.pdf
- 34. Perrachione TK. Recognizing speakers across languages. In: Frühholz S, Belin P, editors. The Oxford Handbook of Voice Perception. Oxford: Oxford University Press; 2019. https://academic.oup.com/edited-volume/38687/chapter/335931302
- 35. Lee JJ, Tin JA, Perrachione TK. Foreign language talker identification training does not generalize to new talkers. J Acoust Soc Am. 2020;148:2763.
- 36. McLaughlin DE, Carter YD, Cheng CC, Perrachione TK. Hierarchical contributions of linguistic knowledge to talker identification: Phonological versus lexical familiarity. Atten Percept Psychophys. 2019;81(4):1088–107. pmid:31218598
- 37. Champely S, Ekstrom C, Dalgaard P, Gill J, Weibelzahl S, Anandkumar A. pwr: Basic functions for power analysis. R package. 2020.
- 38. R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2020. https://www.R-project.org/
- 39. Cohen J. Things I have learned (so far). Am Psychol. 1990;45:1304–12.
- 40. Jacewicz E, Fox RA, Wei L. Between-speaker and within-speaker variation in speech tempo of American English. J Acoust Soc Am. 2010;128(2):839–50. pmid:20707453
- 41. Boersma P, Weenink D. Praat: Doing phonetics by computer. 2021. http://www.praat.org/
- 42. Nilsson M, Soli SD, Sullivan JA. Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. J Acoust Soc Am. 1994;95(2):1085–99. pmid:8132902
- 43. Sharma S, Tripathy R, Saxena U. Critical appraisal of speech in noise tests: a systematic review and survey. Int J Res Med Sci. 2017;5:13–21.
- 44. Galdos M, Simons C, Fernandez-Rivas A, Wichers M, Peralta C, Lataster T, et al. Affectively salient meaning in random noise: a task sensitive to psychosis liability. Schizophr Bull. 2011;37(6):1179–86. pmid:20360211
- 45. Roberts B, Summers RJ, Bailey PJ. The perceptual organization of sine-wave speech under competitive conditions. J Acoust Soc Am. 2010;128(2):804–17. pmid:20707450
- 46. Rosen S, Hui SNC. Sine-wave and noise-vocoded sine-wave speech in a tone language: Acoustic details matter. J Acoust Soc Am. 2015;138(6):3698–702. pmid:26723325
- 47. Slater J, Skoe E, Strait DL, O’Connell S, Thompson E, Kraus N. Music training improves speech-in-noise perception: Longitudinal evidence from a community-based music program. Behav Brain Res. 2015;291:244–52. pmid:26005127
- 48. Souza P, Rosen S. Effects of envelope bandwidth on the intelligibility of sine- and noise-vocoded speech. J Acoust Soc Am. 2009;126(2):792–805. pmid:19640044
- 49. Peirce J, Gray JR, Simpson S, MacAskill M, Höchenberger R, Sogo H, et al. PsychoPy2: Experiments in behavior made easy. Behav Res Methods. 2019;51(1):195–203. pmid:30734206
- 50. Singmann H, Bolker B, Westfall J, Aust F, Ben-Shachar MS. afex: Analysis of factorial experiments. R package. 2015.
- 51. Barr DJ, Levy R, Scheepers C, Tily HJ. Random effects structure for confirmatory hypothesis testing: Keep it maximal. J Mem Lang. 2013;68(3):10.1016/j.jml.2012.11.001. pmid:24403724
- 52. Lenth R. Emmeans: Estimated Marginal Means, aka Least-Squares Means. 2020. https://CRAN.R-project.org/package=emmeans
- 53. Dougherty SC, Mclaughlin DE, Perrachione TK. A language familiarity effect for talker identification in forward but not time-reversed speech. J Acoust Soc Am. 2015;137:2415.
- 54. El Adas SA, Levi SV. Phonotactic and lexical factors in talker discrimination and identification. Atten Percept Psychophys. 2022;84(5):1788–804. pmid:35641859
- 55. Kreitewolf J, Wöstmann M, Tune S, Plöchl M, Obleser J. Working-memory disruption by task-irrelevant talkers depends on degree of talker familiarity. Atten Percept Psychophys. 2019;81(4):1108–18. pmid:30993655
- 56. Hollien HF. Forensic voice identification. Academic Press; 2002.
- 57. Lindh J. Forensic comparison of voices, speech and speakers. J R Stat Soc Ser C Appl Stat. 2017;53:109–22.
- 58. Fraser H, Aubanel V, Maher RC, Mawalim C, Wang X, Poc̆ta P, et al. Forensic speech enhancement: toward reliable handling of poor-quality speech recordings used as evidence in criminal trials. J Audio Eng Soc. 2024;72(11):748–53.
- 59. Rana S, Qureshi MA. A comprehensive review of forensic phonetics techniques. ABBDM. 2024;4(02).
- 60. Ekpenyong M, Obot O. Speech quality enhancement in digital forensic voice analysis. In: Computational Intelligence in Digital Forensics: Forensic Investigation and Applications. Springer; 2014. p. 429–51.
- 61. Skuk VG, Schweinberger SR. Gender differences in familiar voice identification. Hear Res. 2013;296:131–40. pmid:23168357