Abstract
Objective
To evaluate the advancements in speech intelligibility testing over the recent decades, with a particular emphasis on the development of audiovisual speech in noise tests that incorporate both auditory and visual modalities for the measurement of speech recognition thresholds.
Design
A scoping review was conducted systematically to examine the existing literature on speech intelligibility testing methods. Following a comprehensive screening process, studies were selected for detailed analysis, focusing on audiovisual integration and the potential for remote or automated administration within study methodologies.
Study Sample
The review encompassed 11 scholarly articles that investigated diverse approaches to speech intelligibility testing.
Results
The analysis revealed variability in the accuracy and reliability of speech intelligibility testing methods. Although certain methods demonstrated efficacy in incorporating audiovisual cues, none of the reviewed studies included provisions for remote administration, thereby necessitating the presence of a clinician for test execution. This limitation underscores the imperative for further research into remote testing methodologies that leverage audiovisual technologies to assess speech in noise.
Conclusions
The findings of this review underscore the critical need for advancement in speech intelligibility testing methodologies, particularly those integrating audiovisual components and enabling remote administration. Development in this domain holds significant potential to enhance the assessment and implementation of assistive technologies for individuals with hearing impairments.
Citation: Hussain A, Goman AM, Gogate M, Dashtipour K, Kirton-Wingate J, Hussain Z, et al. (2026) Audio-visual speech-in-noise tests for evaluating speech reception thresholds: A scoping review. PLoS One 21(1): e0338600. https://doi.org/10.1371/journal.pone.0338600
Editor: Vidya Ramkumar, Sri Ramachandra Institute of Higher Education and Research (Deemed to be University), INDIA
Received: July 6, 2025; Accepted: November 25, 2025; Published: January 27, 2026
Copyright: © 2026 Hussain et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No new data were generated in this scoping review. All data supporting the findings of this study were obtained from previously published articles. The study selection process followed the PRISMA-ScR guidelines, and the list of included studies is provided within the manuscript.
Funding: This research is supported by the UK EPSRC COG-MHEAR programme (Grant No. EP/M026981/1). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Speech perception is a complex multimodal process that integrates both auditory and visual cues, exemplifying multimodal or multisensory integration, wherein various unisensory modalities such as sight, hearing, or touch are combined. Research has demonstrated that language processing is highly interactive, involving the combination of diverse information sources [1]. Speech, generated by the vocal apparatus, is filtered through the configuration of articulatory organs. There is an inherent and perceptible link between the auditory and visual properties of speech, since articulators such as lips, teeth, and the tongue visibly contribute to the process [2–4].
Extensive research in this domain has investigated the phenomenon by which listeners unconsciously engage in lip reading to enhance speech intelligibility in noisy environments [5–8]. Speech intelligibility assessments are generally conducted in clinical or research laboratory settings, where controlled conditions are maintained. However, this presents challenges when attempting to generalise these findings to real-world listening environments, which are characterised by variables such as ambient noise, environmental factors such as wind and machinery noise, and presence of multiple concurrent conversations [9].
In the context of audiology and hearing loss, it is currently estimated that around 5% of the global population, equivalent to 430 million people, experience hearing impairment [10]. The prevalence of hearing loss increases markedly with age, with over 25% of individuals aged 60 and above affected by disabling hearing loss, which refers to hearing loss greater than 40 dB in the better-hearing ear in adults and greater than 30 dB in the better-hearing ear in children [10,11]. Despite continuous advancements in research and technology, hearing assessments currently rely predominantly on audio-only (AO) methodologies.
Pure tone audiometry (PTA) is the standard procedure for the identification and assessment of hearing loss and its severity. This diagnostic approach enables clinicians to accurately determine the extent of hearing impairment, thereby facilitating informed counselling and the provision of tailored recommendations to patients. PTA is widely regarded as the gold standard and most frequently employed test for the detection of hearing loss [12,13]. Although PTA provides essential data regarding a listener's hearing sensitivity, Parmar et al. [14] found that hearing healthcare professionals view speech testing as particularly valuable for offering patients relatable insights into their functional hearing abilities. This information is crucial for guiding hearing aid fittings and constitutes a vital component of the comprehensive diagnostic test battery. These assessments are conducted under both aided and unaided conditions, encompassing evaluations in quiet as well as in the presence of background noise.
Speech-in-noise (SIN) tests are effective tools for assessing hearing loss across diverse populations and languages. A wide array of commercially available AO SIN tests currently exists, many of which are suitable for both adults and children. AO SIN tests evaluate performance at the sentence, word, or phonemic level and include both adaptive and fixed signal-to-noise ratio (SNR) formats. Fixed SNR tests include the Connected Speech Test (CST) [15] and the Speech Perception in Noise Test (SPIN) [16]. Adaptive tests, such as the Hearing In Noise Test (HINT) [17], Quick Speech In Noise (QSIN) [18], the Words-in-Noise Test (WIN) [19] and the Bamford–Kowal–Bench (BKB) SIN test [20], are commonly employed in clinical settings. These tests typically involve a target speaker delivering the specific material (sentence, word, or phoneme) amid background noise, which varies depending on the test. Background noises range from multi-talker babble to speech-shaped noise; adaptive tests feature a variable noise or speech level, while fixed SNR tests maintain constant levels. Stimuli may be presented through headphones to provide ear-specific results or via soundfield loudspeakers. Participants are instructed to repeat what they hear, and clinicians score the responses based on the number of accurately recognised keywords, determining either the percentage of correct words or the speech recognition threshold at 50% intelligibility, depending on the test employed.
Although the benefits of employing speech testing for guiding hearing aid fitting and as a component within a diagnostic test battery are well recognised, the widespread adoption of this practice remains limited. Certain countries [14], such as Canada and India, recommend speech testing as an essential component of audiology practice, whereas others, such as the UK, do not. Parmar et al. [14] have identified a lack of clinical time, inadequate training, and insufficient equipment as key factors contributing to the limited implementation of speech testing within a diagnostic battery, a trend particularly evident in the UK and likely contributing to the observed global variability in service provision [14].
In addition to the limited adoption of SIN testing, current testing protocols exhibit several other limitations, including their failure to replicate real-world scenarios [21] and their lack of integration of visual cues, which could enable listeners to benefit from combined auditory and visual information. The incorporation of visual cues into speech perception tests has been extensively investigated, with a substantial body of research establishing that speech comprehension is significantly enhanced when both auditory and visual modalities are engaged [5,7,22,23]. This research consistently demonstrates the advantages of incorporating visual cues, particularly in environments where auditory signals are degraded. For instance, a study conducted by [7] with normal-hearing adults revealed a marked improvement in word recognition when both auditory and visual cues were presented, compared to auditory input alone. In particular, the inclusion of visual speech information was found to enhance performance to a level equivalent to a 15 dB increase in SNR over AO conditions. Similarly, Gagné and Wittich [24] underscored the importance of visual cues for older adults with hearing loss, reporting an average 18% improvement in speech recognition when visual information was incorporated. These findings underscore the imperative to integrate visual cues into SIN testing protocols, particularly for populations where auditory processing alone may be insufficient, thus ensuring more accurate assessments of speech perception in realistic listening environments. Another critical consideration when integrating visual elements into speech tests is the specific contribution of visual input and the extent of its benefit. As noted by Tye-Murray et al. [25], lipreading (visual-only speech perception) plays a significant role in audiovisual (AV) speech perception, accounting for up to 60% of the variance in individual AV speech perception measures.
Expanding on the integration of visual cues in speech testing, it is also essential to consider the role of the Speech Reception Threshold (SRT) and other speech perception measures in evaluating auditory performance. The SRT is a widely utilised metric in AO SIN tests that determines the minimum SNR at which a listener can correctly identify speech 50% of the time. Fixed SNR tests, which assess the percentage of correct responses at a predetermined SNR, offer the advantage of providing a straightforward evaluation of hearing aid benefit, facilitating patients' comprehension. However, a critical limitation of these tests is the difficulty of selecting an appropriate SNR. If the SNR is set too low, the results may underestimate the true benefit of the hearing aids. Conversely, if the SNR is set too high, the perceived benefit may be overstated. This issue is particularly salient among high-performing cochlear implant users, where traditional fixed SNR tests may result in ceiling effects, thus failing to accurately differentiate levels of performance. In contrast, adaptive SRT tests, which dynamically adjust the SNR based on the listener's performance, offer a more nuanced assessment capable of better distinguishing performance across a wide range of abilities [26]. This adaptive approach is essential for capturing the full spectrum of auditory processing capabilities and ensuring that the results are both meaningful and reflective of real-world listening conditions. While percentage-based tests and SRT measurements offer different perspectives on speech perception abilities, SRT testing can provide more fine-grained insights into specific aspects of hearing performance. Both approaches are interconnected, as an SRT can be derived from a psychometric function of percentage correct versus decibel level, and conversely, a percentage-versus-decibel curve can be calculated from an adaptive SRT test.
The choice between these methods often depends on the specific experimental focus and goals. The implementation of adaptive SRT tests in noise would serve as a valuable tool for assessing hearing capabilities and comparing hearing aid systems in both clinical and research settings, as these tests reveal variation in psychometric function slopes and offer a more comprehensive range of performance levels [26,27].
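To make the adaptive principle concrete, the sketch below simulates a one-down/one-up staircase, the simplest form of the adaptive procedures described above. The listener model, step size, and parameter values are illustrative assumptions rather than those of any reviewed test: SNR is lowered after each correct response and raised after each error, so the track hovers around the 50% intelligibility point, the SRT.

```python
import math
import random

def simulated_listener(snr_db, srt_db=-6.0, slope=0.6):
    """Logistic psychometric function: the probability of a correct
    response rises with SNR and equals 0.5 exactly at srt_db."""
    p_correct = 1.0 / (1.0 + math.exp(-slope * (snr_db - srt_db)))
    return random.random() < p_correct

def adaptive_srt(trials=200, start_snr_db=0.0, step_db=2.0, seed=1):
    """One-down/one-up staircase: decrease SNR after a correct response,
    increase it after an error; estimate the SRT as the mean SNR over
    the second half of the track, once it has converged."""
    random.seed(seed)
    snr, track = start_snr_db, []
    for _ in range(trials):
        track.append(snr)
        snr += -step_db if simulated_listener(snr) else step_db
    tail = track[len(track) // 2:]
    return sum(tail) / len(tail)
```

Running `adaptive_srt()` yields an estimate close to the simulated listener's true SRT of −6 dB SNR; clinical procedures refine this basic scheme with variable step sizes and reversal-based averaging.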
Traditionally, hearing services have predominantly been provided in hospitals or clinics, where testing is conducted by healthcare professionals. This centralised model of care presents notable limitations in accessibility, particularly for certain populations and specific circumstances. The shortage of qualified professionals, especially in low- and middle-income countries, constitutes a significant barrier to access [28]. Recent global events, such as the COVID-19 pandemic, have underscored another critical limitation of clinic-based services: their susceptibility to disruption during public health crises. Safety measures, including lockdowns implemented to control the spread of infectious diseases, can severely impede access to conventional hearing healthcare services [29]. In response to these limitations and the need to enhance accessibility to hearing assessments, there has been an increasing shift towards the development of remote and mobile-based AO SIN tests. These innovative approaches seek to democratise access to hearing screening and assessment tools, enabling individuals to undergo preliminary evaluations without necessitating in-person clinic visits. Several AO SIN tests have been adapted or newly developed with remote capabilities, using internet-based platforms or smartphone applications [30–32]. These tools hold significant potential for the widespread screening and monitoring of hearing health, particularly in underserved areas or during periods when physical access to healthcare facilities is restricted.
This comprehensive review aims to critically examine the evolution and current state of speech intelligibility testing, with a particular focus on SRT assessments that have integrated AV elements over recent decades. The scoping review will systematically identify and evaluate studies that have incorporated visual components into SRT assessments, exploring their methodologies, outcomes, and potential clinical applications. Additionally, the review will investigate the extent to which these AV SRT tests have been adapted for remote administration, addressing a critical gap in accessibility for individuals in geographically isolated regions or those with technological or mobility constraints that hinder access to traditional clinical settings. Furthermore, this review will not only provide a comprehensive overview of the current landscape of AV SRT testing but will also critically analyse the potential of these methods to enhance diagnostic accuracy, thereby informing the development of an AV SIN test. Drawing on the findings of this scoping study, recommendations can be formulated regarding the applicability of existing methodologies or stimuli to the development of a new remote British English AV SIN test. Consequently, this review underscores the necessity for future research to focus on developing and validating AV speech tests that can be administered remotely or via mobile applications, while ensuring the reliability and validity of clinical assessments. Such advancements hold the potential to significantly improve the accessibility, comprehensiveness, and patient-centredness of hearing healthcare services and could facilitate the future development of multimodal hearing aids by manufacturers.
Methods
A scoping review was conducted, following the methodology outlined by Arksey and O'Malley [33], which comprises five stages: (1) identifying the research question, (2) identifying pertinent studies, (3) selecting relevant studies, (4) charting the data, and (5) collating, summarising, and reporting the results. This review adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) guidelines [34].
Identifying the research question
The primary research question addressed was: 'Have any previously developed or researched AV SIN tests been used to measure SRT?' A secondary question investigated which of these tests incorporated remote or automated functionalities.
Identifying relevant studies
In this phase of the study, we sought to establish the criteria for selecting publications to be included in the scoping review. Although scoping studies are inherently broad in scope, we deliberately identified specific criteria to guide our search process. The search strategy utilised for electronic databases was formulated based on our research questions and the key concepts underpinning the study. Prior to commencing the searches, two authors (AH and AG) reached a consensus on the relevant keywords for article retrieval. Searches were conducted from the earliest records available up to February 18, 2025. The databases selected for this review included the Cochrane Library, IEEE Xplore, PubMed, Science Direct, and Web of Science, chosen for their comprehensive coverage of topics pertinent to health, engineering, social sciences, and psychology. Repositories containing grey literature were excluded from this systematic search due to the absence of peer review, which raises concerns regarding the authenticity, reliability, and reproducibility of the included work [35]. This exclusion was enforced to ensure that the retrieved literature would directly contribute to addressing the research question. The search terms employed in the database queries included “speech in noise,” “speech intelligibility,” “speech perception,” “speech recognition in noise,” “audiovisual,” “audio-visual,” and “auditory-visual.” These terms were consistently applied across all databases using Boolean operators to structure the queries. Specifically, the key terms within each axis were combined using the “OR” operator, and the search strategies for the two axes were linked using the “AND” operator. The search commands were as follows: (“speech in noise” OR “speech intelligibility” OR “speech perception” OR “speech recognition in noise”) AND (“audiovisual” OR “audio-visual” OR “auditory-visual”). To minimise the risk of excluding relevant studies, no restrictions were placed on publication dates or languages.
Study selection
All studies published in English that addressed AV SIN tests using SRT as a measurement were considered for inclusion in this review. The primary objective was to identify any AV SIN tests that had been developed, without imposing any restrictions on the date range for inclusion. Additionally, no limitations were placed on sample size or study design, as the focus was on analysing the methodologies and developments employed in previous research within the scope of this review.
Incomplete papers, opinion pieces, book chapters, editorials, and grey literature were excluded from further review. The management of all search result citations, including the identification and merging of duplicates, was conducted using Paperpile (Paperpile LLC, 2024) software. Subsequently, a three-stage selection process was implemented to evaluate the articles.
The inclusion criteria were established based on the following parameters: (1) studies involving AV SIN testing, (2) no age restriction for participants, (3) measurement of the SRT, (4) no limitation on the language of stimuli, and (5) articles written in English. Articles were excluded if they: (1) reported measurements without SRT, (2) were not written in English, or (3) consisted of clinical commentaries, editorials, interviews, letters, newspaper articles, abstracts only, or non-peer-reviewed literature (e.g., theses).
During stage 1, two authors (AH and KD) independently reviewed the titles and abstracts of all articles related to AV SIN tests. In the second stage, the full-text articles were meticulously examined by the same authors (AH and KD) to assess their eligibility. The final stage involved a third author (AG) who evaluated the articles flagged by the previous reviewers, due to discrepancies, resolving any disagreement to make the final determination on inclusion.
Ultimately, the fully eligible articles were selected for further analysis and synthesis (Fig 1).
Charting the data
The subsequent phase involved systematically organising the data extracted from the primary research reports under review. The data were structured according to the main research question to emphasise the primary findings. Each paper was read multiple times to develop a comprehensive understanding of its aims, objectives, and findings and to ensure that no relevant information was overlooked. The following information was systematically documented from each article: author(s), year of publication, study aim, research context, study participants, outcome measures, stimuli, language, test procedures and main findings.
Collating, summarising and reporting results
Papers meeting the inclusion criteria were meticulously examined, with their content subjected to rigorous analysis and assessment. Recurring themes, the various methodologies employed for recording the AV material, and the procedures utilised for determining SRT were critically scrutinised and presented in the following section.
Results
Study selection
Articles were retrieved from five online databases, resulting in an initial set of 5261 papers. After eliminating duplicates and excluding irrelevant studies, 3536 articles remained. Title screening then led to the exclusion of 3340 articles, followed by abstract screening which further excluded 109 articles. This process resulted in a final selection of 87 articles for a comprehensive full-text review. Throughout this process, 76 articles were excluded based on the inclusion criteria due to reasons such as not having an SRT measurement, lack of AV integration or not being written in English. Consequently, the final number of studies included to address the research question was 11.
Study characteristics
Each of the 11 included studies is reviewed below, summarising its aims, methodology and findings (Table 1).
Stimuli and measures
An analysis of the studies revealed substantial variability in both the speech stimuli and the measurement methodologies employed for AV SIN testing. The speech materials ranged from basic digit triplets to more intricate sentence-based stimuli. Several studies utilised standardised speech tests, such as matrix sentence tests, BKB sentences, and IEEE sentences, while others developed bespoke materials, including passages from the CST or novel sentence lists. In terms of measurement, most studies focused on determining SRTs, albeit with varying target percentages. Whilst many studies employed the conventional 50% correct performance criterion, others explored alternative thresholds, such as an 80% SRT, and one study examined multiple thresholds, including 5%, 50%, 80%, and 95%. Furthermore, another study calculated SRTs based on the mean of the final six SNR values. These variations in SRT percentages reflect ongoing efforts to optimise sensitivity and mitigate ceiling effects in AV testing. Adaptive procedures were commonly used for efficient SRT estimation, although some studies incorporated fixed SNRs to characterise performance across diverse listening conditions. Scoring methodologies encompassed both keyword and whole-sentence approaches (Table 2).
Visual integration
The studies reviewed utilised a range of methods to present visual speech information and assess its integration with auditory cues. Several studies employed video recordings of real speakers (Arnold et al. [36]; Bernstein and Grant [37]; Cox et al. [38]; Llorach et al. [39,40]; Van de Rijt et al. [41]; Le Rhun et al. [42]), providing naturalistic visual speech cues that facilitated the integration process. In contrast, other studies employed virtual human speakers (Choudhary et al. [43]; Devesse et al. [44]; Schreitmüller et al. [45]), which, despite sacrificing some realism, allowed for enhanced control over visual elements such as lip contrast and head scaling. Although the extent of visual benefit varied across studies, it was consistently significant, with improvements in SRTs ranging from 1.5 to 5 dB when visual cues were incorporated. Furthermore, Van de Rijt et al. [41] and Bernstein and Grant [37] provided empirical support for the principle of inverse effectiveness, demonstrating that the benefit of visual cues was maximised at intermediate SNRs where AO performance was neither excessively high nor low.
Remote testing
A key finding of this review was the lack of remote testing capabilities in the AV SIN tests examined. None of the 11 studies incorporated methods for administering tests remotely or via telehealth platforms. All tests were conducted in controlled laboratory or clinical environments, with participants required to be physically present for the testing sessions.
Future innovations could leverage virtual reality (VR) to create immersive and standardised testing environments or employ AI-generated, photorealistic avatars to offer precise control over visual speech cues, overcoming the inconsistencies of video recordings.
Discussion
This comprehensive scoping review, focusing on AV SIN tests with SRT measurements, aimed to (i) identify developed AV SIN tests specifically designed to measure SRT, and (ii) evaluate the remote testing capabilities of these assessments. In analysing the search results, we identified the methodologies employed in the development of various tests. Our investigation revealed 11 studies demonstrating considerable variability in functionality, with several requiring further development or validation. This highlights both significant progress in the field and areas necessitating future development. Due to our stringent screening criteria, the number of research studies included in this review is substantially lower than in comparable reviews [46] exploring AV speech perception. However, this focused sample enables us to thoroughly examine the methodologies utilised in previous studies and gain valuable insights into techniques that can be adapted and implemented in future research.
Speech material and masking noise
The reviewed studies demonstrate significant progress in developing AV SIN tests across different languages and populations, using a diverse array of speech materials and masking noise. The studies included matrix sentence tests [39,42,45], which are beneficial due to their highly controlled vocabulary and syntactic structure, ease of adaptability across languages, and extensive number of possible combinations, rendering them suitable for repeated measures. Additionally, more naturalistic sentence materials [44] have been validated and standardised for clinical application, while word-level stimuli (Arnold et al. [36]) represent an alternative approach.
Choudhary et al. [43] used digit triplets, which offer simplicity and rapid administration, thereby minimising linguistic confounds. However, this material does not accurately reflect the complexities of real-world communication challenges, nor does it assess sentence-level processing. MacLeod and Summerfield [40] utilised BKB sentences, which are standardised and widely used in research. Despite their utility, these sentences may be constrained by the limited number of available materials and may not adequately reflect the complexity of adult conversational discourse. Bernstein and Grant [37] utilised IEEE sentences, which are phonetically balanced and commonly employed in speech testing. Although this corpus is known for its low context predictability, it could be limited by the number of sentences available when compared to matrix sentences (720 sentences versus up to 100,000 possible sentence combinations). Cox et al. [38] used material from the CST, which consists of 48 passages of conversational speech. While this test has high ecological validity, the use of conversational speech makes it more difficult to control for linguistic factors, and scoring can be more complex and time-consuming.
This variety reflects a tension between the need for controlled, comparable stimuli and the desire for ecological validity. Future development of AV SIN tests should strategically balance competing factors by incorporating a diverse range of speech materials. These materials should include single words and sentences, thereby enabling a comprehensive assessment of AV speech perception across multiple levels of linguistic complexity, from phoneme recognition to contextual comprehension.
The masking noises used across studies were diverse, including speech-shaped noise, multi-talker babble, and modulated noise; some reflect the complexity of real-world listening environments, while others are tailored to controlled laboratory experiments. While speech-shaped noise offers consistent energetic masking, it may not fully capture the informational masking effects encountered in everyday listening situations. The comparative analysis of various noise types within individual studies, as done by Bernstein and Grant [37], is particularly valuable in elucidating how different masker characteristics interact with AV integration processes. These maskers often fall into two categories: energetic masking, where noise overlaps spectrally with the speech signal, obstructing it at the auditory periphery, and informational masking, which arises from cognitive interference, such as competing speech signals that are perceptually similar to the target. Future AV SIN tests could benefit from incorporating adaptive technologies that manipulate both the type and intensity of background noise to offer a more comprehensive assessment of AV speech perception in challenging listening conditions.
It is critical to note that the studies identified in this review were exclusively conducted in Western countries, utilising materials in English, German, French, and Dutch. This reveals a significant geographic and linguistic gap in the literature. There is a notable absence of AV-SIN test development and validation in languages from Asia, Africa, and South America. Given that linguistic and cultural factors can significantly influence speech perception, the direct application of existing tests to diverse global populations is not appropriate. Future research should prioritise the development of culturally and linguistically adapted AV-SIN tests to ensure that the benefits of this assessment methodology are accessible globally and relevant to diverse patient populations.
Procedural aspects and scoring methods
The review underscored several important methodological considerations for AV SIN testing, including the choice between adaptive procedures and fixed SNR measurements, as well as differences in scoring methods and the number of trials. Adaptive procedures for estimating SRTs offer efficiency and precision but may not capture the full range of performance across different SNRs. The approach taken by Van de Rijt et al. [41] of measuring performance across a fixed range of SNRs provides valuable insights into the shape of the psychometric function for AV speech perception, revealing important effects such as inverse effectiveness [41]. Inverse effectiveness is the principle that the benefit of combining auditory and visual information grows as the individual signals become less reliable.
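The interconnection between fixed-SNR measurements and the SRT noted earlier can be illustrated with a brief sketch: given proportion-correct scores at several fixed SNRs, a logistic psychometric function is fitted and the SRT read off as its 50% point. The data values, parameter ranges, and grid-search approach below are hypothetical choices for illustration, not the fitting method of any reviewed study.

```python
import math

def logistic(snr_db, srt_db, slope):
    """Psychometric function: proportion correct as a function of SNR,
    equal to 0.5 exactly at srt_db."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - srt_db)))

def fit_srt(data):
    """Least-squares grid search over candidate (SRT, slope) pairs;
    kept dependency-free for self-containment. Returns the pair that
    best explains the observed fixed-SNR scores."""
    best_err, best_srt, best_slope = float("inf"), None, None
    for srt10 in range(-150, 51):        # SRT candidates: -15.0 .. 5.0 dB
        for slope10 in range(1, 31):     # slope candidates: 0.1 .. 3.0
            srt, slope = srt10 / 10.0, slope10 / 10.0
            err = sum((p - logistic(s, srt, slope)) ** 2 for s, p in data)
            if err < best_err:
                best_err, best_srt, best_slope = err, srt, slope
    return best_srt, best_slope

# Hypothetical fixed-SNR block results: (SNR in dB, proportion correct).
blocks = [(-12, 0.08), (-9, 0.22), (-6, 0.50), (-3, 0.79), (0, 0.93)]
srt, slope = fit_srt(blocks)   # the fitted 50% point is the SRT
```

Fitting the full function in this way recovers both the threshold and the slope, the latter being the quantity that fixed-SNR designs such as that of Van de Rijt et al. [41] make visible and adaptive tracks alone do not.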
Another crucial aspect to consider is the selection of an appropriate SRT percentage. Studies have demonstrated that a 50% threshold may be too easy to reach when visual cues are present due to the benefits of lipreading [39]. Scoring methods also varied, ranging from keyword scoring for sentences to more detailed phoneme-level scoring, each offering different balances between efficiency and the depth of information obtained. Additionally, studies differed in trial count, with test lists ranging from 10 to 30 sentences, underscoring the need for further research to find the optimal balance between test reliability and administration time.
Future advancements in AV SIN testing could benefit from integrating diverse approaches. Employing adaptive procedures to rapidly estimate SRTs, alongside fixed-SNR measurements to characterise the full psychometric function, may enhance test accuracy. Additionally, incorporating multiple scoring levels could provide a comprehensive assessment, offering both overall intelligibility measures and detailed insights into AV integration.
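To make the adaptive approach concrete, it can be sketched as a simple simulation. The following is a minimal illustration, not a reproduction of any reviewed procedure: a hypothetical listener is modelled with a logistic psychometric function, and a 1-up/1-down staircase adjusts the SNR after each trial, converging on the SNR yielding 50% correct. The function names, slope, step size, and true SRT are all illustrative assumptions.

```python
import math
import random

def p_correct(snr_db, srt_db=-6.0, slope=0.5):
    """Logistic psychometric function: probability of repeating a
    sentence correctly at a given SNR (hypothetical listener)."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - srt_db)))

def adaptive_srt(n_trials=30, start_snr=0.0, step=2.0, seed=1):
    """Simple 1-up/1-down staircase: lower the SNR after a correct
    response, raise it after an error, so the track oscillates around
    the SNR yielding 50% correct. The SRT estimate is the mean of the
    SNRs visited after the initial approach phase."""
    rng = random.Random(seed)
    snr, track = start_snr, []
    for _ in range(n_trials):
        track.append(snr)
        correct = rng.random() < p_correct(snr)
        snr += -step if correct else step
    tail = track[10:]  # discard the descent toward threshold
    return sum(tail) / len(tail)

print(f"Estimated SRT: {adaptive_srt():.1f} dB SNR")  # converges near the true -6 dB
```

This illustrates the trade-off discussed above: the staircase homes in on a single point (here the 50% threshold) quickly, but says nothing about performance at SNRs far from that point, which is why fixed-SNR blocks are needed to trace the full psychometric function.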
Visual integration
The exploration of both video recordings and virtual human speakers for presenting visual speech information signifies an important area of innovation in AV SIN testing. Although video recordings of real speakers currently offer the most naturalistic visual cues, the potential benefits of virtual humans in terms of stimulus control and flexibility make this a promising avenue for future research. An important consideration is the simultaneous recording of the visual and auditory signals, as studies have shown that dubbing audio onto video during post-processing can introduce inconsistencies [39, 42].
The consistent finding of significant visual benefits across studies, with SRT improvements ranging from 1.5 to 5 dB, underscores the importance of incorporating visual cues in speech intelligibility assessments. The evidence for inverse effectiveness observed in several studies has significant implications for test design and the interpretation of results. Future AV SIN tests should be designed to capture this phenomenon by adjusting SNRs to optimise AV integration for each individual. Additionally, further research is needed to examine individual differences in the capacity to integrate auditory and visual speech cues, which have meaningful implications for developing rehabilitative strategies in clinical populations.
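These two quantities, the visual benefit in dB and inverse effectiveness, can be illustrated numerically. The sketch below assumes two logistic psychometric functions with hypothetical thresholds: an audio-only SRT of -4 dB and an audiovisual SRT of -7 dB, i.e. a 3 dB visual benefit within the 1.5 to 5 dB range reported above. It shows that the percentage-point gain from adding visual cues shrinks as the SNR improves, consistent with inverse effectiveness over this range. All parameter values are illustrative, not data from the reviewed studies.

```python
import math

def logistic(snr_db, srt_db, slope=0.4):
    """Proportion of speech correctly repeated as a logistic
    function of SNR (illustrative parameters, not fitted data)."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - srt_db)))

# Hypothetical thresholds: visual cues shift the SRT from -4 dB
# (audio-only, AO) to -7 dB (audiovisual, AV).
SRT_AO, SRT_AV = -4.0, -7.0
print(f"Visual benefit: {SRT_AO - SRT_AV:.1f} dB")  # 3.0 dB

# Inverse effectiveness within this range: the gain from adding
# visual cues is larger at adverse SNRs, where the auditory signal
# alone is less reliable, and shrinks as audio-only performance
# approaches ceiling.
for snr in (-6, -2, 2, 6):
    gain = logistic(snr, SRT_AV) - logistic(snr, SRT_AO)
    print(f"SNR {snr:+d} dB: AV gain = {gain * 100:4.1f} pct points")
```

This also makes the test-design point concrete: if trials are presented only at favourable SNRs, where both curves are near ceiling, the measurable AV gain is small, so SNRs must be chosen to sample the region where integration effects are expressed.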
Remote testing
The absence of remote testing capabilities in current AV SIN assessments highlights a significant gap in the field. As technology has advanced, particularly in recent years, the potential for remote administration of speech perception tests has grown considerably. With the growing importance of telehealth in audiology and the global shift towards remote healthcare delivery [47,50], it is imperative to prioritise the development of AV SIN tests that can be reliably administered in remote settings. Leveraging improved connectivity and emerging telehealth platforms, future research may increasingly incorporate remote testing methods to address this need.
However, the transition to remote AV SIN testing presents several challenges. Ensuring consistent audiovisual presentation across diverse devices and varying internet connections is critical. Additionally, safeguarding test integrity in unsupervised settings poses unique difficulties. Potential solutions may involve the development of specialised software or web-based platforms for AV SIN test delivery, integrating automated calibration procedures to standardise device output, and employing advanced encryption and authentication technologies to protect test materials.
Rigorous research will also be necessary to validate remote versions of AV SIN tests by comparing them with in-person administration to ensure equivalence of results. The successful advancement of remote AV SIN testing capabilities would substantially enhance the accessibility and clinical applicability of these assessments, facilitating broader adoption in both research and clinical practice. This development has the potential to revolutionise speech perception testing, bridging gaps in accessibility and aligning with the broader trends in telehealth innovation.
Strengths and limitations
A primary strength of this scoping review is its specific focus on SRTs measured through AV tests. This focus, however, inherently excluded a substantial body of research that examines AV speech perception without formally quantifying thresholds. Much of the research within the fields of psychology and neuroscience explores behavioural and neurological mechanisms underlying the integration of auditory and visual inputs during language processing without calculating speech reception scores. As a result, our review may have overlooked significant insights that this broader literature on AV speech perception could have provided. Nevertheless, we chose to focus specifically on SRTs because they offer a quantifiable and clinically relevant measure of speech intelligibility, which is particularly important for evaluating and comparing the performance of AV-based hearing assessments and interventions. This strategic focus enabled the identification of AV assessments that were expressly designed and validated to measure speech intelligibility thresholds. Although the SRT criterion resulted in a more limited pool of eligible studies, it permitted a more focused mapping of AV testing methodologies designed to quantify visual speech intelligibility gains. By narrowing the scope in this way, the review highlights the range of existing tools, their key characteristics, and areas where further methodological development or validation may be needed.
Furthermore, the decision to include only studies published in peer-reviewed journals may have introduced a degree of publication bias. Research on AV speech testing that is documented in unpublished manuscripts, conference proceedings, and dissertations was likely excluded. Expanding the scope of future reviews to include this so-called grey literature could yield additional perspectives and insights. An additional advantage of our inclusion criteria was the lack of restrictions regarding publication dates. This inclusive approach provided a comprehensive view of the development trajectory of AV speech testing, encompassing efforts that span several decades. By reviewing foundational works, we were able to identify early pioneers who explored the use of visual cues to enhance speech perception, striving towards the optimisation and validation of AV speech tests.
Similarly, our review identified a lack of research focusing on diverse populations. Although one study developed a test for paediatric use, most of the research centred on monolingual adults with normal hearing or post-lingual hearing loss. The unique challenges faced by multilingual individuals, pre-lingually deaf children, or individuals with comorbid cognitive or visual impairments were largely unaddressed. The applicability of these findings to populations in low-resource settings, where access to technology and clinical expertise is limited, also remains unexplored. This limits the global generalisability of the current body of evidence and underscores the need for research that is more inclusive of diverse participant groups.
Conclusions
In conclusion, this review identified 11 studies, conducted in English, Dutch, French, and German and across specialised research domains, providing a comprehensive evaluation of existing AV SIN tests and methodologies. The analysis revealed substantial variability across studies, highlighting the necessity for further research before these tests can be widely adopted. A key finding emphasised the importance of carefully calibrating scoring percentages and threshold criteria in future test designs to mitigate ceiling and floor effects, which may otherwise limit test sensitivity.
Notably, none of the AV assessments reviewed incorporates remote administration capabilities, underscoring a significant gap in the field. This presents a substantial opportunity to utilise telehealth innovations, thereby increasing access to testing for populations constrained by geographic location, mobility issues, or clinician availability. Careful consideration of the speech materials used in these assessments is also critical to optimising reliability and validity: factors such as the selection of talkers and linguistic complexity play a pivotal role in shaping test outcomes. Finally, the design of adaptive procedures must carefully balance precision in threshold measurement, test efficiency, and the minimisation of participant fatigue or demotivation.
Therefore, the development of a novel AV British English SIN test should prioritise addressing these identified gaps and considerations. The successful development of a remote test of this nature has the potential to transform clinical practice by facilitating more frequent assessments without the need for clinician involvement or specialised testing environments such as soundproof booths. Moreover, such advancements, particularly those integrating telehealth platforms with emerging technologies like virtual reality and artificial intelligence, would be invaluable for researchers focused on developing ecologically valid assessments that address the limitations of current audio-only (AO) protocols and pave the way for truly patient-centred hearing healthcare.
Acknowledgments
I thank all co-authors for their support and the extended team for their helpful advice on this study.
References
- 1. Rosenblum LD. Primacy of multimodal speech perception. In: The handbook of speech perception. 2005. p. 51–78.
- 2. Campbell R. The processing of audio-visual speech: empirical and neural bases. Philos Trans R Soc Lond B Biol Sci. 2008;363(1493):1001–10. pmid:17827105
- 3. Hussain A, Barker J, Marxer R, Adeel A, Whitmer W, Watt R, Derleth P. Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: challenges and opportunities. In: First International Conference on Challenges in Hearing Assistive Technology (CHAT-17), Stockholm, Sweden; 2017.
- 4. Rosenblum LD. Speech perception as a multimodal phenomenon. Curr Dir Psychol Sci. 2008;17(6):405–9. pmid:23914077
- 5. Erber NP. Auditory-visual perception of speech. J Speech Hear Disord. 1975;40(4):481–92. pmid:1234963
- 6. Guellaï B, Streri A, Yeung HH. The development of sensorimotor influences in the audiovisual speech domain: Some critical questions. Front Psychol. 2014;5:812. pmid:25147528
- 7. Sumby WH, Pollack I. Visual contribution to speech intelligibility in noise. J Acoust Soc Am. 1954;26(2):212–5.
- 8. Vatikiotis-Bateson E, Eigsti IM, Yano S, Munhall KG. Eye movement of perceivers during audiovisual speech perception. Percept Psychophys. 1998;60(6):926–40. pmid:9718953
- 9. Miles K, Beechey T, Best V, Buchholz J. Measuring speech intelligibility and hearing-aid benefit using everyday conversational sentences in real-world environments. Front Neurosci. 2022;16:789565. pmid:35368279
- 10. World Health Organization. Deafness and hearing loss. https://www.who.int/news-room/fact-sheets/detail/deafness-and-hearing-loss. 2023. Accessed 2023 December 3.
- 11. Akeroyd MA, Munro KJ. Population estimates of the number of adults in the UK with a hearing loss updated using 2021 and 2022 census data. Int J Audiol. 2024;63(9):659–60. pmid:38747510
- 12. Barlow C, Davison L, Ashmore M. Variation in tone presentation by pure tone audiometers: the potential for error in screening audiometry. In: Proceedings of Euronoise 2015; 2015.
- 13. Barlow C, Davison L, Ashmore M, Weinstein R. Amplitude variation in calibrated audiometer systems in clinical simulations. Noise Health. 2014;16(72):299–305.
- 14. Parmar BJ, Rajasingam SL, Bizley JK, Vickers DA. Factors affecting the use of speech testing in adult audiology. Am J Audiol. 2022;31(3):528–40.
- 15. Cox RM, Alexander GC, Gilmore C. Development of the connected speech test (CST). Ear Hear. 1987;8(5 Suppl):119S-126S. pmid:3678650
- 16. Bilger RC, Nuetzel JM, Rabinowitz WM, Rzeczkowski C. Standardization of a test of speech perception in noise. J Speech Hear Res. 1984;27(1):32–48. pmid:6717005
- 17. Nilsson M, Soli SD, Sullivan JA. Development of the Hearing in Noise Test for the measurement of speech reception thresholds in quiet and in noise. J Acoust Soc Am. 1994;95(2):1085–99. pmid:8132902
- 18. Killion MC, Niquette PA, Gudmundsen GI, Revit LJ, Banerjee S. Development of a quick speech-in-noise test for measuring signal-to-noise ratio loss in normal-hearing and hearing-impaired listeners. J Acoust Soc Am. 2004;116(4 Pt 1):2395–405. pmid:15532670
- 19. Wilson RH. Development of a speech-in-multitalker-babble paradigm to assess word-recognition performance. J Am Acad Audiol. 2003;14(9):453–70. pmid:14708835
- 20. Etymotic Research. Bamford-Kowal-Bench Speech-in-Noise Test (version 1.03). Elk Grove Village, IL: Etymotic Research; 2005.
- 21. Badajoz-Davila J, Buchholz JM. Effect of test realism on speech-in-noise outcomes in bilateral cochlear implant users. Ear Hear. 2021;42(6):1687–98. pmid:34010247
- 22. Arnold P, Hill F. Bisensory augmentation: A speechreading advantage when speech is clearly audible and intact. Br J Psychol. 2001;92 Part 2:339–55. pmid:11802877
- 23. Erber NP. Auditory, visual, and auditory-visual recognition of consonants by children with normal and impaired hearing. J Speech Hear Res. 1972;15(2):413–22. pmid:5047880
- 24. Gagné J-P, Wittich W. Visual impairment and audiovisual speech perception in older adults with acquired hearing loss. In: Hearing care for adults: the challenge of aging. Proceedings of the Second International Conference; 2009. p. 165–77.
- 25. Tye-Murray N, Spehar B, Myerson J, Hale S, Sommers M. Lipreading and audiovisual speech recognition across the adult lifespan: Implications for audiovisual integration. Psychol Aging. 2016;31(4):380–9. pmid:27294718
- 26. Poissant SF, Bero EM, Busekroos L, Shao W. Determining cochlear implant users’ true noise tolerance: Use of speech reception threshold in noise testing. Otol Neurotol. 2014;35(3):414–20. pmid:24518402
- 27. Hu W, Swanson BA, Heller GZ. A statistical method for the analysis of speech intelligibility tests. PLoS One. 2015;10(7):e0132409. pmid:26147290
- 28. Mulwafu W, Ensink R, Kuper H, Fagan J. Survey of ENT services in sub-Saharan Africa: Little progress between 2009 and 2015. Glob Health Action. 2017;10(1):1289736. pmid:28485648
- 29. Hussain A, Hussain Z, Gogate M, Dashtipour K, Ng D, Riaz MS, et al. Impact of the Covid-19 pandemic on audiology service delivery: Observational study of the role of social media in patient communication. PLoS One. 2024;19(4):e0288223. pmid:38662689
- 30. Almufarrij I, Dillon H, Dawes P, Thodi C, Stone M, Charalambous AP, et al. Web- and app-based tools for remote hearing assessment: a scoping review protocol. 2020.
- 31. Motlagh Zadeh L, Brennan V, Swanepoel DW, Lin L, Moore DR. Remote self-report and speech-in-noise measures predict clinical audiometric thresholds. Int J Audiol. 2025;64(6):618–26. pmid:39109478
- 32. Paglialonga A, Polo EM, Zanet M, Rocco G, van Waterschoot T, Barbieri R. An automated speech-in-noise test for remote testing: Development and preliminary evaluation. Am J Audiol. 2020;29(3S):564–76. pmid:32946249
- 33. Arksey H, O’Malley L. Scoping studies: towards a methodological framework. Int J Soc Res Methodol. 2005;8(1):19–32.
- 34. Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA Extension for scoping reviews (PRISMA-ScR): Checklist and explanation. Ann Intern Med. 2018;169(7):467–73. pmid:30178033
- 35. Haddaway NR, Collins AM, Coughlin D, Kirk S. The role of google scholar in evidence reviews and its applicability to grey literature searching. PLoS One. 2015;10(9):e0138237. pmid:26379270
- 36. Arnold L, Boyle P, Canning D. Development of a paediatric audiovisual speech test in noise. Cochlear Implants Int. 2010;11 Suppl 1:244–8. pmid:21756624
- 37. Bernstein JGW, Grant KW. Auditory and auditory-visual intelligibility of speech in fluctuating maskers for normal-hearing and hearing-impaired listeners. J Acoust Soc Am. 2009;125(5):3358–72. pmid:19425676
- 38. Cox RM, Alexander GC, Gilmore C, Pusakulich KM. The Connected speech test version 3: audiovisual administration. Ear Hear. 1989;10(1):29–32. pmid:2470629
- 39. Llorach G, Kirschner F, Grimm G, Zokoll MA, Wagener KC, Hohmann V. Development and evaluation of video recordings for the OLSA matrix sentence test. Int J Audiol. 2022;61(4):311–21. pmid:34109902
- 40. MacLeod A, Summerfield Q. Quantifying the contribution of vision to speech perception in noise. Br J Audiol. 1987;21(2):131–41. pmid:3594015
- 41. van de Rijt LPH, Roye A, Mylanus EAM, van Opstal AJ, van Wanrooij MM. The Principle of inverse effectiveness in audiovisual speech perception. Front Hum Neurosci. 2019;13:335. pmid:31611780
- 42. Le Rhun L, Llorach G, Delmas T, Suied C, Arnal LH, Lazard DS. A standardised test to evaluate audio-visual speech intelligibility in French. Heliyon. 2024;10(2):e24750. pmid:38312568
- 43. Datta Choudhary Z, Bruder G, Welch GF. Visual facial enhancements can significantly improve speech perception in the presence of noise. IEEE Trans Vis Comput Graph. 2023;29(11):4751–60. pmid:37782611
- 44. Devesse A, Dudek A, van Wieringen A, Wouters J. Speech intelligibility of virtual humans. Int J Audiol. 2018;57(12):908–16. pmid:30261770
- 45. Schreitmüller S, Frenken M, Bentz L, Ortmann M, Walger M, Meister H. Validating a method to assess lipreading, audiovisual gain, and integration during speech reception with cochlear-implanted and normal-hearing subjects using a talking head. Ear Hear. 2018;39(3):503–16. pmid:29068860
- 46. Basharat A, Thayanithy A, Barnett-Cowan M. A scoping review of audiovisual integration methodology: Screening for auditory and visual impairment in younger and older adults. Front Aging Neurosci. 2022;13:772112. pmid:35153716
- 47. D’Onofrio KL, Zeng F-G. Tele-audiology: current state and future directions. Front Digit Health. 2022;3:788103. pmid:35083440
- 48. Moher D, Liberati A, Tetzlaff J, Altman DG, PRISMA Group. Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement. Ann Intern Med. 2009;151(4):264–9, W64. pmid:19622511
- 49. MacLeod A, Summerfield Q. A procedure for measuring auditory and audio-visual speech-reception thresholds for sentences in noise: rationale, evaluation, and recommendations for use. Br J Audiol. 1990;24(1):29–43. pmid:2317599
- 50. Eikelboom RH, Bennett RJ, Manchaiah V, Parmar B, Beukes E, Rajasingam SL, et al. International survey of audiologists during the COVID-19 pandemic: use of and attitudes to telehealth. Int J Audiol. 2022;61(4):283–92. pmid:34369845