A Statistical Method for the Analysis of Speech Intelligibility Tests

Speech intelligibility tests are conducted on hearing-impaired people for the purpose of evaluating the performance of a hearing device under varying listening conditions and device settings or algorithms. The speech reception threshold (SRT) is typically defined as the signal-to-noise ratio (SNR) at which a subject scores 50% correct on a speech intelligibility test. An SRT is conventionally measured with an adaptive procedure, in which the SNR of successive sentences is adjusted based on the subject's scores on previous sentences. The SRT can be estimated as the mean of a subset of the SNR levels, or by fitting a psychometric function. A set of SRT results is typically analyzed with a repeated measures analysis of variance. We propose an alternative approach for analysis, a zero-and-one inflated beta regression model, in which an observation is a single sentence score rather than an SRT. A parametrization of the model is defined that allows efficient maximum likelihood estimation of the parameters. Fitted values from this model, when plotted against SNR, are analogous to a mean psychometric function in the traditional approach. Confidence intervals for the fitted value curves are obtained by parametric bootstrap. The proposed approach was applied retrospectively to data from two studies that assessed the speech perception of cochlear implant recipients using different sound processing algorithms under different listening conditions. The proposed approach yielded mean SRTs for each condition that were consistent with the traditional approach, but were more informative. It provided the mean psychometric curve of each condition, revealing differences in slope, i.e. differential performance at different parts of the SNR spectrum. Another advantage of the new method of analysis is that results are stated in terms of differences in percent correct scores, which is more interpretable than results from the traditional analysis.


Introduction
Measuring speech intelligibility in noise is an important endeavor in the clinical management of hearing loss. It can be used to assess the benefit a person receives from a hearing aid or cochlear implant, and to track their performance over time. It is also used in the research and development of hearing devices, to compare the effectiveness of alternative sound processing algorithms.
A common approach is to play a pre-recorded sentence, mixed with noise, to the subject, who attempts to repeat it aloud. The clinician then records a score for the sentence based on the number of words that the subject repeated correctly. Alternatively, scores can be based on only the key words in the sentence, or on morphemes, the smallest meaningful linguistic units (for example, in the sentence "He hits the ball", the word "hits" contains two morphemes, "hit" and "s"). A subject is typically tested with a list of 10 to 32 sentences taken from a corpus of sentences compiled for this purpose [1] [2] [3] [4].
Speech in noise tests are sometimes performed at a fixed signal-to-noise ratio (SNR), to give a percent-correct measure of intelligibility. If the test is repeated at different SNRs, and the scores are plotted as a function of SNR, the resulting curve is known as a psychometric function; it is typically S-shaped, for example the logistic function [5]. When designing a study of the effects of a sound processing algorithm, the differences in performance between subjects can be so large that testing all subjects at the same SNR would be prone to floor or ceiling effects. An alternative is to measure the speech reception threshold (SRT) of each subject, which is typically defined as the SNR at which the subject scores 50% correct. The SRT is conventionally estimated by an adaptive procedure, in which the SNR of each sentence is adjusted based on the subject's previous responses. Adaptive threshold estimation methods were initially developed for experiments in which the subject provides a binomial response on each trial [6]; for example, identifying the correct interval in an N-alternative forced-choice test. These methods can readily be applied to sentence tests: if the subject correctly identifies more than half of the words in the sentence, then the SNR is reduced (making the next sentence more difficult), and conversely if the subject correctly identifies fewer than half of the words, then the SNR is increased (making the next sentence easier). With an appropriate adaptive rule, the SNR should converge to the SRT [3]. Two example adaptive tracks are shown in Fig 1. The SRT can be estimated as the mean of the SNR levels, excluding some initial sentences [1] [3]; or by the mean of the SNR levels at the turns, where a turn (or reversal) is defined as a trial in which the adaptive rule changed direction [6]; or by fitting a psychometric function [7] [4].
A set of SRT estimates is usually analyzed using a simple statistical method such as a t-test or repeated measures analysis of variance (ANOVA).
Although Dawson et al. [4] found that fitting a psychometric function provided the best SRT test-retest reliability, there are some limitations in this approach. Occasionally, a subject's average scores are not a monotonically increasing function of SNR; an example is shown in Fig 1. This could be due to random fluctuation, or a lapse in the subject's concentration, or a run of more difficult sentences (despite efforts to equalise sentence difficulty [4]). Such cases can produce a poor fit. Furthermore, the fitting method assumed a binomial distribution [5], but the assumption that a sentence containing K words consists of K independent Bernoulli trials is violated, because recognition of one word is not independent of the other words. Sentences representative of everyday conversation have contextual cues, meaning that if the subject recognises the first few words, then they are more likely to recognise the remaining words. At the SRT, although the average word score is 50%, it is relatively uncommon to score near 50% for any particular sentence; instead some sentences receive scores near 100%, and a roughly equal number of sentences receive scores near 0%. A histogram of the sentence scores for Study One (described in the next section), with large spikes at values 0% and 100%, is shown in Fig 2. Regardless of the method used to calculate the SRT, summarizing an entire adaptive track by a single number suffers from a loss of information. Applying repeated measures ANOVA to a set of SRT estimates implicitly assumes that the psychometric functions for the different conditions are of similar shapes, differing only in the SRT values, and differences in the slope of the psychometric function or its asymptotic value are ignored. We propose an alternative approach, a zero-and-one inflated beta regression model, in which an observation is a single sentence score rather than an SRT. This model makes fewer assumptions about the data and provides more valuable information.

Fig 1. Adaptive tracks for the data in Table 1. The top panel of each part shows the adaptive track, with sentence number running down the page; each sentence is represented by a circle, with its horizontal location indicating the SNR, and its gray-scale fill indicating the score, with 100% correct as white, and 0% as black. The green vertical line shows the SRT estimate obtained by averaging the SNRs of the final 16 sentences. The bottom panel of each part shows the mean percent correct score at each SNR, with the size of each square proportional to the number of sentences that were presented at that SNR, and a confidence interval calculated according to the binomial distribution. It also shows the fitted psychometric curve, and the blue vertical line indicates the corresponding SRT estimate [4].

Materials and Methods
The speech perception data sets
The two studies described below were approved by the Human Research Ethics Committee of the Royal Victorian Eye & Ear Hospital, Melbourne, and each subject provided written informed consent.
The new statistical method was applied to data from two studies involving Nucleus cochlear implant recipients. The two studies shared a number of characteristics. Both studies used a repeated-measures design, in which each subject served as their own control, and the aim was to compare performance with different sound processing algorithms under one or more listening conditions. The two studies administered an adaptive SRT test, using the Australian Sentence Test in Noise (AuSTIN) [4]. The target speech was presented at 65 dB SPL, and the level of the interfering noise was adjusted based on the subject's responses. Morpheme scoring was used. Each adaptive track used a list of 20 sentences, and the SRT was calculated as the mean of the SNRs of the final 16 sentences.

Study One
The first study compared the speech recognition of seven bilateral cochlear implant recipients in the presence of an interfering talker. The first factor of interest was the sound processing algorithm ("algorithm"). The details of the sound processing algorithms are not relevant to the statistical analysis of the results, so the three algorithms are simply labelled "A", "B" and "C", and the question was whether the three algorithms yielded differences in the subjects' performance. The second factor was the direction of the interfering talker ("noise direction"), which was either from the front ("F") or from both sides ("S"). As the target speech was presented from the front, it was hypothesized that performance would be better for side interferers, as subjects could potentially use the difference in spatial location to segregate the two voices. The third factor was the gender of the interfering talker ("noise gender"). As the target voice was female, it was hypothesized that performance would be better for a male interferer, as subjects could potentially use the difference in voice pitch to segregate the two voices. For the noise direction and noise gender factors, interaction with the algorithm factor would indicate that the sound processing algorithms differed in their effectiveness in conveying spatial or pitch cues. The SRT for each subject was measured four times (i.e. four adaptive tracks, totalling 80 sentences), for each of the 12 conditions (3 algorithms × 2 noise directions × 2 noise genders). The adaptive rule used a 4 dB step size for the initial four sentences and a 2 dB step size for the remaining sentences.
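The adaptive rule and the SRT calculation described above can be sketched in a few lines. The following Python is a minimal illustration and not the software used in the study: the function names, the starting SNR and the example scores are hypothetical, and two details that the text leaves open are filled in by assumption (a score of exactly 50% raises the SNR, and the 4 dB step applies to the adjustment that follows each of the first four sentences).

```python
from statistics import mean


def next_snr(snr_db, score, sentence_index):
    """Return the SNR (dB) for the next sentence.

    score is the proportion correct (0..1) on the sentence just presented;
    sentence_index is its 1-based position in the track.
    """
    # ASSUMPTION: the 4 dB step is used after each of the first four sentences,
    # and the 2 dB step after every later sentence.
    step = 4.0 if sentence_index <= 4 else 2.0
    # More than half correct: lower the SNR (harder). Otherwise raise it (easier).
    # ASSUMPTION: a score of exactly 0.5 is treated as "raise the SNR".
    return snr_db - step if score > 0.5 else snr_db + step


def run_track(start_snr_db, scores):
    """Return the SNR at which each sentence in the track was presented."""
    snrs = [start_snr_db]
    # The score on the last sentence does not generate a further presentation.
    for index, score in enumerate(scores[:-1], start=1):
        snrs.append(next_snr(snrs[-1], score, index))
    return snrs


def srt_estimate(snrs, n_final=16):
    """SRT estimate: mean of the SNRs of the final n_final sentences."""
    return mean(snrs[-n_final:])
```

For a 20-sentence list, `srt_estimate(run_track(start, scores))` averages the SNRs of sentences 5 to 20, matching the rule stated for both studies.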

Study Two
The second study [8] compared speech intelligibility as a function of a single factor, the sound processing algorithm ("algorithm"), consisting of a standard algorithm ("Beam") and five variants of a spatial noise reduction algorithm, labeled "SpS0", "SpZ-3", "SpZ0", "SpZ+3", "SpZ+6" (again, the details of the sound processing algorithms are not relevant here). Twelve subjects participated. The target speech was presented from the front, while the noise consisted of four interfering talkers, each presented from a separate loudspeaker in the rear half-circle, with locations that changed from sentence to sentence. The SRT for each subject was measured twice (i.e. two adaptive tracks, totalling 40 sentences), for each of the six algorithms. The adaptive rule was the same as in the first study, with the exception that the SNR for the fifth sentence was equal to the average of the SNRs of the initial four sentences and the SNR at which the fifth sentence would have been presented in response to the score of the fourth sentence [3]. The primary hypothesis was that the spatial noise reduction algorithms would give better performance than Beam, with a secondary goal to determine which variant of spatial noise reduction gave the best performance.

Traditional Approach
The traditional approach used SRT as the response variable in ANOVA models. A single observation was therefore the SRT calculated over a track of 20 sentences. In both studies, every subject was evaluated across all factors, and repeated measures ANOVA was applied. The underlying assumption of sphericity was assessed by Mauchly's test of sphericity. When the sphericity assumption was violated, the Greenhouse-Geisser adjustment was applied to the degrees of freedom. The significance level was set at 0.05. If a main effect was found to be significant, multiple comparisons were subsequently performed to identify which levels differed.

The Statistical Model of The New Approach
In this approach, a single observation is a sentence. The response variable y in the statistical model is the proportion of morphemes correctly identified, i.e. y = r/N, where N is the number of morphemes in a sentence and r is the number of morphemes correctly identified. A binomial model may seem an obvious choice for y, but clearly the independence assumption of the binomial model is not met, due to context effects. As y is a proportion, a promising probability model is the beta distribution:

$$f_1(y; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, y^{\alpha-1} (1-y)^{\beta-1}, \qquad 0 < y < 1,\ \alpha > 0,\ \beta > 0, \tag{1}$$

which has mean E(y) = α/(α+β). It is advantageous, in regression modeling, for the response distribution to be expressed in a parametrization in which the mean is a parameter. We therefore base our modeling on the following alternative parametrization of the beta distribution [9]:

$$f_2(y; \mu, \sigma) = f_1\bigl(y;\ \alpha(\mu,\sigma),\ \beta(\mu,\sigma)\bigr), \qquad 0 < y < 1,\ 0 < \mu < 1,\ 0 < \sigma < 1, \tag{2}$$

which has the advantage that E(y) = μ. We have Var(y) = σ²μ(1−μ), and the parameters μ and σ are connected with the original parameters α and β in Eq (1) by the relations α = μ(1−σ²)/σ² and β = (1−μ)(1−σ²)/σ². A feature of the beta distribution is that the endpoints y = 0 and y = 1 are inadmissible. If the data had small frequencies at either endpoint, with most observations lying in the interior of (0, 1), this could be accommodated by scaling y to lie in the interior of (0, 1). However, in our data we observe high frequencies at zero (no morphemes recognized) and one (all morphemes correctly identified). This feature is accommodated by the zero-and-one-inflated beta distribution [10], which has parameters p₀ and p₁ for probability spikes at zero and one, respectively. We write this as a mixed discrete-continuous probability function:

$$f_3(y; \mu, \sigma, p_0, p_1) = \begin{cases} p_0 & y = 0 \\ (1 - p_0 - p_1)\, f_2(y; \mu, \sigma) & y \in (0, 1) \\ p_1 & y = 1 \end{cases} \tag{3}$$

which has overall mean E(y) = (1−p₀−p₁)μ + p₁. However, we need to reparametrize once more, because in estimating the parameters of the probability function in Eq (3), we have to respect the constraint 0 < p₀ + p₁ < 1, which is awkward to achieve numerically. In addition, the parameter estimates p̂₀ and p̂₁ are negatively correlated, which is not a good property. We use instead

$$f_4(y; \mu, \sigma, \nu, \tau) = \begin{cases} \dfrac{\nu}{1+\nu+\tau} & y = 0 \\[1ex] \dfrac{f_2(y; \mu, \sigma)}{1+\nu+\tau} & y \in (0, 1) \\[1ex] \dfrac{\tau}{1+\nu+\tau} & y = 1 \end{cases} \tag{4}$$

where ν > 0 and τ > 0. The probability masses at 0 and 1 are associated with the two shape parameters ν and τ through the relations ν = p₀/(1−p₀−p₁) and τ = p₁/(1−p₀−p₁). The four parameters μ, σ, ν and τ are modeled with covariates, as well as random effects to account for within-subject correlation:

$$\begin{aligned} \operatorname{logit}(\mu) &= x^{\top}\beta + u_1, \\ \operatorname{logit}(\sigma) &= z^{\top}\gamma + u_2, \\ \log(\nu) &= h^{\top}\lambda + u_3, \\ \log(\tau) &= k^{\top}\rho + u_4, \end{aligned}$$

where x, z, h and k are vectors of known covariates, which may be overlapping or distinct; β, γ, λ and ρ are corresponding coefficient vectors; and u_j ~ N(0, δ_j²), j = 1, ..., 4, is a random effect for subject. Logit links are used for the parameters constrained to (0, 1), i.e. μ and σ, and log links for those constrained to R⁺ (ν and τ), as is common practice in generalized linear modeling. Parameter estimation is achieved in the R package gamlss, in which up to four distribution parameters may be modeled simultaneously [11,12], using maximum (penalized) likelihood estimation. Model selection was based on the Generalized Akaike Information Criterion (GAIC).
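To make the (μ, σ, ν, τ) parametrization concrete, the following Python sketch evaluates the log of the zero-and-one inflated beta probability function for a single sentence score, together with its overall mean. It is an illustration using only the standard library, not the gamlss implementation; the function names are hypothetical, and the formulas are those implied by the relations stated above (α = μ(1−σ²)/σ², β = (1−μ)(1−σ²)/σ², p₀ = ν/(1+ν+τ), p₁ = τ/(1+ν+τ)).

```python
import math


def beta_shapes(mu, sigma):
    """Convert (mu, sigma) to the usual beta shape parameters (alpha, beta)."""
    k = (1.0 - sigma**2) / sigma**2
    return mu * k, (1.0 - mu) * k


def beta_logpdf(y, mu, sigma):
    """Log density of the beta distribution in the (mu, sigma) parametrization."""
    a, b = beta_shapes(mu, sigma)
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y) - log_beta_fn


def beinf_logpdf(y, mu, sigma, nu, tau):
    """Log of the mixed discrete-continuous probability function."""
    log_total = math.log(1.0 + nu + tau)
    if y == 0.0:
        return math.log(nu) - log_total  # spike at zero
    if y == 1.0:
        return math.log(tau) - log_total  # spike at one
    return beta_logpdf(y, mu, sigma) - log_total  # interior of (0, 1)


def beinf_mean(mu, nu, tau):
    """Overall mean (1 - p0 - p1) * mu + p1, written in terms of nu and tau."""
    return (mu + tau) / (1.0 + nu + tau)
```

Summing `beinf_logpdf` over all sentence scores gives the log-likelihood that a fitting routine would maximize; the random effects and covariates of the full model are omitted here for brevity.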
Although parameters ν and τ determine the probability masses for proportion correct at zero and one, the probabilities p₀ and p₁ are not modeled with regression structures directly, and the effect of the covariates on these probabilities is difficult to interpret. For given covariate values h and k, fitted values for proportion correct equal to zero (p̂₀) and one (p̂₁) are derived algebraically, with random effects û_j assumed to be zero:

$$\hat{p}_0 = \frac{\hat{\nu}}{1+\hat{\nu}+\hat{\tau}}, \qquad \hat{p}_1 = \frac{\hat{\tau}}{1+\hat{\nu}+\hat{\tau}}, \qquad \text{where } \hat{\nu} = \exp(h^{\top}\hat{\lambda}),\ \hat{\tau} = \exp(k^{\top}\hat{\rho}).$$

To facilitate interpretation, fitted probabilities enhanced with confidence intervals are plotted against the covariates. The confidence intervals are based on the parametric bootstrap [13].
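The step from linear predictors to fitted values can be illustrated as follows. This Python sketch assumes the logit link for μ and the log links for ν and τ described above, with all random effects set to zero; the function names and the idea of passing the three linear predictors directly are illustrative choices, not part of the original analysis.

```python
import math


def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))


def fitted_values(eta_mu, eta_nu, eta_tau):
    """Return (p0, p1, overall mean) from the three linear predictors.

    eta_mu, eta_nu and eta_tau are the linear predictors for mu, nu and tau
    (covariates times coefficients), with random effects taken as zero.
    """
    mu = inv_logit(eta_mu)
    nu, tau = math.exp(eta_nu), math.exp(eta_tau)
    total = 1.0 + nu + tau
    p0, p1 = nu / total, tau / total
    overall_mean = (1.0 - p0 - p1) * mu + p1
    return p0, p1, overall_mean
```

Evaluating `fitted_values` over a grid of SNR values, with each linear predictor a function of SNR, traces out the fitted probability of scoring 0%, the fitted probability of scoring 100%, and the mean percent correct curve.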

Results

Study One
An excerpt of the data from the first study is shown in Table 1, and the full dataset is given in S1 Data. It shows the scores for two sentence lists of 20 sentences each, for the same subject and listening condition. Corresponding plots are shown in Fig 1. For the purpose of the traditional analysis, the 40 sentence scores in Table 1 are aggregated to two SRT estimates, given in Table 2.
Traditional Approach. A three-way repeated measures ANOVA was used and no violation of the sphericity assumption was found. With respect to main effects, algorithm and noise direction were not significant. The noise gender factor, however, was significant, with estimated marginal group means of SRT of 2.59 dB for male interferers, and 3.99 dB for female interferers. The interaction term of noise gender and algorithm was also significant, and the summary is given in Table 3.
Effects of algorithm were investigated by dividing the data into two subsets, with male and female interferers separately. With female interferers, the effect of algorithm was significant (F(2, 54) = 4.676, p = 0.013), and pairwise comparisons with Bonferroni adjustments suggested that algorithm A had significantly better (lower) SRT than algorithm B (p = 0.021), with estimated marginal mean difference of -0.918 dB. No other comparisons were significant. With male interferers, however, the effect of algorithm was not significant (F(2, 54) = 1.444, p = 0.245).
Effects of noise gender were also investigated by splitting the data by algorithm. The noise gender factor was significant for each algorithm (A: F(1, 27) = 5.019, p = 0.033; B: F(1, 27) = 29.096, p < 0.001; C: F(1, 27) = 28.331, p < 0.001), suggesting that SRT estimates were significantly better (lower) for male interferers than for female. No other terms were significant.
Proposed Approach. Parameter estimates for the modeling of μ, σ, ν and τ are given in Table 4. No covariates were significant for σ. For μ, ν and τ, SNR, noise gender and algorithm were all significant. In addition, noise gender-algorithm interaction was significant for μ and τ; SNR-algorithm interaction was significant for ν; and SNR-noise gender interaction was significant for τ.
The fitted overall means of percent correct (i.e. (1 − p̂₀ − p̂₁)μ̂ + p̂₁) for each algorithm are shown in Fig 3, separately for both noise genders, together with confidence intervals constructed by parametric bootstrap. The curves for algorithms B and C largely overlap at all SNRs, indicating little difference between those algorithms. However, the curve for algorithm A has a steeper slope than those for algorithms B and C; for female interferers, the three curves overlap at low SNRs, but start to separate at higher SNRs. This difference in slope between algorithms was not detected in the traditional approach. Similarly, Fig 4 presents the fitted value curves and confidence intervals, separately for each algorithm, showing the effect of noise gender. For all three algorithms, the male interferer provided significantly better speech intelligibility than the female, with the difference being larger for algorithms B and C.
The fitted zero-and-one inflated beta distribution is shown in Fig 5 for the subset of observations with algorithm = A, noise gender = Female, and SNR = 5 dB (n = 312).

Study Two
Traditional Approach. The dataset is given in S2 Data. Following Hersbach et al. [8], a one-way repeated measures ANOVA was used, in which marginal violation of the sphericity assumption was found (p = 0.048) and the Greenhouse-Geisser adjustment was applied to the degrees of freedom of the F-statistics involved. The algorithm factor was significant (F(3.457, 79.506) = 47.401, p < 0.001). More specifically, the estimated means of SRT were 0.171 dB, -2.583 dB, respectively. Pairwise comparisons with Bonferroni adjustments showed that Beam had significantly worse SRTs than all five variants of the spatial noise reduction algorithm (p < 0.001). In addition, SpS0 had significantly worse SRTs than SpZ0, SpZ+3 and SpZ+6 (p = 0.008, p < 0.001, p = 0.005 respectively). Pairwise comparisons among the four variants SpZ-3, SpZ0, SpZ+3 and SpZ+6 suggested no significant differences.
Proposed Approach. Parameter estimates for the modeling of μ, σ, ν and τ are given in Table 5. No covariates were significant for σ. Fig 6 shows the fitted overall means of percent correct, i.e. the mean psychometric functions for the six algorithms. All curves appear to have the same slope. A horizontal line at 50% correct intersects each psychometric function at an SNR equal to its SRT, illustrating the 4.6 dB SRT improvement of SpZ+3 over Beam, as found in the traditional approach.
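Reading an SRT off a fitted mean psychometric function amounts to finding the SNR at which the curve crosses 50% correct. The Python sketch below does this by bisection for any increasing curve. The two logistic curves are hypothetical stand-ins with made-up midpoints and slopes; they are not the fitted curves from Fig 6, and the function names are illustrative.

```python
import math


def crossing_snr(curve, target=0.5, lo=-20.0, hi=20.0, tol=1e-6):
    """Find by bisection the SNR (dB) at which an increasing curve equals target."""
    if not (curve(lo) < target < curve(hi)):
        raise ValueError("target is not bracketed by the search interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if curve(mid) < target:
            lo = mid  # crossing lies above mid
        else:
            hi = mid  # crossing lies at or below mid
    return 0.5 * (lo + hi)


def logistic_curve(midpoint_db, slope_per_db):
    """Hypothetical mean psychometric function (proportion correct vs SNR)."""
    return lambda snr: 1.0 / (1.0 + math.exp(-slope_per_db * (snr - midpoint_db)))


# Illustrative values only: two conditions with the same slope.
curve_a = logistic_curve(1.0, 0.4)
curve_b = logistic_curve(-3.0, 0.4)
srt_benefit_db = crossing_snr(curve_a) - crossing_snr(curve_b)
score_benefit_at_0_db = curve_b(0.0) - curve_a(0.0)
```

The same pair of curves yields both summaries: a horizontal cut at 50% gives the SRT difference in dB, while a vertical cut at a fixed SNR gives the difference in proportion correct.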
Confidence intervals for the mean psychometric functions, obtained by parametric bootstrap, are presented in Fig 7, demonstrating that Beam had significantly lower speech intelligibility scores than all five spatial noise reduction algorithm variants. The curves for SpZ0, SpZ+3, and SpZ+6 overlap, suggesting that their performances can be viewed as indistinguishable. The curve for SpS0 is separated from the curves of SpZ0, SpZ+3 and SpZ+6, indicating a significant difference.
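The idea behind a parametric bootstrap interval can be sketched for the simplest case of a single condition at a single SNR. The Python below simulates sentence scores from a zero-and-one inflated beta distribution with given parameter values and forms a percentile interval for the mean score. It is a deliberately simplified illustration: in the analysis described here each bootstrap replicate would be refitted with the full regression model, whereas this sketch replaces the refit with a plain sample mean, and all names and parameter values are hypothetical.

```python
import random


def simulate_beinf(n, mu, sigma, nu, tau, rng):
    """Draw n scores from the zero-and-one inflated beta distribution."""
    k = (1.0 - sigma**2) / sigma**2
    a, b = mu * k, (1.0 - mu) * k
    total = 1.0 + nu + tau
    p0, p1 = nu / total, tau / total
    scores = []
    for _ in range(n):
        u = rng.random()
        if u < p0:
            scores.append(0.0)  # spike at zero
        elif u < p0 + p1:
            scores.append(1.0)  # spike at one
        else:
            scores.append(rng.betavariate(a, b))  # interior score
    return scores


def bootstrap_mean_ci(n, mu, sigma, nu, tau, n_boot=2000, level=0.95, seed=1):
    """Percentile interval for the mean score under the given parameter values.

    SIMPLIFICATION: the statistic recomputed on each replicate is the sample
    mean, standing in for a refit of the regression model.
    """
    rng = random.Random(seed)
    stats = sorted(
        sum(simulate_beinf(n, mu, sigma, nu, tau, rng)) / n for _ in range(n_boot)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_boot)
    hi_idx = int((1.0 - alpha) * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]
```

Repeating this at each SNR on a grid, and for each condition, would give pointwise bands of the kind plotted around the mean psychometric functions.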

Discussion and Conclusions
The traditional approach has two stages: firstly, an SRT estimate is computed for each adaptive track; and secondly, a linear model is applied to the set of SRT estimates. The first stage, which distils a set of sentence SNR levels and scores into a single number, the SRT estimate, discards much of the available information. For example, the within-subject, within-condition variability is measured by the spread of four SRT values in study one, and two SRT values in study two. The corresponding set of sentence scores is a potential source of information regarding this variability, but is ignored. A psychometric fit can provide an estimate of the slope of the psychometric function, with shallower slopes implying more variability in SRT estimates, but there is no obvious means of incorporating slopes into the traditional approach. In contrast, the proposed approach applies a generalized linear model to the entire set of sentence scores, utilizing all available information. It can provide the mean SRTs of each condition, as in the traditional approach, but is more informative as it also provides estimates of the entire mean psychometric function for each condition.
One limitation of a retrospective analysis of the data is that the adaptive rule used in these studies started with a high SNR, then adjusted the SNR towards the 50% correct point. This concentrates the observations near the 50% correct point, which is the most efficient placement for estimating the SRT [6] but yields relatively few observations at lower SNRs. This makes differences at the extremes of the SNR spectrum difficult to detect. If the goal of a study is to estimate the entire mean psychometric function, then a different adaptive rule should be used. One solution is to randomly interleave multiple adaptive tracks, each targeting different percent correct scores, e.g. 30% and 70% correct [6] [7]. This is readily handled by the proposed approach.
The most practical benefit of the proposed approach is that it allows the difference between two conditions to be expressed in terms of percent correct scores. For example, in study two, the traditional approach states that the best spatial noise reduction algorithm gave a 4.6 dB SRT benefit over Beam. However, terms such as decibels and SRTs are unfamiliar to most cochlear implant recipients. Instead, the proposed approach allows the result to be better understood: in a noisy situation, average scores improved from 25% correct with Beam to 62% with the best spatial noise reduction algorithm.

Supporting Information
S1 Data. Study One data. StudyOne.xlsx is the full data set (as in Table 1), StudyOne_srt.xlsx is the data set summarized as SRT (as in Table 2).