The Modulation Transfer Function for Speech Intelligibility

doi:10.1371/journal.pcbi.1000302

Figure 1.

Component spectrotemporal modulations make up the modulation spectrum.

(A) Spectrogram of a control condition sentence, “The radio was playing too loudly,” reveals the acoustic complexity of speech (Audio S1). All supporting sound files have been compressed as .mp3 files for the purpose of publication; original .wav files were used as stimuli. (B) Example spectrotemporal modulation patterns circled in the sentence (A) can be described as a time-varying weighted sum of component modulations. (C) The MPS shows the spectral and temporal modulation power in 100 sentences. The outer, middle, and inner black contour lines delineate the modulations contained in 95%, 90%, and 85% of the modulation power, respectively. Down-sweeps in frequency appear in the right quadrant, whereas upward drifts in frequency are in the left quadrant. Slower temporal changes lie near zero on the axis, while faster changes result in higher temporal modulations towards the left and right of the graph.

More »

Expand

Figure 2.

Spectral modulations differ in male and female speech.

(A,B) The MPS of the 50 corpus sentences spoken by males (A), and of the 50 spoken by females (B), with black contour lines as in Figure 1. White parenthetical labels on the y-axes of (A) and (B) show related frequencies demarcating the male and female vocal registers; they correspond to spectral modulations based on harmonic spacing. (C,D) Modulation filters that resulted in misidentification of the speaker's gender. (C) the speech MPS for female speakers is overlapped with the boundaries of the low-pass spectrotemporal filter. In this condition, speaker gender was misidentified in a quarter of the sentences, with 91% of those errors being females misidentified as male. (D) the same female speech MPS overlapped with a notch filter that removed modulations from 3 to 7 cycles/kHz. Of the 21% gender errors in this condition, 95% were female speakers misidentified as male.

More »

Expand

Figure 3.

Comprehension of low-pass modulation filtered sentences.

(A,B) Grayed areas of thumbnails show spectrotemporal modulations removed by low-pass modulation filtering in the spectral (A) or temporal (B) domain. Units and axis ranges are the same as in Figure 2. Each thumbnail represents a stimulus set analyzed in (C,D). (C,D) Mean±s.e. performance in transcribing words from the low-pass modulation filtered sentences. Cutoff frequencies on the x-axes of the two graphs are presented in units appropriate to the spectral or temporal domain, but could equally well be viewed on one continuous scale in either unit. Symbols show SNR levels. Dashed line shows control performance at +2 dB SNR; dotted line shows control performance at −3 dB SNR. Points at cutoff frequencies which share no capital letters in common (above line plots) are significantly different (repeated measures ANOVA, Bonferroni post-hoc correction, p<0.0008) at the +2 dB SNR condition. (E and G) Spectrograms of an example sentence (same as in Figure 1) with the most extreme spectral modulation filtering (with a low-pass cutoff of 0.5 cycles/kHz; Audio S2) and the spectral modulation filtering at which comprehension became significantly worse (4 cycles/kHz; Audio S3), respectively. LP = Low-pass. (F and H) Spectrograms of the example sentence with the most extreme temporal modulation filtering tested (having a low-pass cutoff of ∼3 Hz; Audio S4), and the temporal modulation filtering at which comprehension became significantly worse (cutoff 12 Hz; Audio S5).

More »

Expand

Figure 4.

Comprehension of speech with notch-filtered modulations or “core” modulations.

(A–C) The speech modulation spectrum with filtered modulations denoted by grayed areas as in Figure 5A. (A) Spectral notch modulation filters. (B) Temporal notch modulation filters. (C) Core modulations most essential to comprehension in Figure 5 are depicted in full and zoomed-in thumbnail plots. Stimuli for the core condition were obtained by low-pass filtering in both the spectral and temporal modulation domains. (D,E) Mean±s.e. comprehension when either spectral (D) or temporal (E) modulation filters were applied to the sentences, along with control sentences (lighter gray bars) containing all or only core modulations (C). Stimulus conditions which share no lower case letters (above plots) in common are significantly different, as in Figure 5 (repeated measures ANOVA). (F) Spectrogram of the example sentence after spectral modulations between 3 and 7 cycles/kHz were filtered out (Audio S6). (G) Spectrogram of the example sentence containing only the core of essential modulations below 7.75 Hz and 3.75 cycles/kHz (Audio S7).

More »

Expand

Figure 5.

Comprehension of “core” modulations in speech with notch filtering.

(A,B) Notch filters in the spectral (A) or temporal (B) modulation domain removed modulations from sentences that contained only core modulations after having been low-pass filtered in both domains. As depicted in Figure 4C, x- and y-axes are 0 to ±7.75 Hz and 0 to 3.75 cycles/kHz, respectively. (C,D) Comprehension when spectral (C) or temporal (D) notch filters were applied to sentences containing only core modulations. See Figure 6C for a thumbnail of the core modulations. As in Figures 5 and 6, conditions which share no lower-case labels in common are significantly different (repeated measures ANOVA).

More »

Expand

Figure 6.

Combination of results from low-pass and notch modulation filtering.

(A,B) To combine the spectral and temporal results from low-pass (A) and notch (B) modulation filtering, we calculated the average percent error in word comprehension, and divided by the average control comprehension. Then we multiplied the normalized errors from the spectral and temporal notch filters. Black lines indicate the contours of modulation power, as in Figure 1. Red areas are more important for speech comprehension than blue. The summary plot of the low-pass spectral and temporal modulation filters used the additional error caused by each subsequent lowering of the cutoff. Notch and low-pass experiments had somewhat different results in the spectral domain. The notch filtering implicated modulations closer to the origin, but still intermediate in temporal modulation, as most crucial. This discrepancy suggests a non-linearity in the relative contribution of modulations: the removal of intermediate spectral modulations matters more when higher spectral modulations are missing as well. (Dropping the low-pass cutoff spectral frequency from 4 to 2 cycles/kHz significantly reduced performance, but the 1–3 and 3–7 cycles/kHz notch filters straddling that range produced no significant difference.) (C) Schematic of modulations underlying comprehension and gender identification. The summary cartoon shows a region of low spectral and intermediate temporal modulations is of the greatest importance for speech intelligibility (red), while a separate band of higher spectral modulations (blue) make a speaker sound female. Yellow outlines the modulations that did not significantly contribute to sentence comprehension in any experiment. (D) Sentence modulation transfer function. When compression design, speech recognition by machines, and cochlear implant applications impose constraints on the bandwidth of a speech signal, modulation filtering could reduce a speech signal to only the modulation components needed for comprehension (red area). Depending on the bandwidth permitted, increasingly more of the orange and then yellow areas of the modulation spectrum could be included to add to the perception of vocal source characteristics.

More »

Expand