No interaction between fundamental-frequency differences and spectral region when perceiving speech in a speech background

Differences in fundamental frequency (F0) or pitch between competing voices facilitate our ability to segregate a target voice from interferers, thereby enhancing speech intelligibility. Although lower-numbered harmonics elicit a stronger and more accurate pitch sensation than higher-numbered harmonics, it is unclear whether the stronger pitch leads to an increased benefit of pitch differences when segregating competing talkers. To answer this question, sentence recognition was tested in young normal-hearing listeners in the presence of a single competing talker. The stimuli were presented in a broadband condition or were highpass or lowpass filtered to manipulate the pitch accuracy of the voicing, while maintaining roughly equal speech intelligibility in the highpass and lowpass regions. Performance was measured with average F0 differences (ΔF0) between the target and single-talker masker of 0, 2, and 4 semitones. Pitch discrimination abilities were also measured to confirm that the lowpass-filtered stimuli elicited greater pitch accuracy than the highpass-filtered stimuli. No interaction was found between filter type and ΔF0 in the sentence recognition task, suggesting little or no effect of harmonic rank or pitch accuracy on the ability to use F0 to segregate natural voices, even when the average ΔF0 is relatively small. The results suggest that listeners are able to obtain some benefit of pitch differences between competing voices, even when pitch salience and accuracy is low.


Introduction
Pitch differences between competing voices facilitate our ability to segregate speech from a background of other speakers [1][2][3][4]. It is, for example, easier to understand a female speaker in the presence of a male speaker than in the presence of another female speaker [5]. Therefore, it seems plausible that the ability to make use of pitch differences would improve with increasing fundamental frequency (F0) coding (or pitch) accuracy. Sounds whose pitch is perceived less accurately might be perceived as more similar to each other and might therefore be more likely to fuse and harder to segregate. However, the importance of pitch accuracy for understanding speech in a speech background has not been clearly demonstrated in previous studies. The aim of the current study was to test this predicted relationship between pitch accuracy and speech segregation ability.
The accuracy with which we are able to discriminate the pitch of a harmonic complex tone depends on the F0 and the harmonic numbers present. For F0s in the average range of speech (100-200 Hz), pitch discrimination is best (implying accurate F0 coding) when harmonics below about the 10 th are present [6][7][8][9][10]. When these lower-numbered harmonics are present, pitch discrimination is also independent of the phase relationships between the harmonics, suggesting that these harmonics are spectrally resolved to some extent. In contrast, when only harmonics above the 10 th are present in this range of F0s, pitch discrimination is poorer and is affected by the phase relationships between harmonics, suggesting that interactions occur between these spectrally unresolved harmonics [6][7][8][9][10].
Psychoacoustic studies of sound segregation have often been carried out with interleaved sequences of tones. Some of these studies have investigated segregation based on differences in pitch accuracy and have varied the accuracy by systematically varying whether resolved or only unresolved harmonics are present. Previous studies have found that stream segregation can occur with alternating sequences of tones, even if the tones consist only of unresolved harmonics [11][12][13][14]. However, the question of whether streaming is greater with resolved than unresolved harmonics has received mixed answers. In cases where the listeners' task was to segregate the streams, some studies have shown little difference in streaming between conditions containing resolved or only unresolved harmonics [11,15], whereas another study using a similar approach found significantly greater stream segregation when resolved harmonics were present than when only unresolved harmonics were present [12]. However, in situations where the task was either neutral or encouraged listeners to integrate the sequences into a single stream, the results have been consistent across studies in showing greater segregation for complex tones containing resolved harmonics than for tones containing only unresolved harmonics [13,14]. These findings support the idea that pitch accuracy can affect our ability to segregate sounds.
Less is known about the role of low-numbered harmonics in the context of segregating competing speech. Bird and Darwin [2] showed that lower harmonics dominate performance in a speech-segregation task based on F0 differences, but they did not test any conditions containing only high-numbered harmonics. Oxenham and Simonson [16] explored the effect of harmonic rank on speech intelligibility by comparing conditions where the target and singletalker masker had been lowpass (LP) or highpass (HP) filtered to either retain (LP-filtered) or remove (HP-filtered) the spectrally resolved components from the target and masker [16]. The LP and HP cutoff frequencies were selected to produce roughly equal performance in noise for both conditions. Surprisingly, performance in the LP and HP conditions improved by similar amounts when the noise masker was replaced by a single-talker masker with a different average F0, suggesting no clear benefit of having resolved harmonic components in the speech. However, that study only used relatively large values of average ΔF0 that according to recent F0 estimates were approximately 4 and 8 semitones (ST). Moreover, this study did not parametrically vary the ΔF0 between the target and masker. It may be that pitch accuracy is only relevant for more challenging conditions, i.e. for conditions with smaller average values of ΔF0. Thus, it remains unclear whether the effect of ΔF0 on performance is affected by the presence or absence of low-numbered, spectrally resolved harmonics. The aim of the present study was to determine whether there is an effect of spectral region, and hence pitch coding accuracy, on the ability of listeners to use average F0 differences between a target and an interfering talker to understand natural speech.

General methods
The experiments were conducted in a double-walled acoustically shielded booth. The stimuli were generated in MATLAB (The Mathworks, Natick, MA, USA) and were presented at a sampling rate of 48 kHz via a Fireface UCX sound card (RME, Haimhausen, Germany) and HD650 headphones (Sennheiser, Wedemark, Germany). The experimental protocols were approved by the Scientific Ethical Committees of the Capital Region of Denmark (H-16036391). All participants provided written informed consent before the experiment.

Speech experiment
Eighteen native speakers of Danish (9 female, 9 male) between 20 and 28 years of age (mean = 23.7, SD = 2.4) were tested. All participants had audiometric thresholds at octave frequencies between 250 and 8000 Hz no greater than 20 dB hearing level (HL). Speech intelligibility was tested for target sentences in the presence of a single-talker masker under HP, LP, and unfiltered (broadband) conditions. The target in each trial was a sentence from the conversational language understanding evaluation (CLUE) speech corpus [17]. The masker was created from the concatenated speech of one talker from recordings of conversations between two talkers [18], after all gaps exceeding 100 ms, non-Danish words, loud exclamations, and other sounds such as laughter were removed. The remaining 222.3 s of masker was then divided into 180 overlapping segments of 2.47 s each. The maskers for the training (30 segments) were taken from a different talker from another recorded conversation. The maskers started 500 ms before the target, ended at least 100 ms after the target, and were gated with 50-ms raised-cosine onset and offset ramps. One CLUE sentence (always the same) was presented in quiet immediately before each trial as a guide to help listeners focus on the target voice. All talkers were male. The original long-term average F0 was approximately 110 Hz for the target, 139 Hz for the masker used for testing, and 148 Hz for the masker used for training. The F0s of the maskers were manipulated in Praat [19] to obtain differences between the long-term average F0 of the target and masker (ΔF0) of 1, 3, and 5 ST for the training and 0, 2, and 4 ST for the main test. Different ΔF0s were used for training and testing to familiarize the participants with the experimental design without training the specific conditions tested. The range of tested average ΔF0s was relatively small because it was hypothesized that accurate pitch coding would be most important at small values of ΔF0. Furthermore, larger ΔF0s have been previously shown to produce similar results for both LP-and HP-filtered speech [16]. The average long-term F0 of the masker was always the same as or higher than that of the target. The speech maskers were filtered to have the same long-term spectrum as the CLUE sentences. The inherent F0 variations of both the masker and target were maintained to provide a more natural percept [20].
For the LP and HP conditions, the target, masker, and guide sentence were lowpass or highpass filtered with a 4 th -order Butterworth filter after being combined. The conditions with ΔF0 of 4 ST were used as a reference, since it is the condition that is closest to the ones tested by Oxenham and Simonson [16]. Cutoff frequencies of 800 and 1500 Hz were chosen for the LP and HP conditions, respectively, based on pilot experiments that indicated they would yield similar performance. A target-to-masker ratio (TMR) of 0 dB was used for the LP and HP conditions and a TMR of -15 dB was used for the broadband conditions to obtain similar performance for ΔF0 = 4 ST. Examples of the stimuli from the HP and LP conditions are shown in Fig 1. A Gaussian noise, spectrally shaped to have the same long-term spectrum as the CLUE sentences (before filtering), was filtered with a 4 th -order Butterworth filter and added to the filtered speech stimuli. For the LP condition, the noise was highpass filtered with a cutoff frequency of 800 Hz and for the HP condition, the noise was lowpass filtered with a cutoff frequency of 1500 Hz. The root mean squared (RMS) level of the noise before filtering was 12 dB lower than the unfiltered target speech. The target and maskers combined were presented at 70 dB sound pressure level (SPL). The purpose of this off-band noise was to limit the use of speech information in the slopes of the filters and to mask distortion products that might occur in the HP-filtered condition.
In the main test, each of the nine conditions (three filter conditions by three ΔF0s) was tested with two lists, each containing 10 sentences. The presentation order of the conditions was randomized within each of two consecutive blocks, each containing all of the nine conditions. The training consisted of three runs presented in the following order: 1) Broadband with a ΔF0 of 5 ST, presented at a TMR of -12 dB; 2) HP with a ΔF0 of 3 ST, presented at a TMR of 3 dB; 3) LP with a ΔF0 of 1 ST, presented at a TMR of 0 dB. The participants were instructed to listen for the voice of the guide sentence and were asked to type exactly what they heard that voice say, after each trial. No feedback was provided.

F0 discrimination
The purpose of the F0 discrimination experiment was to confirm that F0 discrimination is poorer, and more dependent on component phase (indicating spectrally unresolved harmonics), for the HP stimuli than the LP stimuli. Six participants (5 female and 1 male, aged between 21 and 26 years) were tested, all of whom had also participated in the speech experiment. The F0 difference limens (F0DLs) were measured with a two-interval, two-alternative forced-choice method, where each interval contained four tones of 200-ms duration, including 20-ms ramps, presented contiguously, as in earlier studies [3,20]. In the reference interval, all tones had the same F0, selected at random in each interval from a uniform 2-ST range centered on 131 Hz (corresponding to one standard deviation above the long-term average F0 of the target speech). In the target interval, the reference F0 was again selected at random from the same 2-ST range, with the actual F0 of the first and third tone being higher and the F0 of the second and fourth tone being lower than the reference F0. The difference in F0 between the high and low tones, ΔF0, was varied adaptively, while the F0s of the tones remained geometrically centered on the reference F0. The two intervals were separated by a 400-ms gap. Participants were asked to indicate which interval contained the variations in pitch. Feedback was provided after each trial. Each tone contained all harmonics up to 10 kHz and was filtered to have the same spectral envelope as the long-term spectrum of the target sentences. The harmonic components were added in either sine or random phase. For the random-phase conditions, the phase was chosen randomly and independently for each component in each tone from a uniform distribution from 0 to 2π.
As in the speech experiment, the tones were either lowpass or highpass filtered with a fourth-order Butterworth filter using cutoff frequencies of 800 and 1500 Hz, respectively. Moreover, as in the speech experiment, the same HP speech-shaped noise with a cutoff frequency of 800 Hz was added in the LP condition and the LP speech-shaped noise with a cutoff frequency of 1500 Hz was added in the HP condition, again to limit the audibility of distortion products. The average overall level of the tones was 70 dB SPL but the level was roved independently for each complex tone over a uniform range of 6 dB. The noise was presented at a level 12 dB below the nominal level of the tones before filtering.
Thresholds were estimated using a 3-down 1-up adaptive tracking procedure. The initial value of ΔF0 was 19.95% and it was initially decreased or increased by a factor of 2. After the second upper reversal the ΔF0 was changed by a factor of 1.26; after two more reversals, the ΔF0 was changed by a factor or 1.12 for the final six reversals. For each run, the threshold was calculated as the geometric mean of ΔF0 across the last six reversals. The experiment contained three blocks, with one run for each condition, and the order of the conditions was randomized within a block. The first run was used for training and the final thresholds were defined as the geometric mean of the two last runs.

Speech perception
Speech intelligibility was defined as the proportion of words reported correctly in each condition. All deviations except obvious misspellings or homophones were considered incorrect. Additional words and differences in word order were not penalized. Results from the broadband condition are shown in the left panel of Fig 2, and results from the LP and HP conditions are shown in the right panel. Considering first only the results from the 4-ST separations in each condition (the reference conditions), it can be seen that the mean scores were very similar. Indeed, a repeated-measures ANOVA using just the data from the 4-ST separations in the broadband, LP, and HP conditions showed no significant effect of filter condition [F(2,34) = 0.47, p = 0.63, Z 2 G = 0.0081], where Z 2 G is the generalized eta squared [22]. This finding confirmed that the cutoff frequencies chosen for the HP and LP conditions and the TMRs chosen for the filtered and broadband conditions yielded similar performance in the three reference conditions. The speech scores from the broadband conditions were analyzed separately, since the filtered conditions were measured at a higher TMR than the broadband conditions. For the broadband condition, there was a tendency for the scores to improve with increasing ΔF0.

F0 discrimination
Mean and individual F0DLs are shown in Fig 3. As expected, F0DLs were higher in the HP conditions than in the LP conditions, and phase affected F0DLs in the HP, but not in the LP, conditions. Results from a repeated-measures ANOVA on the log-transformed numbered, spectrally unresolved, harmonics being present when the stimuli were highpass filtered with a cutoff frequency of 1500 Hz, as in the speech experiment. The results confirm that the F0 of the speech should have been less accurately represented under HP conditions than under LP conditions, with F0DLs higher by an order of magnitude.

Discussion
In the speech experiment, the improvement in speech scores with increasing ΔF0 is consistent with results from earlier studies [1][2][3]23]. The lack of a main effect of filter type when considering the reference conditions confirms that we were successful at selecting filter cutoff frequencies that produced roughly equal performance in the LP and HP conditions when ΔF0 = 4 ST. Due to this planned equalization of overall performance, it is more relevant to compare the interaction between the filtering condition and ΔF0, or the relative slopes of the functions, rather than the absolute scores. There is a slightly steeper slope for the LP than the HP condition with increasing ΔF0, consistent with a stronger effect of ΔF0 in conditions with the more accurate pitch. However, this difference in slopes did not reach significance, as indicated by the lack of an interaction between filter type and ΔF0. The number of participants was reasonably large (N = 18) and estimated effect size of the interaction was so small (Z 2 G = 0.0074) that the lack of an effect does not seem likely to be due to a lack of statistical power. To determine whether there were substantial individual differences in the influence of LP or HP filtering on the effect of ΔF0, the difference between the scores obtained in the LP and HP conditions are plotted for each participant in separate panels in Fig 4. Here, a positive slope indicates a larger effect of the F0 difference in the LP than in the HP conditions. About half of the participants (Fig 4; P1-P9  HP conditions (positive slope), but the difference was generally small and not always monotonic. Furthermore, there was considerably larger effect of ΔF0 for the HP-filtered than for the LP-filtered conditions (negative slope) for three participants (Fig 4; P10-P12) and little difference between the effect of ΔF0 for the two filter conditions for six participants (Fig 4; P13-P18). This suggests that even if there is an interaction between filter condition and ΔF0 for some participants, it is not a large or robust effect. The lack of interaction is consistent with the results from Oxenham and Simonson [16], who found similar performance for HP-and LP-filtered conditions for both the natural and pitch-shifted single-talker maskers used in their study. However, the lack of effect remains somewhat puzzling, given the clear effect of filtering on F0DLs. Possible explanations include the different forms of speech information conveyed in the low and high spectral regions and different temporal properties [24][25][26][27]. For instance, increased temporal sparsity at high frequencies may result in more opportunities to "glimpse" the target in the high-frequency region, resulting in more influence of differences in F0, thereby counteracting the effect of reduced pitch accuracy at high frequencies. In addition, some studies suggest that the importance of different spectral regions for speech understanding may vary across participants; such variations may help explain the individual differences found in the present study [28,29]. Arguing against this interpretation is the fact that normal-hearing listeners and hearingimpaired listeners seem to make similar use of F0 differences between talkers when alternating speech sounds are presented to them [30], despite the fact that hearing-impaired listeners typically show poorer pitch discrimination abilities than normal-hearing listeners [31]. Another possible explanation is that, despite testing the smallest possible long-term average ΔF0 of 0 STs, the momentary differences in ΔF0, due to the F0 fluctuations, might have been too large for differences in pitch accuracy to affect speech intelligibility. However, such fluctuations are intrinsic of natural speech and it seems unlikely that smaller ΔF0s would occur in real-life situations. Thus, the results suggest that pitch accuracy within the range found in normal speech may not be a limiting factor for understanding speech in a background of other speech in real-life situations. Nevertheless, it still remains possible that even poorer pitch perception, such as that experienced by many cochlear-implant users, may explain their reduced masking release in the presence of competing talkers [32][33][34][35].
In summary, this study tested speech intelligibility in a background of a single-talker masker and found a small effect of ΔF0 but a similar relation between performance and ΔF0 in both LP and HP conditions. This outcome suggests that the differences in pitch accuracy between low-numbered and high-numbered harmonics is not a major factor in our ability to use F0 differences between competing talkers to better understand natural speech.