Abstract
Two rapidly alternating tones with different pitches may be perceived as one integrated stream when the pitch differences are small, or as two separated streams when the pitch differences are large. Likewise, timbre differences between two tones may also cause such sequential stream segregation. Moreover, the effects of pitch and timbre on stream segregation may cancel each other out, which is called a trade-off. However, how timbre differences caused by specific patterns of spectral shapes interact with pitch differences and affect stream segregation has been largely unexplored. Therefore, we used stripe tones, in which stripe-like spectral patterns of harmonic complex tones were realized by grouping harmonic components into several bands based on harmonic numbers and removing the components in every other band. Here, we show that 2- and 4-band stimuli elicited distinct stream segregation despite pitch proximity. By contrast, pitch separation dominated stream segregation for 16-band stimuli. The results for 8-band stimuli most clearly showed the trade-off between pitch and timbre in stream segregation. These results suggest that stimuli with a small number (four or fewer) of bands elicit strong stream segregation owing to sharp timbral contrasts between stripe-like spectral patterns, and that the auditory system appears to be limited in integrating blocks of frequency components dispersed over frequency and time.
Citation: Jhang G-Y, Ueda K, Takeichi H, Remijn GB, Hasuo E (2025) Rivalry between pitch and timbre in auditory stream segregation. PLoS One 20(6): e0323964. https://doi.org/10.1371/journal.pone.0323964
Editor: Si Chen, Hong Kong Polytechnic University, HONG KONG
Received: February 2, 2024; Accepted: April 17, 2025; Published: June 5, 2025
Copyright: © 2025 Jhang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper and its Supporting information files.
Funding: This research was supported by the Japan Society for the Promotion of Science (JSPS; https://www.jsps.go.jp) KAKENHI Grant No. JP19H00630 for Kazuo Ueda, and by the Japan-Taiwan Exchange Association (https://www.koryu.or.jp) with a scholarship for Geng-Yan Jhang under the supervision of Kazuo Ueda. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional external funding received for this study.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Auditory scene analysis [1] refers to the auditory function by which we perceptually organize an outer world filled with a mixture of sounds. Sometimes we can exploit binaural cues to separate several sounds, but these cues are not always available or practical. Even in such situations, normal-hearing listeners usually succeed in picking up a target from background sounds, such as a talker's voice heard over the phone on a noisy railway platform. Linguistic cues should contribute substantially to segregating speech from background noise in such a situation; nevertheless, the perceptual characteristics of the auditory system should also be involved in the segregation process. Furthermore, if both the target and background are nonspeech, like a warning signal in background noise, we must make use of other cues, such as pitch and timbre. The auditory system can organize and segregate a sound mixture into two or more streams based on the pitch and timbre differences of the sounds. The current investigation focuses on how pitch and timbre interact in auditory stream segregation [2].
Miller and Heise [3] reported that two rapidly alternating pure tones (i.e., tones having only one frequency component and a sinusoidal waveform) with some slight frequency differences are heard as a trill, a continuously changing pitch pattern. However, the two tones with a wide frequency separation are perceived as two separate and unrelated melodies. Since then, sequential stream segregation has been a widely studied research topic [4–22]. In particular, low-high-low (LHL_) and high-low-high (HLH_) tone patterns, i.e., so-called triplets (Fig 1), are the standard research tools to investigate sequential stream segregation caused by pitch differences. Here, “L” signifies a “low-pitch tone,” “H” a “high-pitch tone,” and “_” a “silent gap”. In this connection, we define a “(fundamental) frequency separation” as a “(fundamental) frequency difference between a low-pitch tone and a high-pitch tone, expressed in semitone units” in this article.
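The semitone unit relates two frequencies by a ratio of 2^(1/12) per semitone, so a separation of n semitones corresponds to a frequency ratio of 2^(n/12). A minimal sketch of the conversion (the function names are illustrative, not taken from the authors' software):

```python
import math

def semitone_separation(f_low, f_high):
    """(Fundamental) frequency separation between two tones, in semitone units."""
    return 12.0 * math.log2(f_high / f_low)

def shift_semitones(f0, semitones):
    """Frequency obtained by shifting f0 upward by the given number of semitones."""
    return f0 * 2.0 ** (semitones / 12.0)
```

For example, shifting 200 Hz up by 4 semitones gives approximately 252 Hz, and by 16 semitones approximately 504 Hz.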
The sequence in each panel consists of low-pitch and high-pitch tones (Ls and Hs). The dashed lines represent perceptual connections between tones. (a) One-stream perception and (b) two-stream perception for an LHL_ tone pattern (“_” signifies a “silent gap”). (c) and (d) show similar examples for an HLH_ tone pattern.
Suppose a series of “LHL_” triplets is presented. If the tone sequence is perceptually grouped, then a listener will hear one stream with a galloping rhythm (LHL_LHL_LHL_...). In contrast, if the tone sequence is perceptually segregated into two streams, a listener will hear a low-pitch-tone sequence with a short gap in between (L_L_L_...) and a high-pitch-tone sequence with a long gap in between (H___H___H___...).
One way to increase the probability of segregation is to widen the frequency separation between Ls and Hs [4]; another is to prolong a sequence so that segregation builds up [5,6]. With an induction sequence prior to a test sequence consisting of three sets of LHL_ tone triplets, Haywood and Roberts [13–15] reported a build-up in sequential stream segregation in participants’ responses to the final triplets. At the same time, with only three sets of triplets, they found that a separation of 14 semitones between L and H tones caused segregation around 75%–90% of the time, while a separation of four semitones caused almost no segregation. With a 10-semitone separation, the percentages of segregation responses came in between (around 25%–55%).
Harmonic complex tones have been used in some studies on sequential stream segregation. A harmonic complex tone consists of a series of pure tones called harmonics with frequencies that are integer multiples of a fundamental frequency, i.e., the lowest frequency. The pitch of a harmonic complex tone usually corresponds to the pitch of the fundamental; however, even if the fundamental frequency component is missing, the pitch of the harmonic complex tone does not change. This is called the missing fundamental phenomenon. When one modifies the level balance of frequency components in a harmonic complex tone, timbre varies, although this is not the only factor that affects the timbre of a harmonic complex tone.
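The synthesis described above can be sketched as a sum of sine-phase components at integer multiples of the fundamental. This is a hedged illustration; the parameter values are placeholders rather than the paper's exact settings:

```python
import math

def harmonic_complex(f0, n_harmonics, dur, fs=44100):
    """Equal-amplitude harmonic complex tone, components added in sine phase.

    Harmonic h has frequency h * f0; returns a list of samples."""
    n_samples = int(dur * fs)
    return [
        sum(math.sin(2.0 * math.pi * (h * f0) * (t / fs))
            for h in range(1, n_harmonics + 1))
        for t in range(n_samples)
    ]
```

Omitting the h = 1 component from the sum changes the spectrum, but perceptually the pitch stays at f0 (the missing fundamental phenomenon).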
Fundamental frequency separation is a prominent cue for segregation when harmonic complex tones have clear pitches with resolvable harmonics, that is, usually the first five to eight harmonics [23,24]. In addition, even without resolvable harmonics, fundamental frequency separation between bandpass-filtered harmonic complex tones induces stream segregation [23,25–27]. A sequence of amplitude-modulated noises may be perceptually segregated based on pitch separation [28–30].
Other factors related to timbre can influence stream segregation: differences in spectral regions [4,7,27,31], the number of harmonics [9], the spectral composition between odd- and even-numbered harmonics [8], spectral peaks [32], spectral slopes [33–35], and the amplitude envelope patterns across harmonics, i.e., spectrum variation over time [12].
Some studies have focused on the combined effects of pitch and timbre differences on sequential stream segregation. Pitch and timbre have been revealed to be competitive [7,32] or interactive [34] in stream segregation. Thus, the effects of pitch and timbre on stream segregation may cancel each other out. We call this relationship hereafter a trade-off between pitch and timbre in stream segregation. In these studies, timbre was manipulated by shifting a set of four consecutive harmonics [7], moving a spectral peak [32], or tilting the overall spectral slope [34].
However, these studies left open the question of how timbre differences caused by specific patterns of spectral shapes affect stream segregation. Recent investigations on degraded speech perception provide clues for extending previous investigations on sequential stream segregation. It has been established that speech intelligibility varies drastically when speech sentences are interrupted in time and frequency [36,37]. Ueda et al. call such speech checkerboard speech because its spectrogram looks like a checkerboard. The intelligibility of checkerboard speech varies drastically depending on the combination of the number of frequency bands and segment duration. The intelligibility of 20- or 16-band checkerboard speech is almost at ceiling, irrespective of segment duration. However, the intelligibility of two- or four-band checkerboard speech is highest (more than 90%) at a 20-ms segment duration, lowest (around 35%–40%) at around 80–160 ms, and moderate (more than 50%) at 320 ms. We may conjecture that spectrotemporal interruption with two or four frequency bands and 80–160-ms segment duration promotes auditory segregation, hampers integration, and reduces intelligibility.
To test this hypothesis, it is necessary to examine how the auditory system integrates or segregates nonspeech stimuli, rapidly switching in time and frequency. Among variables that can be included, we selected fundamental frequency shifts and harmonic number band switching. The fundamental frequency shifts result in pitch shifts, and harmonic number band switching varies timbre. To realize drastic changes in timbre, we group harmonic components into several bands based on harmonic numbers and remove harmonics in every other band. We call these stimuli stripe tones. Furthermore, two extreme combinations of pitch and timbre shift directions—a congruent shift and an incongruent shift—are constructed to capture the whole picture of the possible combinations of these variables. The congruent shift is defined as the combination of upward fundamental frequency shifts and an odd-numbered-band tone to an even-numbered-band tone shift and vice versa, and the incongruent shift is defined as the combination of upward fundamental frequency shifts and an even-numbered-band tone to an odd-numbered-band tone shift and vice versa.
We constructed stripe tone sequences with the LHL_ and HLH_ triplet patterns. LHL_ tone sequences were combined with congruent shifts in experiment 1 and incongruent shifts in experiment 2 (Fig 2). HLH_ tone sequences were combined with congruent shifts in experiment 3 and incongruent shifts in experiment 4 (Fig 3).
Stripe tones are harmonic complex tones divided into 2, 4, 8, or 16 bands. Frequency components are grouped into several bands based on harmonic numbers (Tables 1 and 2), and the components in every other band are removed. The tone duration is 80 ms. Two possible spectral patterns (with only odd- or even-numbered bands) are switched between L and H tones to make a sequence. An LHL_ (“_” denotes a “silent gap”) triplet is presented three times with 80-ms gaps. The columns are in the order of 2-, 4-, 8-, and 16-band stimuli from left to right. The upper two rows (a–h) represent congruent sequences used in experiment 1 (fundamental frequencies and spectral patterns move in the same direction), and the lower two rows (i–p) represent incongruent sequences used in experiment 2 (fundamental frequencies and spectral patterns move in the opposite direction). The first and third rows from the top (a–d and i–l) represent stimuli with a four-semitone fundamental frequency separation. The second and fourth rows (e–h and m–p) represent stimuli with a 16-semitone separation. The L-tone fundamental frequency is fixed at 200 Hz. The H-tone fundamental frequency for the four-semitone separation is 252 Hz, and the 16-semitone separation is 504 Hz.
Congruent and incongruent sequences (a–h and i–p) were used in experiments 3 and 4. The columns are in the order of 2-, 4-, 8-, and 16-band stimuli from left to right. The first and third rows (a–d and i–l) represent stimuli with a four-semitone separation, and the second and fourth rows (e–h and m–p) represent stimuli with a 16-semitone separation.
The following three research questions were raised.
- Do timbre differences in stripe tones affect the percentages of segregation responses?
- Does congruency in shift directions of pitch and timbre for stripe tone sequences affect percentages of segregation responses?
- Do the stripe tone sequence patterns, LHL_ and HLH_, interact with the congruency of pitch and timbre shift directions?
With these research questions, we planned to clarify the trade-off between pitch and timbre in sequential stream segregation with stimuli characterized by stripe-like spectral patterns. We found that timbre, manipulated via the number of bands, dominated the results when the number of bands was small (cf. Jhang et al. [38]).
Method
Participants
Eighteen paid listeners (age range: 20–28; mean age: 22.7) with normal hearing (audiometric thresholds ≤ 25 dB HL at every octave point from 250 to 8000 Hz, screened with an audiometer, Rion AA-58, Rion Corp., Kokubunji, Japan) participated in the series of experiments. Participants were recruited from Japanese-native students on the Ohashi campus of Kyushu University. The recruitment period for experiments 1 and 2 was from June 15 to July 20, 2022, and for experiments 3 and 4 from September 12 to October 8, 2022. Absolute pitch possessors were screened out based on self-reports. All participants but one had extracurricular musical training (10.1 years on average). This research was conducted with prior approval of the Ethics Committee of Kyushu University (approval ID: 72). All participants provided written informed consent.
Conditions
Both pure tones and (full-band) harmonic complex tones with 35 consecutive harmonics were used as control stimuli (Fig 4) as well as exemplars (Fig 5). Stripe tones were used as experimental stimuli.
LHL_ sequences with (a) 4-, (b) 10-, and (c) 16-semitone separation were used in both experiments 1 and 2. HLH_ sequences with (d) 4-, (e) 10-, and (f) 16-semitone separation were used in both experiments 3 and 4. Pure-tone stimuli were also used but are not shown here.
(a) The LHL_ audio exemplar for one stream (integration) with 2-semitone separation and (b) the LHL_ audio exemplar for two streams (segregation) with 18-semitone separation were used in both experiments 1 and 2. HLH_ exemplars for one stream and two streams (c and d) with 2- and 18-semitone separation were used in both experiments 3 and 4. Note that the 2-semitone separation, as in Fig 5(a and c), was narrower than the 4-semitone separation, which was the narrowest (fundamental) frequency separation used in measuring segregation. Likewise, the 18-semitone separation, as in Fig 5(b and d), was wider than the 16-semitone separation, which was the widest (fundamental) frequency separation used in measuring segregation. Pure-tone audio exemplars were also used (not shown here).
Exemplars’ (fundamental) frequency separations were 2 semitones for “one-stream” exemplars and 18 semitones for “two-stream” exemplars. Three variables were manipulated in the control conditions: (fundamental) frequency separation (4, 10, and 16 semitones), control stimulus type (pure tone and full-band harmonic complex tone), and tone sequence pattern (LHL_ and HLH_). The range of (fundamental) frequency separations was determined referring to Haywood and Roberts [13–15] and the results of our pilot experiment.
Four variables were manipulated in the stripe-tone conditions: fundamental frequency separation (4, 10, and 16 semitones), number of bands (2, 4, 8, and 16 bands in Table 1), congruency (congruent and incongruent), and tone sequence pattern (LHL_ and HLH_).
Table 2 shows the correspondence between band numbers and harmonic numbers for stripe tones. Congruency refers to whether or not the direction of fundamental frequency shifts coincides with the direction of spectral pattern movements: congruent as in Figs 2(a)–2(h) and 3(a)–3(h), or incongruent as in Figs 2(i)–2(p) and 3(i)–3(p). For the full set of stimulus spectrograms, see S1–S6 Figs. For each experiment, six control conditions [three steps of (fundamental) frequency separation × two control stimulus types] and 12 stripe-tone conditions (three steps of fundamental frequency separation × four steps of the number of bands) were constructed. Thus, 18 conditions were prepared for each experiment.
Stimuli
All stimuli were generated with custom software written in J language [41], with a sampling frequency of 44100 Hz and a quantization of 16 bits. Harmonic complex tone stimuli were generated by adding equal-amplitude components in sine phase. L (fundamental) frequency was fixed at 200 Hz. H (fundamental) frequencies for 2-, 4-, 10-, 16-, and 18-semitone separations were 225, 252, 356, 504, and 566 Hz. The stimulus and gap duration was 80 ms, including 10-ms rise and fall time with raised-cosine amplitude envelopes for a stimulus.
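Putting these specifications together, stripe-tone generation can be sketched as follows. This is only a hedged illustration: the exact band boundaries of Table 2 are not reproduced here, so the 35 harmonics are simply split into near-equal contiguous groups by harmonic number; the equal-amplitude sine-phase synthesis, 80-ms duration, and 10-ms raised-cosine ramps follow the description above.

```python
import math

FS = 44100  # sampling frequency (Hz), as in the paper

def stripe_tone(f0, n_bands, keep_odd=True, n_harmonics=35,
                dur=0.080, ramp=0.010):
    """Sketch of a stripe tone: group 35 harmonics into n_bands contiguous
    groups by harmonic number, keep every other group, and synthesize the
    kept components in sine phase with raised-cosine onset/offset ramps.
    NOTE: near-equal contiguous grouping is an assumption; the paper's
    Table 2 defines the actual harmonic-number groupings."""
    kept = []
    for h in range(1, n_harmonics + 1):
        band = (h - 1) * n_bands // n_harmonics   # 0-based band index
        if (band % 2 == 0) == keep_odd:           # band 0 = "odd-numbered" band 1
            kept.append(h)
    n = int(dur * FS)
    n_ramp = int(ramp * FS)
    wave = []
    for t in range(n):
        s = sum(math.sin(2.0 * math.pi * f0 * h * t / FS) for h in kept)
        if t < n_ramp:                            # raised-cosine rise
            s *= 0.5 * (1.0 - math.cos(math.pi * t / n_ramp))
        elif t >= n - n_ramp:                     # raised-cosine fall
            s *= 0.5 * (1.0 - math.cos(math.pi * (n - 1 - t) / n_ramp))
        wave.append(s)
    return kept, wave
```

With this grouping, a 4-band odd-numbered-band tone retains harmonics 1–9 and 19–27; the complementary even-numbered-band tone retains the rest.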
The root-mean-square (RMS) amplitude of each stimulus was equalized. A pure tone of 1000 Hz with the same RMS amplitude was used as a calibration tone. The sound pressure level (SPL) of the calibration tone was adjusted to 70 dB (A). The SPLs at headphone outputs were measured with an artificial ear (Brüel & Kjær type 4153, Brüel & Kjær Sound & Vibration Measurement A/S, Nærum, Denmark), a condenser microphone (Brüel & Kjær type 4192), and a sound level meter (Brüel & Kjær type 2260). Each stimulus SPL was confirmed with a 60-second tone of the same frequency composition.
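The RMS equalization amounts to a simple per-stimulus scaling; a minimal sketch (the target value here is arbitrary, since the actual playback level was set via the 70 dB (A) calibration tone):

```python
import math

def equalize_rms(wave, target_rms):
    """Scale a waveform so that its root-mean-square amplitude equals target_rms."""
    rms = math.sqrt(sum(x * x for x in wave) / len(wave))
    return [x * (target_rms / rms) for x in wave]
```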
Procedure
The experiment was conducted in a double-walled sound-attenuating booth (Music Cabin SD3, Takahashi Kensetsu, Kawasaki, Japan). The stimuli were diotically presented to participants through headphones (Beyerdynamic DT 990 PRO, Beyerdynamic GmbH, Heilbronn, Germany). Custom software written with the LiveCode package [42] was used to present the stimuli and to record participants’ responses. The headphones were driven with a universal serial bus (USB) interface (Roland Rubix24, Roland Corp., Shizuoka, Japan) and an amplifier (Luxman, L-505f, Luxman Corp., Yokohama, Japan).
The experimenter explained the concepts of one stream and two streams to the participants using schematic illustrations (Fig 6), written explanations, and audio exemplars. First, participants were instructed that they would hear exemplars of a clearly integrated sequence called “one stream.” Then, two-semitone exemplars with pure and full-band tones were presented. Subsequently, they were told that they would hear exemplars that sounded like a distinctly segregated sequence called “two streams,” and 18-semitone exemplars were presented. Participants could ask for presentations to be repeated.
Schematic illustrations for an LHL_ sequence were used in experiments 1 and 2: (a) a test sequence, (b) perception of one stream, and (c) perception of two streams. Similar schematic illustrations were used for an HLH_ sequence in experiments 3 and 4 (d–f). The dashed lines represent perceptual connections between tones. Participants were instructed to focus on the final triplet (marked with squares) and report whether they heard one stream or two streams. Reproduced with translations from Japanese to English.
The participants were instructed to focus on the final triplet of the three-triplet sequences and to report whether they heard one stream or two streams (Fig 6). They were also instructed to avoid trying to listen specifically for either integration or segregation, but rather simply report which of the two percepts was more dominant. Participants were instructed to select a “One” or “Two” response button on a computer screen. Their selections were confirmed with a message box each time, providing an opportunity to correct an erroneous input.
Each trial started after a three-second pause to eliminate the build-up effect on stream segregation from previous trials. Each trial block consisted of 18 trials of conditions that were randomly ordered. After two practice trial blocks, 20 main trial blocks were run. Thus, each condition was measured 20 times. Participants were allowed to take a break between the blocks.
A group of nine participants, including the participant who had no extracurricular musical training, was assigned to the experiments in the order of 1, 2, 4, and 3. Another group of nine participants was assigned to the experiments in the order of 2, 1, 3, and 4; however, one participant in this group did not participate in experiments 3 and 4. Therefore, the data from this participant were excluded from the following analyses.
The series of experiments was conducted on different days for each participant. Each participant completed four experiments within 110 days, taking 60 to 100 minutes to complete each one. Experiments 1 and 2 were conducted within 15 days. Then, after more than 44 days, experiments 3 and 4 were conducted within 17 days.
Statistical method
Statistical analysis with a generalized linear mixed model (GLMM) was performed with a logit link function as implemented in JMP Pro [43]. The results of control and stripe-tone conditions were separately analyzed. Regarding the control conditions, the data were analyzed for the fixed effects of (fundamental) frequency separation, control stimulus type, tone sequence pattern (all categorical predictors), and their interactions. The analysis model with these fixed effects was fitted to the data with the following candidate sets of random effects: (1) congruency and participant, (2) congruency and extracurricular musical training nested under participant, (3) congruency and trial block (i.e., block number 1–20 in each experiment) nested under participant, and (4) congruency, trial block nested under participant, and experiment order.
Regarding the stripe-tone conditions, the data were analyzed for the fixed effects of fundamental frequency separation, number of bands, congruency, tone sequence pattern (all categorical predictors), and their interactions. The analysis model with these fixed effects was fitted to the data with the following candidate sets of random effects: (1) participant, (2) extracurricular musical training nested under participant, (3) trial block nested under participant, and (4) trial block nested under participant and experiment order. An appropriate model was selected by examining twice the negative of the residual log pseudo-likelihood (–2ResidualLogPseudo-Likelihood) and the ratio of the generalized χ² statistic to its degrees of freedom (generalized χ²/df). The statistical power of the fixed effects was estimated with JMP Pro [43] based on 1000 simulations at an alpha level of 0.05. Post-hoc multiple comparisons with Tukey’s honestly significant difference (HSD) tests were conducted.
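For intuition, the logit link used in the GLMM maps the probability of a “two streams” response to log-odds, which the model treats as a linear function of the predictors. A minimal illustration of the link and its inverse (not the JMP implementation):

```python
import math

def logit(p):
    """Logit link: maps a response probability to log-odds."""
    return math.log(p / (1.0 - p))

def inv_logit(x):
    """Inverse logit: maps the model's linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))
```

A probability of 0.5 maps to log-odds of 0, and the inverse recovers the probability from the linear predictor.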
Results
Fig 7 shows the results of experiments 1–4.
The control conditions are irrelevant to congruency. Data points were slightly shifted along the abscissa to show error bars reflecting standard error of the mean (SEM).
Control conditions
The percentages of segregation responses for pure and full-band harmonic complex tones increased with widening (fundamental) frequency separations. Full-band harmonic complex tones showed higher percentages of segregation responses than pure tones except for the four-semitone separation. The effect of tone-sequence patterns was negligible.
These observations were supported by GLMM analysis. A model with random effects of congruency and trial block nested under participant was selected, based on the smallest indices (–2ResidualLogPseudo-Likelihood = 56803.2). The model revealed fixed effects of (fundamental) frequency separation [p < 0.001, statistical power = 0.998], control stimulus type [statistical power = 0.886], and their interaction [p = 0.002, statistical power = 0.731]. No other fixed effects reached p levels < 0.05 (S1 Table). Multiple comparisons with Tukey’s HSD tests revealed that full-band harmonic complex tones showed greater percentages of segregation responses than pure tones at 10-semitone (t = −18.49) and 16-semitone separation (t = −15.21), but not at 4-semitone separation (t = 0.6).
Stripe-tone conditions
The results of stripe-tone conditions showed the interaction among fundamental frequency separation, number of bands, and congruency. In the congruent conditions [Fig 7(a) and 7(c)], when the number of bands was small, the percentages of segregation responses were constantly high irrespective of fundamental frequency separations. Conversely, when the number of bands became larger, the percentages of segregation responses depended more and more on the fundamental frequency separations. By contrast, the incongruent combinations of the fundamental frequency shifts and the spectral pattern movements [Fig 7(b) and 7(d)] produced a strong interaction effect between fundamental frequency separation and number of bands. The interaction was most obvious between the 8-band stimuli and 16-band stimuli.
Fig 8 clearly shows the effect of congruency on the results. The segregation responses for the 8-band stimuli dropped in the incongruent conditions (especially with 10-semitone separation). In contrast, the segregation responses for the 16-band stimuli increased a bit in the incongruent conditions with 10- and 16-semitone separations. Tone sequence patterns had almost no effect on the results.
The data in Fig 7 were replotted. (a–d) 2-, 4-, 8-, and 16-band stimuli. Error bars reflect SEM.
These observations were supported by GLMM analysis. A model with random effects of trial block nested under participant and experiment order was selected, based on –2ResidualLogPseudo-Likelihood = 132565.59. The model revealed a two-way interaction effect of number of bands × fundamental frequency separation [statistical power = 0.874] and a three-way interaction effect of congruency × number of bands × fundamental frequency separation [statistical power = 0.926]. No other fixed effects reached p levels < 0.05 (S2 Table). Tukey’s HSD tests revealed that congruent 8-band stimuli were more frequently judged to be segregated than incongruent 8-band stimuli at 4-semitone (t = 11.99) and 10-semitone separation (t = 12.51). By contrast, with 16 bands, incongruent stimuli were more frequently judged to be segregated than congruent stimuli at 10-semitone (t = −7.88) and 16-semitone separation (t = −6.84). However, congruent 16-band stimuli may have been more frequently judged to be segregated at 4-semitone separation (t = 3.66, p = 0.046). All other comparisons at 4-, 10-, and 16-semitone separations between the corresponding conditions concerning congruency and number of bands resulted in p levels exceeding 0.05. Moreover, incongruent 8-band stimuli were more frequently judged to be segregated than incongruent 16-band stimuli at the 4-semitone separation (t = 13.78); however, incongruent 16-band stimuli were more frequently judged to be segregated than incongruent 8-band stimuli at 10-semitone (t = −13.19) and 16-semitone separations (t = −9.02).
Discussion
Replications
A larger (fundamental) frequency separation, thus a greater pitch difference, caused more frequent segregation responses for both pure-tone and full-band stimuli. Additionally, full-band stimuli with many harmonic components tended to be more segregated than pure-tone stimuli. Rajasingam et al. [19] and Roberts and Haywood [21] found that two-component complex tones were perceived to be more segregated than pure tones. The current results seem to be in line with those of previous studies.
Pitch and timbre trade-off in stream segregation
The interaction effect of fundamental frequency separation, number of bands, and congruency was observed for stripe-tone conditions. This means that the pitch and timbre trade-off on sequential stream segregation was observed overall. The most obvious trade-off between pitch and timbre was observed with the eight-band stimuli. These results suggest that the auditory system captures the spectral patterns switching every 80 ms and groups the same patterns as a stream. However, the auditory system has difficulty combining blocks of components dispersed over the spectrotemporal domain. When the number of bands increased further, the stimuli got closer to full-band stimuli, leading to similar response patterns.
Previous findings on a trade-off between pitch and timbre by shifting four consecutive harmonics [7], shifting a single formant peak [32], and tilting spectral slopes [34] are consistent with the current findings. Moreover, we showed that timbre can be a strong cue for stream segregation against a pitch difference as small as four semitones, which normally leads to an integrated percept of the three-triplet tone sequences [13–16]. Manipulating the number of bands was reflected in the percentages of segregation responses. Thus, the first research question, “Do timbre differences in stripe tones affect the percentages of segregation responses?” was answered with “Yes.”
The answer to the second research question, “Does congruency in shift directions of pitch and timbre for stripe tone sequences affect percentages of segregation responses?” is that “It depends on the number of bands.” No effect of congruency was observed for two- and four-band stimuli: These stimuli were always perceived to be segregated, irrespective of the sizes of pitch separations and congruency. Listeners perceived congruent eight-band stimuli as segregated most of the time. However, they reported segregation less frequently for incongruent eight-band stimuli. In contrast, their proportions of segregation responses varied mainly according to the sizes of pitch separations for 16-band stimuli: Segregation responses increased as pitch separations became greater. Proportions of segregation responses became greater for incongruent stimuli than for congruent stimuli with 10 or 16 semitone separations, very probably due to the absence of any frequency components in the lowest band of L tones for the congruent 16-band stimuli [Table 2 and Fig 2(h)]. Still, the differences due to congruency were slight. Thus, when the number of bands is two or four, prominent timbre differences between odd- and even-numbered-band stripe tones should mainly govern perceptual segregation. When the number of bands becomes eight or more, timbre differences between odd- and even-numbered-band stripe tones become less competitive with pitch differences, leading to the segregation responses being more governed by pitch differences.
During the revision of the current paper, we learned that another paper of ours on sequential stream segregation using band tones had been accepted for publication [38]. Band tones are harmonic complex tones that consist of odd- or even-numbered bands with fixed passbands (comparable to Table 1), unlike the fixed harmonic number groups used for stripe tones (Table 2). With band tones, listeners perceived eight-band tones as segregated most of the time, except for the congruent four-semitone separation. Moreover, with eight bands, segregation responses were reported more often for band tones than for stripe tones. Thus, the current results indicate that the trade-off between pitch and timbre on stream segregation appeared with a sharper contrast for stripe tones than for band tones, suggesting that the effects of pitch and timbre differences on perceptual segregation and integration are roughly balanced around eight-band stimuli.
The answer to the third question, “Do the stripe tone sequence patterns, LHL_ and HLH_, interact with the congruency of pitch and timbre shift directions?” was “No.” Tone sequence patterns had practically no effect on segregation, replicating previous results by Thomassen et al. [20] with pure tones.
Whether the participants’ responses were based on pitch, timbre, or both is beyond the scope of the current investigation, because the experimental task was simply to judge whether they perceived one stream or two streams. Nevertheless, it is worth considering that the current experimental paradigm mimics a two-talker or two-instrument alternation with two pitches. When the number of bands is small, the situation resembles a two-talker or two-instrument alternation, and the auditory system tends to segregate two streams. By contrast, as the number of bands increases, the situation becomes more similar to a single-talker or single-instrument alternation, and the auditory system integrates the two tones into one stream more frequently, depending on the fundamental frequency separations and thus pitch separations. Evidently, the auditory system shows limits in integrating spectrotemporally scattered harmonics.
Conclusions
The effects of pitch and timbre separation on sequential stream segregation were investigated. To create the timbral separation, harmonic complex tones with 35 frequency components were divided into 2 to 16 bands based on harmonic numbers, harmonics in every other band were removed, and the resulting two possible stripe-like patterns were alternated from tone to tone. The stimuli with a few bands elicited strong stream segregation against pitch proximity. By contrast, the results for the stimuli with 16 bands were dominated by pitch separation, similar to full-band control stimuli. The trade-off between pitch and timbre on stream segregation appeared most clearly in the results for eight-band stimuli. The results suggest that the auditory system captures rapidly changing spectral patterns and groups sounds with similar spectral patterns. At the same time, the results also suggest that the auditory system has limits in integrating blocks of frequency components dispersed over frequency and time with a small number (four or fewer) of bands. The current investigation formed a basis for further investigations on detecting an auditory target in a noisy background.
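The stripe-tone construction described above can be illustrated with a minimal synthesis sketch. This is not the authors’ stimulus-generation code (which used J language routines, with band groupings as specified in Table 2); it simply assumes, for illustration, that the 35 harmonic numbers are split into near-equal consecutive groups, that all components have equal amplitude, and that no onset/offset ramps or level calibration are applied.

```python
import numpy as np

def stripe_tone(f0, n_bands, keep_odd_bands, n_harmonics=35,
                dur=0.04, fs=44100):
    """Synthesize one stripe tone: harmonics 1..n_harmonics are grouped
    into n_bands consecutive groups by harmonic number, and only every
    other group (odd- or even-numbered bands) is retained."""
    t = np.arange(int(dur * fs)) / fs
    # Split harmonic numbers into n_bands near-equal consecutive groups
    # (an illustrative assumption; the paper defines groups in Table 2).
    groups = np.array_split(np.arange(1, n_harmonics + 1), n_bands)
    wave = np.zeros_like(t)
    for i, group in enumerate(groups):
        band_is_odd = (i % 2 == 0)  # group index 0 is the first (odd) band
        if band_is_odd == keep_odd_bands:
            for h in group:
                wave += np.sin(2 * np.pi * f0 * h * t)
    return wave

# The two complementary stripe patterns alternate from tone to tone:
low_odd  = stripe_tone(200.0, 4, keep_odd_bands=True)
low_even = stripe_tone(200.0, 4, keep_odd_bands=False)
```

Because the odd- and even-numbered bands partition the harmonic series, summing the two complementary stripe tones reconstructs the full-band harmonic complex tone, which is what makes them useful as timbre-contrasted but spectrally complementary stimuli.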
Supporting information
S1 Fig. Spectrograms of audio exemplars.
LHL_ triplets with pure tones and full-band harmonic complex tones (a–b and c–d) were used in both experiments 1 and 2. HLH_ triplets with pure tones and full-band harmonic complex tones (e–f and g–h) were used in both experiments 3 and 4. The left column shows exemplars with 2-semitone separation, illustrating the one-stream percept, and the right column shows exemplars with 18-semitone separation, illustrating the two-stream percept.
https://doi.org/10.1371/journal.pone.0323964.s001
(TIF)
S2 Fig. Spectrograms of control stimuli.
LHL_ triplets with pure tones and full-band harmonic complex tones (a–c and d–f) were used in both experiments 1 and 2. HLH_ triplets with pure tones and full-band harmonic complex tones (g–i and j–l) were used in both experiments 3 and 4. The columns are in the order of 4-, 10-, and 16-semitone separation from left to right.
https://doi.org/10.1371/journal.pone.0323964.s002
(TIF)
S3 Fig. Spectrograms of two-band stripe tones.
LHL_ triplets with congruent and incongruent stripe tones (a–c and d–f) were used in experiments 1 and 2. HLH_ triplets with congruent and incongruent stripe tones (g–i and j–l) were used in experiments 3 and 4.
https://doi.org/10.1371/journal.pone.0323964.s003
(TIF)
S4 Fig. Spectrograms of four-band stripe tones.
LHL_ triplets with congruent and incongruent stripe tones (a–c and d–f) were used in experiments 1 and 2. HLH_ triplets with congruent and incongruent stripe tones (g–i and j–l) were used in experiments 3 and 4.
https://doi.org/10.1371/journal.pone.0323964.s004
(TIF)
S5 Fig. Spectrograms of eight-band stripe tones.
LHL_ triplets with congruent and incongruent stripe tones (a–c and d–f) were used in experiments 1 and 2. HLH_ triplets with congruent and incongruent stripe tones (g–i and j–l) were used in experiments 3 and 4.
https://doi.org/10.1371/journal.pone.0323964.s005
(TIF)
S6 Fig. Spectrograms of 16-band stripe tones.
LHL_ triplets with congruent and incongruent stripe tones (a–c and d–f) were used in experiments 1 and 2. HLH_ triplets with congruent and incongruent stripe tones (g–i and j–l) were used in experiments 3 and 4.
https://doi.org/10.1371/journal.pone.0323964.s006
(TIF)
S1 Table. GLMM analysis summary for the control conditions.
https://doi.org/10.1371/journal.pone.0323964.s008
(PDF)
S2 Table. GLMM analysis summary for the stripe-tone conditions.
https://doi.org/10.1371/journal.pone.0323964.s009
(PDF)
Acknowledgments
The authors would like to thank Yoshitaka Nakajima for providing J language software routines, and Hikaru Eguchi for providing a valuable framework for the LiveCode program.
References
- 1. Bregman AS. Auditory scene analysis: the perceptual organization of sound. Cambridge, MA: MIT Press. 1990. https://doi.org/10.7551/mitpress/1486.001.0001
- 2. Bregman AS, Campbell J. Primary auditory stream segregation and perception of order in rapid sequences of tones. J Exp Psychol. 1971;89(2):244–9. pmid:5567132
- 3. Miller GA, Heise GA. The trill threshold. J Acoust Soc Am. 1950;22(5):637–8.
- 4. van Noorden LPAS. Temporal coherence in the perception of tone sequences [doctoral dissertation]. Eindhoven University of Technology. 1975.
- 5. Bregman AS. Auditory streaming is cumulative. J Exp Psychol Hum Percept Perform. 1978;4(3):380–7. pmid:681887
- 6. Anstis S, Saida S. Adaptation to auditory streaming of frequency-modulated tones. J Exp Psychol Hum Percept Perform. 1985;11(3):257–71.
- 7. Singh PG. Perceptual organization of complex-tone sequences: a tradeoff between pitch and timbre? J Acoust Soc Am. 1987;82(3):886–99. pmid:3655122
- 8. Hartmann WM, Johnson D. Stream segregation and peripheral channeling. Music Perception. 1991;9(2):155–83.
- 9. Singh PG, Bregman AS. The influence of different timbre attributes on the perceptual segregation of complex-tone sequences. J Acoust Soc Am. 1997;102(4):1943–52. pmid:9348673
- 10. Cusack R, Roberts B. Effects of differences in timbre on sequential grouping. Percept Psychophys. 2000;62(5):1112–20. pmid:10997053
- 11. Rose MM, Moore BC. Effects of frequency and level on auditory stream segregation. J Acoust Soc Am. 2000;108(3 Pt 1):1209–14. pmid:11008821
- 12. Cusack R, Roberts B. Effects of differences in the pattern of amplitude envelopes across harmonics on auditory stream segregation. Hear Res. 2004;193(1–2):95–104. pmid:15219324
- 13. Haywood NR, Roberts B. Build-up of the tendency to segregate auditory streams: resetting effects evoked by a single deviant tone. J Acoust Soc Am. 2010;128(5):3019–31. pmid:21110597
- 14. Haywood NR, Roberts B. Effects of inducer continuity on auditory stream segregation: comparison of physical and perceived continuity in different contexts. J Acoust Soc Am. 2011;130(5):2917–27. pmid:22087920
- 15. Haywood NR, Roberts B. Build-up of auditory stream segregation induced by tone sequences of constant or alternating frequency and the resetting effects of single deviants. J Exp Psychol Hum Percept Perform. 2013;39(6):1652–66. pmid:23688330
- 16. Rankin J, Osborn Popp PJ, Rinzel J. Stimulus pauses and perturbations differentially delay or promote the segregation of auditory objects: psychoacoustics and modeling. Front Neurosci. 2017;11:198. pmid:28473747
- 17. Rajasingam SL, Summers RJ, Roberts B. Stream biasing by different induction sequences: evaluating stream capture as an account of the segregation-promoting effects of constant-frequency inducers. J Acoust Soc Am. 2018;144(6):3409–20. pmid:30599694
- 18. Gustafson SJ, Grose J, Buss E. Perceptual organization and stability of auditory streaming for pure tones and /ba/ stimuli. J Acoust Soc Am. 2020;148(2):EL159–65. pmid:32873027
- 19. Rajasingam SL, Summers RJ, Roberts B. The dynamics of auditory stream segregation: effects of sudden changes in frequency, level, or modulation. J Acoust Soc Am. 2021;149(6):3769–84. pmid:34241493
- 20. Thomassen S, Hartung K, Einhäuser W, Bendixen A. Low-high-low or high-low-high? Pattern effects on sequential auditory scene analysis. J Acoust Soc Am. 2022;152(5):2758–68. pmid:36456271
- 21. Roberts B, Haywood NR. Asymmetric effects of sudden changes in timbre on auditory stream segregation. J Acoust Soc Am. 2023;154(1):363–78. pmid:37462404
- 22. Haywood NR, McAlpine D, Vickers D, Roberts B. Factors influencing stream segregation based on interaural phase difference cues. Trends Hear. 2024;28:23312165241293787. pmid:39654440
- 23. Grimault N, Micheyl C, Carlyon RP, Arthaud P, Collet L. Influence of peripheral resolvability on the perceptual segregation of harmonic complex tones differing in fundamental frequency. J Acoust Soc Am. 2000;108(1):263–71. pmid:10923890
- 24. Madsen SMK, Dau T, Moore BCJ. Effect of harmonic rank on sequential sound segregation. Hear Res. 2018;367:161–8. pmid:30006111
- 25. Grimault N, Micheyl C, Carlyon RP, Arthaud P, Collet L. Perceptual auditory stream segregation of sequences of complex sounds in subjects with normal and impaired hearing. Br J Audiol. 2001;35(3):173–82. pmid:11548044
- 26. Vliegen J, Oxenham AJ. Sequential stream segregation in the absence of spectral cues. J Acoust Soc Am. 1999;105(1):339–46. pmid:9921660
- 27. Vliegen J, Moore BC, Oxenham AJ. The role of spectral and periodicity cues in auditory stream segregation, measured using a temporal discrimination task. J Acoust Soc Am. 1999;106(2):938–45. pmid:10462799
- 28. Grimault N, Bacon SP, Micheyl C. Auditory stream segregation on the basis of amplitude-modulation rate. J Acoust Soc Am. 2002;111(3):1340–8. pmid:11931311
- 29. Hong RS, Turner CW. Sequential stream segregation using temporal periodicity cues in cochlear implant recipients. J Acoust Soc Am. 2009;126(1):291–9. pmid:19603885
- 30. Nie Y, Nelson PB. Auditory stream segregation using amplitude modulated bandpass noise. Front Psychol. 2015;6:1151. pmid:26300831
- 31. Roberts B, Glasberg BR, Moore BCJ. Primitive stream segregation of tone sequences without differences in fundamental frequency or passband. J Acoust Soc Am. 2002;112(5 Pt 1):2074–85. pmid:12430819
- 32. Bregman AS, Liao C, Levitan R. Auditory grouping based on fundamental frequency and formant peak frequency. Can J Psychol. 1990;44(3):400–13. pmid:2224643
- 33. Marozeau J, Innes-Brown H, Blamey PJ. The effect of timbre and loudness on melody segregation. Music Perception. 2012;30(3):259–74.
- 34. Oh Y, Zuwala JC, Salvagno CM, Tilbrook GA. The impact of pitch and timbre cues on auditory grouping and stream segregation. Front Neurosci. 2022;15:725093. pmid:35087369
- 35. Sauvé SA, Marozeau J, Zendel BR. The effects of aging and musicianship on the use of auditory streaming cues. PLoS One. 2022;17(9):e0274631. pmid:36137151
- 36. Ueda K, Kawakami R, Takeichi H. Checkerboard speech vs interrupted speech: effects of spectrotemporal segmentation on intelligibility. JASA Express Lett. 2021;1(7):075204. pmid:36154646
- 37. Ueda K, Doan LLD, Takeichi H. Checkerboard and interrupted speech: intelligibility contrasts related to factor-analysis-based frequency bands. J Acoust Soc Am. 2023;154(4):2010–20. pmid:37782122
- 38. Jhang G-Y, Ueda K, Takeichi H, Remijn GB, Hasuo E. Band tones: auditory stream segregation with alternating frequency bands. Acoust Aust. 2025.
- 39. Ueda K, Nakajima Y. An acoustic key to eight languages/dialects: factor analyses of critical-band-filtered speech. Sci Rep. 2017;7:42468. pmid:28198405
- 40. Zwicker E, Terhardt E. Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J Acoust Soc Am. 1980;68:1523–5.
- 41. J Software. The J programming language. Version J903 [computer language]; 2021. https://www.jsoftware.com
- 42. LiveCode. LiveCode Community. Version 9.6.3 [computer language]; 2021. https://livecode.com
- 43. SAS Institute Inc. JMP Pro. Version 18.1.0 [computer program]; 2024. https://www.jmp.com