Fig 1.
Stimulus schematic for the generalized Shepard tones used in Exp. 1 and 2.
(A) Spectral fine structure (SFS, blue) is illustrated here as a set of harmonic components at integer multiples of a fundamental frequency f0. SFS could be shifted in 1-semitone (st) steps, while manipulating the amplitude of odd harmonics at each step to induce perceptual circularity across a 12-st shift. Inharmonic stimuli were created by jittering the individual frequency components around their nominal values. (B) Spectral envelope (SE, red) had peaks at octave-related frequencies superimposed with a fixed bell-shaped envelope. SE could also be shifted in st steps and displayed exact acoustic circularity across a 12-st shift. (C) Example stimulus for a given combination of SFS and SE. Frequency components are constrained by the SE (red line) and follow the harmonic series of the SFS and its even-odd attenuation (indicated by blue lines). The resulting amplitude of the actual stimulus components is shown by black lines. Note that some components are lower than their nominal SE amplitude because of the even-odd attenuation. (D) Schematic depiction of an example trial with a two-dimensional shift of 6 st in the SFS dimension and 11 st in the SE dimension. After each trial, participants were asked to judge whether the sound pair moved up, down, both (up dominant), or both (down dominant). Stimuli for Exp. 3 were generated in a similar manner, except that the SFS was not harmonic but consisted of components at a fixed distance on a log-frequency scale, allowing for exact acoustic circularity for a shift corresponding to their log-frequency distance.
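The construction in panels A to C can be illustrated with a minimal sketch. It is not the authors' stimulus code: the function name, the Gaussian peak and bell shapes, and all numeric defaults (`se_base`, `peak_width_oct`, `bell_center`, `bell_width_oct`, `odd_gain`) are assumptions chosen only to show how an octave-periodic SE, a fixed bell-shaped envelope, and an odd-harmonic gain might combine into component amplitudes.

```python
import numpy as np

def component_amplitudes(f0, se_shift_st, odd_gain, n_harmonics=40,
                         se_base=50.0, peak_width_oct=0.15,
                         bell_center=1000.0, bell_width_oct=2.0):
    """Return (frequencies, amplitudes) of harmonic partials shaped by an SE.

    All shapes and default values are illustrative assumptions.
    """
    k = np.arange(1, n_harmonics + 1)
    freqs = k * f0  # harmonic series of the SFS
    # Position of each partial in octaves relative to the shifted SE base.
    base = se_base * 2.0 ** (se_shift_st / 12.0)
    octs = np.log2(freqs / base)
    # Signed distance (in octaves) to the nearest octave-related SE peak.
    dist = octs - np.round(octs)
    peaks = np.exp(-0.5 * (dist / peak_width_oct) ** 2)
    # Fixed bell-shaped envelope on a log-frequency axis (not shifted).
    bell = np.exp(-0.5 * (np.log2(freqs / bell_center) / bell_width_oct) ** 2)
    # Gain applied to odd harmonics (k = 1, 3, 5, ...); even harmonics keep gain 1.
    gain = np.where(k % 2 == 1, odd_gain, 1.0)
    return freqs, peaks * bell * gain
```

Under these assumptions, shifting the SE by 12 st leaves the amplitudes unchanged (exact circularity), and setting `odd_gain` to zero leaves only the even harmonics, that is, the harmonic series of 2·f0.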
Fig 2.
Proportion “down” responses for one-dimensional shifts of Exp. 1 for harmonic (A) and inharmonic (B) Shepard sounds.
Colored lines correspond to the fitted GLME model. Black vertical lines indicate 95% confidence intervals across participants obtained by bootstrapping.
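As an illustration of how such intervals might be obtained, the following sketch computes a percentile bootstrap interval of the mean by resampling participants with replacement. The percentile method, the function name, and the default of 2000 resamples are assumptions; the caption does not specify which bootstrap variant was used here.

```python
import numpy as np

def bootstrap_ci(per_participant, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of the mean across participants (assumed method)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(per_participant, dtype=float)
    # Each row of idx is one resample of participants, drawn with replacement.
    idx = rng.integers(0, data.size, size=(n_boot, data.size))
    means = data[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

For example, `bootstrap_ci([0.2, 0.4, 0.6, 0.8])` returns an interval around the sample mean of 0.5 for hypothetical per-participant proportions of "down" responses.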
Fig 3.
Proportion “down” responses for two-dimensional shifts of Exp. 2 for harmonic (A) and inharmonic (B) Shepard sounds.
Colored lines indicate the fixed effects of the GLME model. Black vertical lines indicate 95% confidence intervals obtained by bootstrapping.
Fig 4.
Proportion “down” responses in Exp. 3 with log-equidistant partials for one-dimensional shifts (A) and two-dimensional shifts (B).
Colored lines indicate the fixed effects of the GLME model. Black vertical lines indicate 95% confidence intervals obtained by bootstrapping.
Fig 5.
(A) Schematic of cue computation and integration, illustrated with a pair of harmonic sounds shifted by 11 and 1 st along the SFS and SE dimensions, respectively. For the AC cue, autocorrelations within each auditory band are computed and subsequently summed across bands. From the resulting summed autocorrelation function, peaks are extracted from a search range corresponding to 64–128 Hz (gray patch) and subtracted, yielding an estimate of the periodicity difference of the two sounds; positive differences indicate downward movement. CC cues are computed using the rms energy of bands across time, followed by a logarithm (conversion to the level domain) and thresholding below -30 dB peak level. The CCres cue considers the presumably resolved part of the excitation pattern below 905 Hz (geometric mean of the frequency of the 10th partial component across Exps. 1 and 2); the CCunres cue considers the unresolved part above that frequency limit. For both types of cues, the remaining excitation patterns are cross-correlated and the centroid (first moment) is computed in a search range (gray patches). Positive centroids indicate downward movement. Cues are integrated in a weighted sum (a positive sum is coded as “down”). (B) Possible cue weightings that sum to one. (C) Cue weights projected onto the x-y plane; the farther a point’s distance to a specific edge, the less weight the respective cue receives. The corresponding model fit is displayed on the z-axis for one human listener’s results in the inharmonic SFS-EN condition. The dashed line shows the optimal weights for this example.
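The AC cue and the weighted-sum integration in panel A can be sketched as follows. This is a simplified illustration, not the model itself: it omits the auditory filterbank and takes the autocorrelation of the broadband waveform. The sign convention (period of the second sound minus period of the first) is an assumption, as are the function names and the test signal.

```python
import numpy as np

def harmonic_complex(f0, fs=8000, n_samples=800, n_harmonics=5):
    """Hypothetical test signal: sum of the first few cosine harmonics of f0."""
    t = np.arange(n_samples) / fs
    return sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, n_harmonics + 1))

def ac_peak_lag(x, fs, fmin=64.0, fmax=128.0):
    """Lag (s) of the autocorrelation maximum within the fmin..fmax search range."""
    x = np.asarray(x, dtype=float)
    ac = np.correlate(x, x, mode="full")[x.size - 1:]  # keep non-negative lags
    lo = int(np.ceil(fs / fmax))   # shortest period in samples
    hi = int(np.floor(fs / fmin))  # longest period in samples
    return (lo + int(np.argmax(ac[lo:hi + 1]))) / fs

def ac_cue(first, second, fs):
    """Periodicity difference; positive means a longer period, coded as downward."""
    return ac_peak_lag(second, fs) - ac_peak_lag(first, fs)

def integrate_cues(cues, weights):
    """Weighted sum of cue values; a positive sum is coded as 'down'."""
    total = float(np.dot(weights, cues))
    return "down" if total > 0 else "up"
```

For instance, `ac_cue(harmonic_complex(100.0), harmonic_complex(80.0), 8000)` is positive because the period grows from 10 ms to 12.5 ms, and `integrate_cues([1.0, -0.5, 0.2], [0.5, 0.25, 0.25])` yields "down" because the weighted sum is positive.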
Fig 6.
Model results and human listeners across all experiments.
The y-axis shows the proportion of “down” responses. The x-axis indicates shift size; for two-dimensional shifts, pairs indicate shifts along the SFS and SE dimensions, respectively. The rows index the properties of the fine structure: harmonic and inharmonic (Exps. 1 and 2) and log-equidistant (Exp. 3). The columns index the shift type: one-dimensional SFS or SE, and two-dimensional SFS-SE. Average data from human listeners and the model are plotted in thick and thin colored lines, respectively. Raw predictions from the AC, CCres, and CCunres cues are shown as indicated in the legend. R² values are given for the correlation between average model and empirical data (omitted for log-equidistant SFS and SE shifts of Exp. 3 due to an insufficient number of shifts). R² values for raw cues are provided in Table E in S1 Text.
Fig 7.
Generalization across shifts and participants.
(A) R² values of models fitted on a subset of shifts (x-axis) and evaluated on the full set of shifts. Thin gray lines display data from individual participants; the thick blue line shows the average across participants. For every subset size, all possible combinations of shifts were used, and the average across all combinations is shown. (B) R² values of models fitted for one participant (x-axis) and tested on another participant (y-axis).
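The procedure in panel A can be sketched as a loop over all subsets of a given size. The helper names and the `fit`/`predict` callables are hypothetical placeholders for the actual cue-weight fitting, and R² is assumed here to be the squared Pearson correlation between predictions and observations.

```python
from itertools import combinations

import numpy as np

def r_squared(pred, obs):
    """Squared Pearson correlation between predictions and observations."""
    return float(np.corrcoef(pred, obs)[0, 1] ** 2)

def subset_generalization(n_shifts, k, fit, predict, observed):
    """Fit on every k-element subset of shifts, score on all shifts, average R^2."""
    scores = []
    for subset in combinations(range(n_shifts), k):
        params = fit(list(subset))       # fit using only the selected shifts
        scores.append(r_squared(predict(params), observed))  # evaluate on the full set
    return float(np.mean(scores))
```

Here `fit` receives the indices of the shifts used for fitting and `predict` returns predictions for all shifts; any model with that interface could be plugged in.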
Fig 8.
Cue weights for all conditions and participants with panels sorted as in Fig 7.
Because they are constrained to sum to one, the cue weights lie on a two-dimensional plane bounded by a triangle. Points in this triangular space correspond to weightings of the three cues AC, CCres, and CCunres. Large symbols correspond to the medians of the optimal weighting for individual listeners as computed from 2000 bootstrap iterations. Gray symbols show the spread of bootstrap samples (darker shading indicates more overlap between bootstrap iterations; specifically, log10 of the empirical density function of the bootstrap distribution is plotted).
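The mapping from three weights that sum to one onto a point in a triangle can be sketched with barycentric coordinates. The corner placement and the order of the cues (AC, CCres, CCunres) are arbitrary assumptions for illustration and need not match the layout used in the figure.

```python
import numpy as np

# Corners of an equilateral triangle, one per cue (assumed order: AC, CCres, CCunres).
CORNERS = np.array([[0.0, 0.0],
                    [1.0, 0.0],
                    [0.5, np.sqrt(3.0) / 2.0]])

def weights_to_xy(weights):
    """Map three non-negative weights summing to one onto 2-D triangle coordinates."""
    w = np.asarray(weights, dtype=float)
    if w.shape != (3,) or np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("expected three non-negative weights that sum to one")
    # Barycentric combination: each weight pulls the point toward its corner.
    return w @ CORNERS
```

A weighting of `[1, 0, 0]` lands on the first corner, and equal weights of 1/3 land on the centroid of the triangle.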