Multiple timescales of context influence perceptual sensitivity to common pairings of musical pitch and timbre

Christian E. Stilp; Isabel Adames; Anya E. Shorey

doi:10.1371/journal.pone.0328490

Abstract

Previous studies have established that musical pitch and timbre (specifically, spectral shape) perceptually covary: lower pitches are associated with darker timbres (less higher-frequency energy) and higher pitches are associated with brighter timbres (more higher-frequency energy). In four experiments, perceptual sensitivity to this relationship was assessed in pitch labeling tasks when instrument timbre varied in ways that respected or violated this pattern (Consistent or Reversed trials). Performance was influenced by context at multiple timescales: block-level (stimulus type), experimental session-level (block order or configuration), and longer-term experience (musical training background). Across experiments, participants performed near ceiling accuracy for Consistent stimuli, but were less accurate for Reversed stimuli. This pattern was moderated by which condition was tested first in the experiment, the introduction of trial-by-trial feedback, and presentation of trials in blocked versus interleaved orders. Higher musical training scores were generally associated with higher accuracy on Consistent trials but were more reliably and more strongly associated with higher accuracy on Reversed trials. Thus, context on multiple timescales can shape perceptual sensitivity to the natural covariance between musical pitch and timbre. Results advance the efficient coding hypothesis by demonstrating how listener factors can modulate perceptual sensitivity to statistical structure in the acoustic environment.

Citation: Stilp CE, Adames I, Shorey AE (2025) Multiple timescales of context influence perceptual sensitivity to common pairings of musical pitch and timbre. PLoS One 20(7): e0328490. https://doi.org/10.1371/journal.pone.0328490

Editor: Gavin M. Bidelman, Indiana University, UNITED STATES OF AMERICA

Received: January 3, 2025; Accepted: July 1, 2025; Published: July 18, 2025

Copyright: © 2025 Stilp et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Stimuli, deidentified processed data, and analysis scripts for all main experiments are available in an Open Science Framework repository, https://osf.io/fpj8q/.

Funding: This work was partially supported by National Institutes of Health, National Institute on Deafness and Other Communication Disorders, Grant No. R01 DC020303. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. There was no additional external funding received for this study.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Natural sounds are acoustically complex, varying along many different acoustic dimensions. While myriad studies have demonstrated perceptual difficulties in the face of acoustic variability, here we focus on the fact that any such dimensions (and their respective variabilities) are not necessarily independent of one another. Studies in both speech and music perception have documented instances of perceptual interference, in which perception of a sound property or identity is impeded by concurrent variability in another feature. In speech perception, recognition of a word is impeded by variability in who spoke it [1–4] or how it was produced [5–6]. Likewise, recognition of a talker is impeded by variability in the words being spoken [1]. In music perception, recognition of the pitch of a tone or chord is impeded by variability in the instrument that produced it [7–12, but see 13–14]. Likewise again, timbre-based judgments are challenged by concurrent pitch variability [8–9, 15–16; also see 17].

According to the efficient coding hypothesis, sensory systems have adapted and evolved to be highly sensitive to structure in the sensory environment [18–19]. To mitigate the perceptual difficulties incurred by acoustic variability in one or more sound features, here we focus on patterns of covariance among sound features. Listeners implicitly and automatically learn patterns of covariation (in some cases quite rapidly [20]), and perception often benefits when sound features respect these patterns [20–22]. Among the many examples available, here we focus on perceptual benefits from patterns of covariance involving fundamental frequency (f0) or pitch, as it is a pervasive and significant feature in speech and music alike.

Two patterns of covariation involving f0 are reviewed here. First, in speech, considerable covariance exists between f0 and formant frequencies (resonances of the talker’s vocal tract, which shift to lower or higher frequencies for longer or shorter vocal tract lengths, respectively). Across talkers, those with lower f0s tend to have longer vocal tracts and lower formant frequencies (e.g., adult cisgender men), and those with progressively higher f0s also tend to have progressively shorter vocal tracts with correspondingly higher formant frequencies (e.g., moving from adult cisgender men to adult cisgender women to children). Across all of the vowels in the Hillenbrand et al. [23] database, a vowel’s f0 and its first three formant frequencies shared an average of 77% of their variability [24]. Assmann and Nearey [25] revealed that vowels spoken by different talkers were recognized more accurately when f0 and formant frequencies respected these patterns of covariance (e.g., low f0 with lower formant frequencies) compared to when vowels violated these patterns (e.g., low f0 with higher formant frequencies).

Second, in music, a variety of studies reported that pitch perception was supported by coherent variation in instrument timbre. In discrimination tasks, pitch discrimination as well as timbre discrimination markedly improved when the other dimension was varying congruently (e.g., higher pitches occurring with brighter timbres) versus incongruently (e.g., higher pitches occurring with darker timbres) [26–27]. In pitch interval estimation tasks, melodic intervals were perceived as being larger when accompanied by congruent changes in timbre (an ascending interval with the timbre moving from darker to brighter across tones) than incongruent changes (the same ascending interval with timbre moving from brighter to darker) [28–29]; see also [30–31]. Additionally, under certain conditions, musical tones with coherent pairings of pitch and brightness are rated as being more pleasant than tones with incoherent pairings [32]. The association between higher pitch heights and higher vertical positions in space (Spatial-Musical Association of Response Codes, or SMARC [33–34]) only holds when pitch and brightness are covarying [35]. However, associations between pitch and timbre might vary across instruments, given differing degrees of pitch-invariance observed in studies by McAdams and colleagues [36] and by Siedenburg and colleagues [37].

Perceptual sensitivity to patterns of covariance among sound features in speech and music is likely shaped by long-term perceptual experience and/or expertise, but studying this particular influence is dotted with challenges. While one might study children’s perception to examine effects of experience, it is difficult to distinguish poorer sensitivity to stimulus covariance from immature perceptual faculties. In adulthood, perceptual faculties are functionally if not fully mature, but listeners’ incomparable experience hearing their native language(s) far exceeds conventional definitions of perceptual expertise, thus making it difficult to study effects of experience/expertise on perceptual sensitivity to the statistical regularities therein. On the other hand, musical expertise varies widely in adult listeners, making it well-suited for this research question.

To effectively examine how musical background is related to perceptual sensitivity to pitch-timbre covariance, two shortcomings require addressing. First, pitch-timbre covariance has rarely been the focus of previous investigations, which instead examined pitch perception or pitch-timbre interference. Other studies focusing on this relationship have been primarily analytic rather than primarily perceptual [37]. However, indirect evidence is available to suggest that musical training shapes perceptual sensitivity to the natural covariance between pitch and timbre. Musical training provides a degree of resilience to pitch-timbre interference [7, 9, 15; but see 26]. Other studies have either suggested or reported that nonmusicians are more apt than musicians to confuse pitch and timbre when responding to one of those sound properties [9,38,39]. Such confusions would prove particularly challenging when pitch and timbre violate their natural covariance (e.g., accurately labeling the high pitch of a tone when its timbre is dark). This is supported by Lau and colleagues [39], who compared pitch-timbre confusions of infants, adult nonmusicians, and adult musicians. Infants (whose minimal experience with pitch-timbre covariance made them undeterred by violations of it) and musicians (whose extensive experience made them resilient to these violations) suffered far fewer pitch-timbre confusions than nonmusicians (whose intermediate amount of experience rendered them most susceptible to these violations). These examples highlight the second shortcoming of previous research: musicianship is frequently treated as a binary variable, testing listeners who classify as either (highly experienced) musicians or nonmusicians. While these extreme comparisons might be utilized to maximize the probability of finding a difference between groups, they concurrently undermine the true nature of the variable being studied. Amount of musical training is a continuous variable, and dichotomizing continuous variables introduces serious statistical challenges in general [40] and in the music perception literature specifically [41]. Further, one’s musical background is a multidimensional construct. Müllensiefen and colleagues [42] have developed a self-report assessment of what they termed ‘musical sophistication’. This questionnaire produces continuous measures of musical training, singing, auditory perceptual abilities, active engagement with music, and emotional responses to music. Using tools such as these, closer connections between musical training and/or background and music perception (and pitch-timbre covariance specifically) can be forged.

Context exists on a multitude of timescales. In many experiments, perceptual performance is subject to short-time perceptual contexts at the level of the experimental session, such as testing order and format. Specifically, while sounds whose pitch-timbre pairings violate their natural covariance are predicted to challenge perception, the extent of this challenge might be modulated by various session-level factors. This challenge could be exacerbated by testing these stimuli at the very beginning of the experiment and lessened by testing them later in the experiment (once listeners have some experience with the pitches and timbres being tested). This challenge might also be greater with sparser testing of these stimuli and lessened through extended testing. Additionally, overall sensitivity to pitch-timbre pairings might be modulated by blocked testing (completing an entire block of trials where all stimuli either obey or violate the prevailing pattern of covariance) relative to interleaved testing (both trial types are mixed together throughout the test session, more closely modeling how these sounds are encountered in everyday listening). Relevant context also exists on much longer timescales. By adulthood, listeners have had considerable passive exposure to natural covariance between pitch and timbre in both music and speech. Listeners with musical training have experience above and beyond this through their explicit practice of aural skills, and implicit exposure to a wide range of pitch and timbre combinations in performance and active listening. Measuring how shorter-timescale contexts shape perception and how they interact with longer-timescale contexts such as perceptual expertise is essential for establishing a firm foundation for understanding how perception tracks the natural covariance between musical pitch and timbre.

Here we present results from four experiments that examine perceptual sensitivity to the covariance between musical pitch and timbre. All experiments utilized a pitch labeling task, in which each trial presented an instrument playing one tone that listeners labeled as lower pitch (C4) or higher pitch (G4). Exposure and practice with feedback were provided before the main task so that musical training was not a prerequisite for success. Musical background was assessed through the musical training subscale of the Gold-MSI questionnaire [42]. Across all experiments, in accordance with the efficient coding hypothesis, responses were predicted to be faster and more accurate for pitch-timbre pairings that respected their typical covariance (Consistent condition) relative to pairings that violated it (Reversed condition). Performance was also predicted to improve with increased musical training, such that listeners with more extensive musical backgrounds would label pitches more accurately in this task. This outcome would advance the efficient coding hypothesis, demonstrating that sensitivity to statistical structure in the listening environment could be shaped by listener factors such as relevant (musical) perceptual experience.

Materials and methods

Participants.

The study was approved by the Institutional Review Board at the University of Louisville. Participants provided written informed consent and received no financial compensation for their participation.

Target sample sizes for these experiments could not be determined from existing literature, as perceptual sensitivity to pitch-timbre covariance has rarely been investigated directly. Allen and Oxenham [26] measured 20 participants’ ability to discriminate changes in f0 or spectral centroid when the other dimension was varying congruently versus incongruently. Studies that examined sensitivity to pitch-timbre covariance indirectly also had similar or smaller sample sizes [28–30]. McPherson and McDermott [27] tested a larger sample of 105 participants (retaining the data for 86) to “provide evidence for or against the null hypothesis that discrimination was the same for harmonic and inharmonic stimuli”; effects of pitch-timbre congruence were only evaluated via post hoc analysis. Across these studies, results were analyzed at the aggregated level. None of these studies analyzed trial-level data, nor did they analyze responses using mixed-effects regression models.

The appropriate sample size for these experiments was determined in two ways. First, we followed the guidelines of Brysbaert and Stevens [43], who recommended 1600 observations in each condition as a target for properly powered experiments that are being analyzed using mixed-effects regressions. As detailed below, each condition contained 40 trials in an experimental block. Therefore, testing 40 listeners would satisfy this guideline (40 listeners × 40 trials/condition = 1600 observations/condition). Second, we conducted power analyses to estimate target sample sizes. The outcome variable of interest was accuracy, as it more closely aligned with the changes in discriminability as a function of pitch-timbre congruence [23] than response time. Preliminary analyses of listeners’ responses were conducted to inform power analyses, reducing data down to the first two blocks to eliminate potential practice effects (one block apiece of Consistent stimuli and Reversed stimuli as in Experiments 1–3; in Experiment 4, which did not use block structure, all responses were included in analyses). A simplified mixed-effects generalized linear model was constructed analyzing response accuracy with only the fixed effect of condition, random slopes for condition, and random intercepts for listeners. Coefficients from this analysis were extracted to populate a simulated model where sample size was extended to 100 listeners. Power analyses were conducted using the simr [44] package in R, examining the sample size required to yield an acceptable amount of statistical power underlying reliable changes in accuracy across Consistent and Reversed stimuli. The conventional threshold of 80% power for the fixed effect of condition was achieved with sample sizes of 19 (Experiment 1), 19 (Experiment 2), 39 (Experiment 3), and 13 listeners (Experiment 4). This confirmed that testing 40 listeners provided more than sufficient statistical power for the primary outcome variable.

In all, 165 undergraduate students from the Department of Psychological and Brain Sciences at the University of Louisville were tested across four experiments. All reported no known hearing impairments and received course credit in exchange for their participation. No one participated in multiple experiments. Experiment 1 tested 40 undergraduate students (10 men, 28 women, 1 non-binary, 1 not reporting; mean age = 20.79 years, S.D. = 3.47, with 1 not reporting). Experiment 2 tested 40 undergraduate students (5 men, 32 women, 1 non-binary, 1 not reporting; mean age = 19.58 years, S.D. = 3.74, with 1 not reporting). Experiment 3 tested 41 undergraduate students (9 men, 29 women, 3 not reporting; mean age = 20.69 years, S.D. = 5.13, with 1 not reporting). Experiment 4 tested 44 undergraduate students (3 men, 36 women, 1 non-binary, 4 not reporting; mean age = 19.50 years, S.D. = 3.30, with 4 not reporting).

Statistical analyses.

All data were collected from February 19, 2023 to January 30, 2024. Data were analyzed using R version 4.4.0 [45] and the packages tidyverse version 2.0.0 [46], lme4 version 1.1-35.4 [47], lmerTest version 3.1-3 [48], and emmeans version 1.10.3 [49]. Primary statistical analyses included linear mixed-effects regression models to predict response time on each trial (using the lme4 package) and generalized linear mixed-effects regression models to predict accuracy on each trial (again using the lme4 package). Additional packages that were utilized included ggplot2 version 3.5.1 [50], rio version 1.2.0 [51], cowplot version 1.1.3 [52], ggpubr version 0.6.0 [53], gridExtra version 2.3 [54], ez version 4.4-0 [55], and rstatix version 0.7.2 [56]. This study’s design and its analysis were not pre-registered. Stimuli, deidentified processed data, and analysis scripts for all main experiments are available in an Open Science Framework repository, https://osf.io/fpj8q/.

Stimuli.

Stimuli were recordings of tones played by the trumpet, oboe, trombone, and tuba from the McGill University Master Samples database [57]. These instruments were selected to span a wide range of spectral envelopes (Fig 1a). Recordings of each instrument playing C4 (mean fundamental frequency across the four instruments = 261.33 Hz) and G4 (mean fundamental frequency = 392.14 Hz) were chosen. This large interval was chosen so that notes could be labeled relatively easily by participants with little to no musical training [cf. 11]. These notes occur near the upper end of the registers played by the darker-timbre instruments and toward the lower end of the registers played by the brighter-timbre instruments. Notes were edited to begin at the natural onset of the instrument and terminate at a duration of 1000 ms. A 2-ms linear offset ramp was applied to each recording in MATLAB. Stimuli were then all set to a fixed root mean square amplitude. All experiments used these stimuli.
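The editing steps described above (truncation to 1000 ms, a 2-ms linear offset ramp, and equalization of root mean square amplitude) can be illustrated with a minimal Python sketch. This is not the authors' MATLAB code. The function names, the 44.1 kHz sample rate, the target RMS value of 0.05, and the synthetic sine tone standing in for a recording are all assumptions made for illustration.

```python
import math

def rms(samples):
    """Root mean square amplitude of a list of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def prepare_tone(samples, sample_rate, duration_ms=1000.0,
                 ramp_ms=2.0, target_rms=0.05):
    """Truncate, apply a linear offset ramp, then scale to a fixed RMS.

    target_rms is a hypothetical value; the article does not report
    the amplitude that was used.
    """
    # Keep audio from the onset up to the requested duration.
    n_keep = min(len(samples), int(round(sample_rate * duration_ms / 1000.0)))
    tone = list(samples[:n_keep])
    if not tone:
        return tone
    # Linear offset ramp: gain falls to zero over the final ramp_ms.
    n_ramp = min(n_keep, int(round(sample_rate * ramp_ms / 1000.0)))
    for i in range(n_ramp):
        gain = 1.0 - (i + 1) / n_ramp
        tone[n_keep - n_ramp + i] *= gain
    # Scale so that every stimulus shares the same RMS amplitude.
    current = rms(tone)
    if current == 0:
        return tone
    scale = target_rms / current
    return [x * scale for x in tone]

# Synthetic 1.5-s sine at the mean C4 fundamental reported above,
# used here only as a stand-in for an instrument recording.
sample_rate = 44100  # assumed
raw = [math.sin(2 * math.pi * 261.33 * n / sample_rate)
       for n in range(int(1.5 * sample_rate))]
tone = prepare_tone(raw, sample_rate)
```

Applying the ramp before the amplitude scaling follows the order in which the steps are described; reversing the order would change the final RMS very slightly.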

Fig 1. Experimental stimuli and design.

(A) Long-term average spectra for the target instruments, arranged in order of decreasing amplitude of high-frequency energy content: trumpet (pink), oboe (yellow), trombone (green), and tuba (blue). Spectra of instruments playing G4 are depicted in the top panel; spectra of instruments playing C4 are depicted in the bottom panel. (B) Pitch-timbre pairings in Consistent blocks: instruments with darker timbres (trombone, tuba) played the lower note (C4), and instruments with brighter timbres (trumpet, oboe) played the higher note (G4). (C) Pitch-timbre pairings in Reversed blocks: instruments with darker timbres played the higher note (G4), and instruments with brighter timbres played the lower note (C4). Musical instrument images courtesy of iStock photos.

https://doi.org/10.1371/journal.pone.0328490.g001

Procedure.

Participants completed the experiment on their personal computers using the online testing platform Gorilla [58]. After the acquisition of informed consent, participants were asked to adjust the computer volume to a comfortable listening level for a sample burst of noise. The full experimental session, which consisted of six steps detailed below, took 20–30 minutes to complete. All experiments used this same procedure except for the construction of the main task, which is detailed along with each experiment.

First, participants completed a headphone screen to help standardize sound presentation and ensure stimuli were audible. The first screen employed was a tone level discrimination task [59]. On each trial, participants reported which of three tones was the quietest. The correct answer was the tone presented at −6 dB relative to the other two, which is easily distinguishable when listening via headphones. There was also a plausible but incorrect answer of a tone presented 180 degrees out of phase across stereo channels, which sounds quietest over speakers due to destructive interference. Participants who passed the headphone screen had at least five correct responses out of the six trials, within two attempts. Participants who failed the tone level headphone screen twice proceeded to a second headphone screen based on dichotic pitch [60]. In this test, listeners identified which of three noise segments had a faint pitch to it. This pitch, produced by different interaural phase relations in a narrow range of frequencies across the two ears, is detectable over headphones but not over speakers. All participants were retained at this step regardless of whether they passed or failed the headphone screener, since these headphone checks are meant to encourage headphone usage but none of the experimental sounds were dependent upon dichotic sound presentation. Additionally, the authors of these headphone screens acknowledge the possibility of them yielding false negatives [59–60]. Supplementary analyses were conducted after removing responses from listeners who did not pass the headphone screen (n = 5 in Experiment 1, n = 6 in Experiment 2, n = 7 in Experiment 3, n = 5 in Experiment 4). All major patterns of results remained intact; results from these analyses are available in the Open Science Framework repository for this research: https://osf.io/fpj8q/.

Second, participants heard a plucked violin playing the notes C4 and G4 along with the verbal labels of ‘low’ and ‘high’. This was to familiarize participants with the pitches that they were to label as being ‘low’ and ‘high.’ The plucked violin was chosen for presentation in exposure and practice, as this instrument was not heard during the main task. Participants could click a button on the screen to hear these examples up to five times each before proceeding.

Third, participants practiced labeling the plucked violin tones as ‘low’ or ‘high’ pitch 10 times each in random orders (20 total presentations). Participants responded by pressing a corresponding keyboard letter (“e” for low or “i” for high), and feedback was provided on each trial. Participants were required to achieve 90% accuracy in this task and could repeat it two additional times if necessary. Only six participants repeated the practice block, after which point they met the performance criterion. It bears noting that instrument timbre was fixed during practice. This was intended to draw listeners’ attention to the pitch dimension, which was essential for their performance in the main experiment. Listeners were not given any instructions or formal training on incongruent pairings of pitch and timbre, as doing so would introduce the risk of artificially inflating performance on those stimuli in the main experiment. At the same time, this methodological decision also presents the risk of poorer performance in the main experiment, whether due to inconsistent responding (e.g., unreliable use of pitch and/or timbre information for making responses) or confusion (e.g., responding to timbre instead of to pitch).

Fourth, the main experiment was patterned after paradigms introduced by Idemaru and Holt [61–62]. On each trial, listeners heard one instrument play one tone. As in practice, participants responded by pressing a keyboard key as quickly and accurately as possible to report whether it was the low tone (C4) or the high tone (G4). The instrument’s pitch and timbre were paired so that they were either congruent with their natural pattern of covariance (Consistent trials: darker timbres of trombone or tuba playing the lower pitch C4, brighter timbres of oboe or trumpet playing the higher pitch G4) or incongruent with this covariance (Reversed trials: darker timbres of trombone or tuba playing the higher pitch G4, brighter timbres of oboe or trumpet playing the lower pitch C4). Most experiments presented stimuli in a blocked manner, where an entire block of 80 trials presented either Consistent sounds or Reversed sounds exclusively (Experiments 1–3; after Idemaru and Holt [61–62]). In Experiment 4, blocks contained both Consistent sounds and Reversed sounds, which could alternate from trial to trial.

Fifth, participants then completed a brief adaptive staircase assessing pitch discrimination abilities [63]. Not all participants successfully completed the staircase task (8 reversals within a maximum of 75 trials). Rather than conduct underpowered analyses due to this data loss, these results are not discussed further.

Sixth and lastly, participants provided responses to the seven-question musical training subscale of the Gold-MSI [42], as well as demographic items (age, gender, hearing health, years of education, income, handedness, native language, tonal language competency, and if they had any technical issues regarding sound presentation).

Experiment 1

Design

Experiment 1 consisted of three blocks (Figs 1b–1c) patterned after paradigms introduced by Idemaru and Holt [61–62]. In the first (Consistent) block, listeners heard instruments playing pitches that respected their natural pattern of covariance. In the second (Reversed) block, listeners heard the same instruments but now playing pitches incongruent with their relative timbre. Finally, the third (Consistent) block was a repetition of the first block. Each block presented 80 trials (four instruments playing their yoked pitch x 20 repetitions). No feedback was provided. Throughout, participants responded by pressing a keyboard key as quickly and as accurately as possible to report whether they heard the low tone or the high tone.
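The block structure described above can be sketched as a short Python example that builds the three trial lists. This is an illustration only, not the Gorilla implementation. The function and field names are hypothetical, the response keys follow the mapping given for the practice task, and the randomization of trial order within each block is an assumption, since the article does not state how trials in the main task were ordered.

```python
import random

DARK = ["trombone", "tuba"]      # darker timbres
BRIGHT = ["trumpet", "oboe"]     # brighter timbres
KEY = {"C4": "e", "G4": "i"}     # "e" = low, "i" = high

def pairings(condition):
    """Map each instrument to its yoked pitch for one condition."""
    if condition == "Consistent":
        low, high = DARK, BRIGHT
    elif condition == "Reversed":
        low, high = BRIGHT, DARK
    else:
        raise ValueError(f"unknown condition: {condition}")
    mapping = {inst: "C4" for inst in low}
    mapping.update({inst: "G4" for inst in high})
    return mapping

def make_block(condition, repetitions=20, rng=None):
    """Four instruments x 20 repetitions = 80 trials per block."""
    rng = rng or random.Random()
    trials = [
        {"condition": condition, "instrument": inst,
         "pitch": pitch, "correct_key": KEY[pitch]}
        for inst, pitch in pairings(condition).items()
        for _ in range(repetitions)
    ]
    rng.shuffle(trials)  # assumed: random order within a block
    return trials

def make_experiment1(seed=0):
    """Consistent, Reversed, then Consistent again."""
    rng = random.Random(seed)
    order = ("Consistent", "Reversed", "Consistent")
    return [make_block(condition, rng=rng) for condition in order]

blocks = make_experiment1()
```

Because each instrument is yoked to a single pitch within a block, timbre alone predicts the correct response in that block; only the direction of the prediction differs between Consistent and Reversed blocks.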

Results

Response time.

As is customary for response time analyses in speeded labeling tasks [4,6,11], only correct responses were retained (removing 1635 trials, or 17.03% of the total data). Additionally, all response times faster than 200 ms were removed, as these responses were too short to reflect the time needed to hear a stimulus and plan a corresponding motor response (e.g., [64]; removing 81 trials, or 1.02% of the remaining data). Distributions of response times were positively skewed, so they were log-transformed to achieve normality. Finally, mean response time was calculated for each participant, and response times exceeding three standard deviations from that listener’s mean were removed (removing 106 trials, or 0.68% of the remaining data).
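The exclusion steps above can be summarized in a minimal Python sketch. It is not the authors' R script. The record fields (`participant`, `rt_ms`, `correct`) are hypothetical, the base-10 logarithm is an assumption (the base is not stated), and the three-standard-deviation criterion is assumed to be applied to the log-transformed response times.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def clean_response_times(trials, min_rt_ms=200.0, sd_cutoff=3.0):
    """Apply the three exclusion steps to a list of trial records.

    Each record is a dict with keys 'participant', 'rt_ms', 'correct'.
    Returns the retained records with an added 'log_rt' field.
    """
    # Step 1: keep correct responses only.
    kept = [t for t in trials if t["correct"]]
    # Step 2: drop responses faster than the minimum plausible time.
    kept = [t for t in kept if t["rt_ms"] >= min_rt_ms]
    # Step 3: log-transform, then trim per participant.
    by_participant = defaultdict(list)
    for t in kept:
        row = dict(t, log_rt=math.log10(t["rt_ms"]))  # base 10 assumed
        by_participant[t["participant"]].append(row)
    cleaned = []
    for rows in by_participant.values():
        values = [r["log_rt"] for r in rows]
        if len(values) < 2:
            cleaned.extend(rows)  # SD undefined for a single trial
            continue
        m, s = mean(values), stdev(values)
        cleaned.extend(r for r in rows
                       if abs(r["log_rt"] - m) <= sd_cutoff * s)
    return cleaned
```

The order of the steps matters: each percentage reported above is expressed relative to the data remaining after the preceding step.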

Linear mixed-effects modeling was used to predict trial-level response times. Fixed effects in this model included block (factor-coded, with the first [Consistent] block serving as the default), Gold-MSI musical training score (mean-centered), and their interaction. The model building process began with a base model with these fixed effects and random intercepts for participants. Random effects were added one at a time and tested via χ² goodness-of-fit tests. If the added term explained significantly more variance, it was retained. This process continued until all random effects of interest were assessed, noting that the final model also had to successfully converge. The final random effects structure included random slopes for block and random intercepts for participants. Model coefficients are listed in Table 1A.

Table 1. Mixed-effects modeling results for Experiment 1 (n.b., since models shared fixed effects architectures, results from both models are presented side-by-side for ease of comparison). Results from the linear mixed-effects model analyzing the logarithm of response times are listed at left (A); results from the generalized linear mixed-effects model analyzing response accuracy are listed at right (B). Block 1 (Consistent) was the default level of the factor Block, so all fixed effects are in reference to Block 1.

https://doi.org/10.1371/journal.pone.0328490.t001

The first analysis examined block-level context. Relative to the first Consistent block (estimated marginal mean response time, as calculated using the emmeans package [49] = 829.18 ms, SE = 43.54), response times increased numerically but not significantly in the Reversed block (estimated marginal mean response time = 851.10 ms, SE = 48.13), but decreased significantly in the final Consistent block (estimated marginal mean response time = 686.99 ms, SE = 37.40). An additional contrast was tested by recoding the block factor using the Reversed block as the default level and rerunning the model. Mean response times significantly decreased from the Reversed block to the final Consistent block (β = −0.093, t = −5.19, p < .0001).
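Because the model was fit to log-transformed response times, a coefficient can be read as a multiplicative change on the millisecond scale. The sketch below back-transforms the Reversed-to-final-Consistent contrast reported above. It assumes a base-10 logarithm; the base is not stated in the text, although the reported estimated marginal means are consistent with it. The function name is hypothetical.

```python
def ratio_from_log10_coefficient(beta):
    """Multiplicative change in response time implied by a
    coefficient on the log10 scale (base 10 is an assumption)."""
    return 10 ** beta

# Reported contrast and Reversed-block estimated marginal mean.
beta_reversed_to_final = -0.093
reversed_mean_ms = 851.10

ratio = ratio_from_log10_coefficient(beta_reversed_to_final)
predicted_final_ms = reversed_mean_ms * ratio
percent_faster = (1 - ratio) * 100
```

Under this assumption the contrast corresponds to responses roughly 19% faster in the final Consistent block, and the predicted mean falls within about 1 ms of the reported 686.99 ms.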

The second analysis examined long-term context. Due to a programming error, three participants did not receive all the questions on the Musical Training subscale of the Gold-MSI [42]. One additional participant completed all tasks through the main experiment but did not complete the Gold-MSI questionnaire. For the remaining 36 participants, responses to these seven-point questions were summed according to the scoring guidelines. Subscale scores can span a range of 7 (lowest musical training score) to 49 (highest musical training score). For Experiment 1, the mean score was 20.80 (SD = 10.87), reflecting a wide range of musical training experience in this listener sample.
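As a rough illustration of the scoring described above, the following Python sketch sums seven responses on a seven-point scale and mean-centers the resulting scores before they enter a model as a predictor. The function names are hypothetical, and the sketch assumes that any reverse-keyed items have already been recoded according to the published scoring guidelines.

```python
from statistics import mean

def subscale_score(item_responses):
    """Sum seven responses on a 1-7 scale (possible range: 7-49).

    Assumes reverse-keyed items were recoded beforehand.
    """
    if len(item_responses) != 7:
        raise ValueError("the subscale has seven items")
    if any(not 1 <= r <= 7 for r in item_responses):
        raise ValueError("responses must lie between 1 and 7")
    return sum(item_responses)

def mean_center(scores):
    """Subtract the sample mean so that 0 = average training."""
    m = mean(scores)
    return [s - m for s in scores]

# Hypothetical responses from three listeners.
scores = [subscale_score(r) for r in (
    [1, 1, 1, 1, 1, 1, 1],
    [4, 3, 5, 2, 4, 3, 4],
    [7, 7, 7, 7, 7, 7, 7],
)]
centered = mean_center(scores)
```

Mean-centering leaves the slope for musical training unchanged but makes the block effects interpretable as effects for a listener with an average training score.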

Gold-MSI scores for each participant are illustrated in Fig 2a, with brighter colors indicating higher scores. These scores were not a significant predictor of trial-level response times in any block (all t > −1.62, p > .11).

Fig 2. Results from Experiment 1.

(A) Each dot represents the mean response time for a given listener in that experimental condition; each listener’s means across conditions are connected by grey lines. The estimated marginal mean response times for each condition are depicted using black squares, with error bars denoting one standard error. Dots are colored according to each listener’s score on the musical training subscale of the Gold-MSI, with brighter (toward green) colors indicating higher scores and darker (toward black) colors indicating lower scores (see inset legend). Grey dots indicate listeners who did not complete the Gold-MSI. (B) Each dot represents the mean accuracy for a given listener in that experimental condition; each listener’s means across conditions are connected by grey lines. The estimated marginal mean accuracy for each condition is depicted using black squares, with error bars denoting one standard error. Dots are again colored according to each listener’s score on the musical training subscale of the Gold-MSI (the same as in panel A). Asterisks denote a statistically significant influence of musical training scores on performance in that block (*p < .05, **p < .01, ***p < .001).

https://doi.org/10.1371/journal.pone.0328490.g002

Accuracy.

Unlike response time analyses, both correct and incorrect responses were included in accuracy analyses. As above, all response times faster than 200 ms were removed (removing 168 trials, or 1.75% of the remaining data). A generalized linear mixed-effects model was used to predict correct responses on a trial-by-trial basis. Fixed effects in the accuracy model matched those in the response time model detailed above: block (factor-coded, with the first Consistent block serving as the default), Gold-MSI musical training score (mean-centered), and their interaction. The random effects structure was built iteratively as detailed above, arriving at a final structure of random slopes for block and random intercepts for participants. Model coefficients are listed in Table 1B.
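One way to write out the model structure described above is given below. This is a sketch under assumptions: a logit link is assumed (the text specifies only a generalized linear mixed-effects model of correct responses), and all symbols are introduced here for illustration. $R_{ij}$ and $F_{ij}$ are dummy indicators for the Reversed and final Consistent blocks on trial $j$ of participant $i$ (the first Consistent block is the reference level), and $M_i$ is the mean-centered Gold-MSI musical training score.

```latex
\operatorname{logit} \Pr(\text{correct}_{ij} = 1)
  = \beta_0 + \beta_1 R_{ij} + \beta_2 F_{ij} + \beta_3 M_i
  + \beta_4 R_{ij} M_i + \beta_5 F_{ij} M_i
  + u_{0i} + u_{1i} R_{ij} + u_{2i} F_{ij},
\qquad
(u_{0i}, u_{1i}, u_{2i})^{\top} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})
```

Under this coding, $\beta_3$ is the musical training slope in the first Consistent block, while $\beta_3 + \beta_4$ and $\beta_3 + \beta_5$ are the slopes in the Reversed and final Consistent blocks. The interaction terms therefore test whether the slope differs from that of the reference block.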

The first analysis examined block-level context. Relative to the first Consistent block (estimated marginal mean accuracy = 0.98, SE = 0.01), accuracy decreased sharply in the Reversed block (estimated marginal mean accuracy = 0.79, SE = 0.07) but did not differ across the first and final Consistent blocks (estimated marginal mean accuracy = 0.98, SE = 0.01). Additional contrasts were tested by recoding the block factor using the Reversed block as the default level and rerunning the model. Accuracy significantly increased from the Reversed block to the final Consistent block (β = 2.30, Z = 5.04, p < .0001).
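To give a sense of scale for the reported contrast, the sketch below converts the coefficient of 2.30 into an implied accuracy. It assumes the model uses a logit link, which this section does not state, and it treats the estimated marginal mean for the Reversed block as the starting point; both are simplifications for illustration rather than the authors' procedure.

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


reversed_accuracy = 0.79   # estimated marginal mean, Reversed block
contrast = 2.30            # reported coefficient, Reversed -> final Consistent

# Add the contrast on the log-odds scale, then map back to a probability.
implied_final_accuracy = inv_logit(logit(reversed_accuracy) + contrast)
print(round(implied_final_accuracy, 3))
```

The implied value is roughly 0.97, close to the estimated marginal mean of 0.98 reported for the final Consistent block. Exact agreement is not expected, because the reported means are rounded.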

The second analysis examined longer-term context. Gold-MSI scores for each participant are illustrated in Fig 2b, with brighter colors indicating higher scores on the musical training subscale. These scores were significant predictors of response accuracy in each block, with higher musical training scores corresponding to more accurate performance (first Consistent block: β = 0.05, Z = 2.14, p < .05; Reversed block: β = 0.17, Z = 4.16, p < .0001; final Consistent block: β = 0.10, Z = 3.55, p < .001). Interactions between block and Gold-MSI scores indicate that this relationship contributed significantly more to accuracy in the Reversed block than the first Consistent block.

Discussion.

Listeners exhibited clear perceptual sensitivity to typical patterns of covariation between musical pitch and timbre. This finding coheres with the efficient coding hypothesis [18–19], as performance indicates listeners have learned this prevailing pattern of covariation in the acoustic environment and that performance worsens when sounds violate this relationship. Importantly, these patterns of performance were correlated with listeners’ musical training. This marks an important advance for the efficient coding hypothesis, as its previous applications to auditory perception frequently treated sensitivity to statistical structure in the sensory environment as uniform. Here, this sensitivity was related to listener factors such as relevant perceptual experience. Listeners reporting more musical training might have had more experience hearing reversed pairings of pitch and timbre (and thus performed more accurately on these trials) than listeners reporting less musical training (and thus performed less accurately on these trials). While the relationship between musical training and task performance is correlational and not causal, it nevertheless highlights graded sensitivity to stimulus statistical structure across a listener sample. These points are revisited in the General Discussion.

Perceptual sensitivity to pitch-timbre covariance was affected by context both at the immediate level across blocks and linked to the longer context of participants’ musical training. Here we unify the response time and accuracy results and discuss the shorter block context followed by the longer experience context. At the block level, the first (Consistent) block established baseline levels of performance in terms of mean response time and (very high) mean accuracy. Despite having one block’s worth of experience in the task, in the second block, accuracy dropped sharply and response times increased numerically but not significantly. This indicates the challenge of labeling pitches when timbre is varying in a manner that violates their covariance (but, as the next paragraph will discuss, this challenge was not equal for all participants). These results build upon previous studies of pitch-timbre interference [7–10,15]: timbre varied throughout the pitch labeling task, but only when timbre varied in contrast with its typical covariance with pitch was perception challenged. Finally, when stimuli returned to their common pairings of pitch and timbre in the final block, response times were fastest overall and accuracy returned to ceiling levels.

The longer timescale of perceptual context considered was the perceptual expertise from musical training. Numerous studies have demonstrated superior pitch perception for listeners with considerable musical training as compared to listeners without musical training (e.g., [65–67]). Here, listeners were recruited irrespective of their musical backgrounds, which varied continuously as assessed by the musical training subscale of the Gold-MSI [42]. Gold-MSI scores were not predictive of response times, but were significant predictors of accuracy in every block, with a stronger association in the Reversed block than in the first Consistent block. Gold-MSI scores also shed light on the mean accuracies in the Reversed block being broadly split into two groups: higher performance (>80%) or poorer performance (≤50%; Fig 2b). Many of the participants with more musical training performed better in this block (mean Gold-MSI score for listeners in the higher-performing group = 26.67) whereas many of the participants with less-to-no musical training performed more poorly (mean Gold-MSI score for listeners in the poorer-performing group = 15.00; Welch’s two-sample t-test: t(28.71) = 3.78, p < .001). This finding supports previous results where listeners without musical training were more apt to confuse pitch and timbre than listeners with musical training [9,38,39]. While inconsistent use of pitch versus brightness information in responses would result in near-chance accuracy (50%), several participants performed markedly below chance, which can only be achieved by labeling the timbre of each note as low or high (when it was in fact dark or bright) rather than its pitch, as instructed. This confusion was not due to an inability to perform the task, as Fig 2b illustrates good-to-excellent performance for these participants in the Consistent blocks preceding and following the Reversed block. The significant influences of Gold-MSI scores on accuracy in the Consistent blocks suggest that the benefits of musical training are relatively global for this task, and not limited to cases where pitch and timbre are heard in combinations that violate their typical covariance.
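The non-integer degrees of freedom in the group comparison above come from the Welch–Satterthwaite approximation, which does not assume equal variances across groups. A minimal sketch of that computation follows. The function name is illustrative and the two score lists are invented for demonstration; they are not the participants' data.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    # Squared standard error contributed by each group.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical Gold-MSI musical training scores (range 7-49), NOT study data.
higher_performing = [31, 22, 40, 18, 27, 35, 24]
poorer_performing = [12, 9, 20, 15, 7, 19]

t_stat, df = welch_t(higher_performing, poorer_performing)
print(f"t({df:.2f}) = {t_stat:.2f}")
```

Because the degrees of freedom depend on the sample variances as well as the group sizes, they generally fall between the smaller group's n − 1 and the pooled n₁ + n₂ − 2.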

In addition to adopting its experimental paradigm, the present results bear other similarities to dimension-based statistical learning [61,62,68]. In those studies, listeners categorized speech sounds based on whether two acoustic cues were presented in their typical (canonical) or atypical (reversed) combinations. Both in dimension-based statistical learning and here, listeners’ responses were primarily reliant on a dominant acoustic property within the pattern of covariance (here, pitch, as it was highly germane to the pitch labeling task) while also exhibiting sensitivity to a secondary property (timbre). Listeners exhibited rapid perceptual adjustments when the relationship between stimulus properties changed (moving from the Consistent block to the Reversed block or vice versa). However, key differences across paradigms must also be noted. First, listeners produced objectively correct or incorrect responses in the pitch labeling task, whereas dimension-based statistical learning assesses categorization where accuracy does not necessarily apply, instead measuring perceptual cue weights or how often a particular response was provided. Second, while both paradigms involve presenting stimuli with covarying stimulus dimensions, the relationship between the dimensions differs. Here, pitch is essential to the perceptual task and timbre variation is technically irrelevant; in dimension-based statistical learning, the acoustic redundancy of speech results in either acoustic dimension being sufficient and neither being necessary for producing a response.

While poorer performance in the Reversed block is being attributed to those stimuli violating typical patterns of pitch-timbre covariance, an alternative explanation is possible. Rather than following natural signal statistics, perception could have followed a simpler pattern of superior performance on stimuli heard in the first block, then poorer performance on new stimuli heard in the second block (when pitch-timbre pairings were reversed). One way to disentangle these interpretations is to reverse the order of testing blocks, as has been done in studies of dimension-based statistical learning [61,62,68]. If perception is following the natural stimulus statistics of pitch-timbre covariance, then patterns of performance should also invert, being poorer in the first and final blocks of the experiment (Reversed) relative to the second block (Consistent). If perception is instead following low-level stimulus order effects, then the pattern of results in Experiment 1 should be replicated (good performance followed by poor performance then a return to good performance). This was one of the principal motivations behind conducting Experiment 2.

A second motivation for inverting the order of test blocks was to examine performance on a new timescale of perceptual context. Experiment 2 maintains the same opportunities to examine perception on block-level and long-term timescales of context while also examining performance at the level of the experimental session more broadly. In Experiment 1, perception of Reversed stimuli in the second block was influenced not only by their (violation of) stimulus statistics but also by listeners having already labeled Consistent stimuli in the first block. Testing Consistent and Reversed blocks in a different order provides an examination of how performance varies as a function of session context (that is, being tested first, before hearing any other stimuli, versus being tested second, after having just heard the first block of stimuli as in Experiment 1).

In Experiment 2, block order was changed such that Reversed stimuli were tested in the first and third blocks while Consistent stimuli were tested in the second block only. As in Experiment 1, the base prediction of superior pitch labeling for Consistent stimuli still holds. Because Experiment 2 includes twice as much testing with Reversed stimuli, performance on these trials is predicted to significantly improve in the third block relative to the first block. Two possibilities exist for how performance with Reversed stimuli in Experiment 2 will compare to that of Experiment 1. If Experiment 1 performance was buoyed by task practice effects, then performance in the first Reversed block of Experiment 2 will be comparatively poorer. Conversely, if testing Consistent stimuli first in Experiment 1 primed listeners on typical pitch-timbre covariance and thus suppressed their performance with Reversed stimuli in the second block, then performance in the first Reversed block of Experiment 2 will improve. As both possibilities are plausible, the results of Experiment 2 will reveal which has more explanatory power.

Experiment 2

Procedure

The procedure matched that of Experiment 1 but with one change. In Experiment 1, blocks in the main task were tested in the order Consistent – Reversed – Consistent. In Experiment 2, this order was inverted, testing blocks in the order Reversed – Consistent – Reversed.