Neural modelling of the encoding of fast frequency modulation

Frequency modulation (FM) is a basic constituent of vocalisation in many animals as well as in humans. In human speech, short rising and falling FM-sweeps of around 50 ms duration, called formant transitions, characterise individual speech sounds. There are two representations of FM in the ascending auditory pathway: a spectral representation, holding the instantaneous frequency of the stimuli; and a sweep representation, consisting of neurons that respond selectively to FM direction. To date, computational models use feedforward mechanisms to explain FM encoding. However, from neuroanatomy we know that there are massive feedback projections in the auditory pathway. Here, we found that a classical FM-sweep perceptual effect, the sweep pitch shift, cannot be explained by standard feedforward processing models. We hypothesised that the sweep pitch shift is caused by a predictive feedback mechanism. To test this hypothesis, we developed a novel model of FM encoding incorporating a predictive interaction between the sweep and the spectral representation. The model was designed to encode sweeps of the duration, modulation rate, and modulation shape of formant transitions. It fully accounted for experimental data that we acquired in a perceptual experiment with human participants as well as previously published experimental results. We also designed a new class of stimuli for a second perceptual experiment to further validate the model. Combined, our results indicate that predictive interaction between the frequency encoding and direction encoding neural representations plays an important role in the neural processing of FM. In the brain, this mechanism is likely to occur at early stages of the processing hierarchy.


Introduction
Frequency modulation (FM) is a basic acoustic feature of animal vocalisation, human speech, and music. In human speech, consonants preceding and following a vowel can be acoustically characterised by formant transitions: a series of simultaneous fast FM sinusoids of around 50 ms duration that start or finish in the frequencies characterising the vowel [1]. At all stages of the ascending auditory pathway, FM is represented along the tonotopic axis in a spectral representation that encodes the instantaneous frequency of the stimuli [2]. Individual neurons at higher levels of the processing hierarchy (inferior colliculus [3][4][5], medial geniculate body [6,7], and auditory cortex [8][9][10][11]) also encode FM direction and FM rate, by responding selectively to certain rates and directions. We call this latter, more abstract representation the sweep representation.
Despite the massive feedback projections that characterise the auditory pathway [12,13], computational models to date use only feedforward mechanisms to explain FM encoding [14]. Given the importance of high-order predictive elements in the optimisation of speech (e.g., [15]) and FM [16] recognition, descending projections are likely to play an important role in the encoding of fast FM-sweeps in the auditory system. However, to date there is no comprehensive model of such fast FM encoding that incorporates both the sweep and the spectral representations and describes the potential sweep-to-spectral feedback mechanisms active during the processing of FM sounds. The aim of the present study was to develop such a model, with a focus on FM-sweeps of the duration and frequency span of formant transitions.
To do that, we harnessed a classical behavioural effect from psychoacoustics first reported around 60 years ago [17], which we will refer to as the sweep pitch shift. In the original experiment, participants listened to fast rising and falling FM-sweeps. The authors discovered that the participants judged up sweeps as eliciting a higher pitch than down sweeps with the same average fundamental frequency. These findings were later replicated [18,19], although a reliable quantitative assessment of the phenomenon using stimuli with a controlled spectrum is lacking. To explain the effect, d'Alessandro and colleagues proposed a phenomenological model assuming that the pitch of a sweep is integrated using a fixed-size window from the instantaneous frequency of the stimulus across time [20,21]. Due to the leaky memory of the integration, this process naturally favours the latest frequencies of the sweep, explaining the sweep pitch shift. However, the authors found that different integration weights were necessary to explain different partitions of their data, indicating that the phenomenological model is not a parsimonious explanation of the sweep pitch shift.
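The leaky-integration idea can be illustrated with a minimal sketch. It is not the implementation of [20,21]: the exponential recency kernel, the time constant `tau_ms`, and the linear sweep shape are assumptions chosen only to show why such an integrator favours the final frequencies of a sweep.

```python
import math

def leaky_pitch(f_start, f_end, duration_ms=40.0, tau_ms=20.0, dt_ms=0.1):
    """Recency-weighted average of a linear sweep's instantaneous frequency.

    Samples closer to the end of the sweep receive exponentially larger
    weights, mimicking a leaky memory. tau_ms is a hypothetical constant.
    """
    n = int(round(duration_ms / dt_ms))
    num = 0.0
    den = 0.0
    for k in range(n + 1):
        t = k * dt_ms
        f = f_start + (f_end - f_start) * t / duration_ms  # assumed linear sweep
        w = math.exp(-(duration_ms - t) / tau_ms)           # recency weight
        num += w * f
        den += w
    return num / den

# Two sweeps with the same average frequency (1200 Hz) and opposite direction.
up = leaky_pitch(1000.0, 1400.0)
down = leaky_pitch(1400.0, 1000.0)
print(up - 1200.0, down - 1200.0)
```

In this sketch the up sweep is pulled above the average frequency and the down sweep below it by the same absolute amount, so a fixed kernel of this kind cannot by itself produce an up/down asymmetry.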
Using the classical behavioural effect we approached the development of a comprehensive FM-encoding model in three steps. First, we re-examined and quantified the sweep pitch shift in a behavioural experiment, and tested whether the experimental data could be explained by existing computational models [22][23][24]. We found that mechanistic models of pitch processing that attempt to describe the circuitry underlying perception rather than perceptual phenomena [25] could not explain the sweep pitch shift. Current models of FM encoding [14] consider a static representation of spectral information, and thus they predict that the sweep pitch shift would not occur. Thus neither existing models of pitch processing nor FM-encoding could explain the sweep pitch shift. In a second step we built a hierarchical model motivated by the hypothesis that the sweep pitch shift results from the modulation exerted by feedback projections between the sweep and spectral representations. The feedforward components of the model were based on results of previous studies on FM direction selectivity and included processing of instantaneous frequency and processing of FM direction [10,14,26]. The feedback architecture was based on generative hierarchical models and predictive coding [27,28] and informed by the human psychophysics results from the first part of the study. In the third and last step, we used a new set of stimuli termed sweep trains to further validate the model. These stimuli, consisting of a concatenation of five FM-sweeps, preserve the same acoustical features of the original FM-sweeps but elicited different dynamics in the feedback system of the model than their single-sweep counterpart. The ability of the model to predict the pitch elicited by these new stimuli illustrated that the feedback mechanisms proposed in this work, and not bottom-up acoustical features of the stimuli, are the driver of the sweep pitch shift.

The sweep pitch shift revisited
First, we re-examined and quantified the sweep pitch shift, measured as the difference between the perceived pitch and the average frequency of the sweep: Δp = f_perceived − f̄. Eight participants matched a total of 30 fast FM sweeps with frequency spans Δf ∈ [−600, 600] Hz and 3 average frequencies f̄ ∈ {900, 1200, 1500} Hz (see Methods). Each sweep had a duration of 40 ms and was preceded and followed by 5 ms of constant frequency. The participants' task was to match each sweep to a pure tone, whose frequency was used to determine the elicited pitch of the sweep.
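The stimulus layout and the definition of Δp can be sketched as follows. This is an illustrative reconstruction, not the code used to generate the experimental stimuli: the sampling rate `fs`, the linear modulation shape, and the function names are assumptions.

```python
import math

def sweep_waveform(f_mean, delta_f, fs=48000, sweep_ms=40.0, flank_ms=5.0):
    """Return (samples, instantaneous frequencies) for one sweep stimulus.

    Layout: flank_ms at the starting frequency, sweep_ms of (assumed linear)
    modulation, flank_ms at the ending frequency.
    """
    f0 = f_mean - delta_f / 2.0
    f1 = f_mean + delta_f / 2.0
    n_flank = int(round(fs * flank_ms / 1000.0))
    n_sweep = int(round(fs * sweep_ms / 1000.0))
    freqs = ([f0] * n_flank
             + [f0 + (f1 - f0) * k / n_sweep for k in range(n_sweep)]
             + [f1] * n_flank)
    samples = []
    phase = 0.0
    for f in freqs:
        samples.append(math.sin(phase))
        phase += 2.0 * math.pi * f / fs  # accumulate phase to avoid clicks
    return samples, freqs

def pitch_shift(f_perceived, f_mean):
    """Sweep pitch shift: perceived pitch minus average sweep frequency."""
    return f_perceived - f_mean

samples, freqs = sweep_waveform(f_mean=1200.0, delta_f=400.0)
```

Integrating the phase sample by sample, rather than evaluating sin(2πft) with a time-varying f, keeps the waveform continuous across the flank and sweep segments.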
The pitch shift Δp depended on the sweep's span Δf (Fig 1A and S1 Table). The exact dependence was consistent across participants for sweeps with |Δf| ≤ 333 Hz, lying in the vicinity of the linear fit f_perceived ≃ f̄ + m Δf. The average deviation from the fit was 46 Hz. Sweeps with larger frequency spans resulted in wider distributions of f_perceived (Pearson's r = 0.48, p < 10^−14; Fig 1B). All subjects showed the same sweep pitch shift direction and comparable orders of magnitude in the dependence of Δp on Δf (S1 Fig). Presenting the sweep before or after the probe tone did not systematically affect the perceived pitch (S2 Fig). Raw data is available in an external repository (github.com/qtabs/fmPitch).
In their classical study, Brady and colleagues [17] showed that the absolute value of the sweep pitch shift |Δp| is larger for down than for up sweeps. In a later study, Nabelek and colleagues [18] showed the reversed effect. To test whether our data replicate either of these previous findings, or show no up/down asymmetry at all, we drew, for each absolute frequency span |Δf|, the distribution of the differences between the pitch shift in up and down sweeps, which we term the up/down asymmetry asymm↑↓. Our results robustly replicated the observations from Nabelek and colleagues (Fig 1C). The sweep pitch shift was significantly larger for up than down sweeps for |Δf| ≥ 200 Hz (p < 2 × 10^−5) but not for |Δf| = 66 Hz (p = 0.77), according to two-tailed rank-sum tests (number of samples N = 96).
Last, we tested if the dependence of the sweep pitch on Δf was robustly replicated across subjects. The slopes of the linear fits between f_perceived and Δf were of similar magnitude in all participants (mean slope m = 0.38, standard deviation across subjects σ_m = 0.07, i.e., 18.5% of the mean value, corresponding to an effect size of d = 5.4; see S1 Fig).
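As a rough illustration of the per-participant analysis, the sketch below fits the slope m by ordinary least squares and computes the effect size as the mean slope divided by its standard deviation across subjects. The spans and shifts are synthetic placeholder values generated from the reported mean slope, not experimental data, and the function names are hypothetical.

```python
def fit_slope(delta_f, shift):
    """Ordinary least-squares slope of pitch shift against frequency span."""
    n = len(delta_f)
    mx = sum(delta_f) / n
    my = sum(shift) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(delta_f, shift))
    sxx = sum((x - mx) ** 2 for x in delta_f)
    return sxy / sxx

def effect_size(mean_slope, sd_slope):
    """Effect size of the slope relative to its spread across subjects."""
    return mean_slope / sd_slope

# Synthetic, noise-free example built from the reported mean slope.
spans = [-600.0, -333.0, -200.0, -66.0, 66.0, 200.0, 333.0, 600.0]
shifts = [0.38 * s for s in spans]
m = fit_slope(spans, shifts)
d = effect_size(0.38, 0.07)
```

With the rounded values quoted in the text, the ratio 0.38 / 0.07 is about 5.4, in line with the reported effect size.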

Bottom-up models of pitch do not explain the sweep pitch shift
Fig 1 (caption fragment; PLOS Computational Biology). B) Each sample in the distributions corresponds to the standard deviation of the perceived pitch of a sweep in one subject (i.e., in each distribution there are 8 × 3 points, one for each subject and f̄). The standard deviation is monotonically correlated to the absolute span |Δf| (Pearson's r = 0.48, p < 10^−14). C) Kernel density estimations of the up/down asymmetry asymm↑↓ distributions as defined in Eq (1). Each sample of the distributions corresponds to the difference of the average absolute deviation from centre frequency between up and down sweeps of the same |Δf| for a given subject and centre frequency (N = 8 × 3 = 24). Red crosses show the mean and the standard error of the data. https://doi.org/10.1371/journal.pcbi.1008787.g001
Current theories of pitch suggest that two complementary codes of pitch coexist in the auditory system: the spectral or place code, produced by the spectral decomposition of the stimuli performed by the basilar membrane; and the temporal code, contained in the spike timings of the neurons across the auditory pathway that are phase locked to the stimulus waveform (see [29] for a review). If the sweep pitch shift was a consequence of bottom-up pitch processing, we would expect the effect to be explainable by computational models that use either of the two representations to infer pitch. To test this we computed the pitch predicted by one representative model of each family; i.e., one model using the spectral and one model using the temporal codes. Although the possibility that the auditory system integrates information from both codes has been theorised (e.g., [25]) and implemented [30] before, the combination of both codes has so far been proposed to be purely additive. Thus, a combined spectro-temporal approach could only explain the sweep pitch shift if at least one of the two codes shows a positive pitch shift for up sweeps (Δf > 0) and a negative pitch shift for down sweeps (Δf < 0). In the spectral model, pitch can be directly inferred by computing the expected value of the activity across cochlear channels in the auditory nerve [22,25]. Predictions of the spectral model approximate the empirical data for Δf ≈ 0. Unlike the empirical data, however, predictions of the spectral model show no systematic dependence of f_perceived on Δf (Fig 2A). More sophisticated spectral models designed to explain how the pitch of harmonic complex tones is encoded [25] would yield identical results because the sinusoidal FM-sweeps used in the present experiment evoke a single peak in the spectral distribution.
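The spectral read-out described above, pitch as the expected value of activity across cochlear channels, can be sketched as a weighted mean. The channel centre frequencies and activity values below are hypothetical, and time-averaged activity is assumed; a peripheral model would supply the actual values.

```python
def spectral_pitch(centre_freqs, activity):
    """Expected value of channel centre frequency, weighted by activity."""
    total = sum(activity)
    if total <= 0:
        raise ValueError("activity must contain some positive values")
    return sum(f * a for f, a in zip(centre_freqs, activity)) / total

# Hypothetical time-averaged activity for a sweep spanning 1000-1400 Hz.
# A purely bottom-up read-out sees the same averaged profile whether the
# sweep goes up or down, so the estimate does not depend on direction.
channels = [1000.0, 1200.0, 1400.0]
averaged_activity = [1.0, 2.0, 1.0]
estimate = spectral_pitch(channels, averaged_activity)
```

Because the averaged profile is symmetric around the centre channel, the estimate coincides with the average frequency of the sweep, which mirrors the null pitch shift predicted by the spectral model.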
The temporal model was based on the principles of the summary autocorrelation function (SACF), which measures pitch according to the phase-locked response in the auditory pathway [23,31]. We chose this model because it performs a relatively straightforward analysis of phase-locked activity. As in the spectral model, predictions of the SACF approximate the empirical data for Δf ≈ 0, but do not show the dependence on Δf observed in the data (Fig 2B). This is most likely a consequence of the 2.5 ms integration time window for phase-locked activity in the auditory system [32] being too large to integrate the rapidly changing frequencies of our stimuli (up to 15 Hz/ms).
We selected these two representative models of pitch processing because they keep the largest possible amount of information from the peripheral system. Our reasoning is that, if the sweep pitch shift cannot be derived from a minimally-processed code extracted from the peripheral output, it is very unlikely that it can be derived from any further bottom-up processing of this information.

The FM-feedback spectral model
In this section we introduce a hierarchical model of FM-encoding in the auditory system, termed FM-feedback spectral model, with two levels (Fig 3). In the first level, the spectral layer holds a spectral representation of the sound. In the second level, the sweep layer encodes FM-sweep direction. The spectral layer uses the spectral rather than the temporal code to represent the instantaneous frequency of the stimuli because the integration window of the auditory system is too large to integrate the rapidly changing frequencies of the sweeps used in our experiment [32]. The animal electrophysiology literature also converges on the notion that sweep direction and rate are decoded from the spectral, and not the temporal, representation of the sounds [4,6,7,11,14,33].
Fig 3 (caption fragment). The spectral layer integrates afferent inputs from the auditory periphery and encodes a representation of the stimulus that can be used to infer pitch. The sweep layer receives afferent inputs from the spectral layer that are used to decode the direction of the sweeps. Feedback connections from the sweep layer to the spectral layer modulate the time constants of the populations that are expected to be activated once the direction of the sweep has been decoded. The inhibitory ensembles in the up and down network enforce competition between up and down ensembles in a winner-take-all fashion. Note that the diagram is schematic and shows only 5 of the N = 100 populations and a single example of the connections between the sweep and the spectral layers. The labels of the boxes of the peripheral system are also schematic: the spectral resolution of the peripheral system is much higher. See Methods for specific details on the mathematical formulation of the model.
The main hypothesis introduced in the FM-feedback spectral model is that, once the direction of the sweep is encoded in the sweep layer, a feedback mechanism modulates the effective time constant of the populations encoding the frequencies that are expected to be activated in the next instant in the spectral layer. We expect this mechanism to qualitatively explain why the posterior parts of the sweep are given a higher weight during perceptual integration and to quantitatively reproduce the exact dependence of pitch on Δf observed in our data. An implementation of the FM-feedback spectral model written in Python is freely available at github.com/qtabs/fmPitch.
Example responses of the excitatory populations of the model to up and down sweeps are shown in Fig 4.
Modelling FM direction selectivity. We modelled FM direction selectivity using the principles of delayed excitation, a mechanism where neurons with different best frequencies output to the direction selective neuron with different delays [6,10,14,26]. This mechanism introduced consistent delays between the populations in the spectral and the sweep layers. A sweep population receiving direct input from the spectral population encoding f_0 and responding selectively to up sweeps will receive increasingly delayed inputs from the spectral populations centred at f < f_0 (Fig 3). The relative delay in the connection between a spectral population m and a target sweep population n depends linearly on the spectral distance between the two ensembles: δt_nm = |n − m| δt_0. Although this configuration is optimal for linear sweeps with slopes ≃ δt_0/(f_n − f_{n−1}), adding parallel replicas of these populations with varying δt_0 would suffice to generalise the mechanism to a wider range of speeds and to non-linear sweeps. Populations selectively responding to specific rates have been reported in bats [34][35][36][37] and rodents [3,8,9].
The sweep layer consists of two networks, each encoding one of the FM directions and responding selectively to up (↑) and down (↓) sweeps. Each of the networks consists of N columns, each comprising an excitatory and an inhibitory population (Fig 3). Note that populations in the sweep network also have a best frequency and they are thus arranged according to their corresponding cochlear channel: an up population in the sweep network responds selectively to up sweeps when these span through a certain frequency range (see Fig 4B).
To quantify direction selectivity, we used the standard direction selectivity index (DSI; e.g., [11]), defined as the proportion of the activity elicited in a network by an up sweep minus the activity elicited in the same network by a down sweep with the same duration and frequency span. An ideal network responding selectively to up sweeps will have a DSI = +1 and an ideal network responding selectively to down sweeps will have a DSI = −1. Similar DSI magnitudes are measured in the down and the up network (Fig 5). Network selectivity to FM direction was robust to variations of around 20% of the fitted value of the main parameters of the model pertaining to direction selectivity (δt_0, and the conductivities and dispersion in the connectivity matrices of the bottom-up connections). Deactivation of the feedback connections, however, resulted in a (16 ± 1.4)% average decrease in absolute DSI, indicating that the feedback connections sharpened direction selectivity.
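The delayed-excitation rule can be illustrated with a toy calculation. It assumes idealised channel onset times, a hypothetical `dt0_ms`, and an up-selective target that reads only from channels at or below its own index; it is not taken from the published implementation.

```python
def delay_matrix(n_pop, dt0_ms):
    """Connection delays between spectral population m and sweep population n."""
    return [[abs(n - m) * dt0_ms for m in range(n_pop)] for n in range(n_pop)]

def arrival_spread(onsets_ms, target, dt0_ms):
    """Spread of arrival times at an up-selective target population.

    The target receives input from spectral channels m <= target, each
    delayed by |target - m| * dt0_ms. A spread of zero means the inputs
    coincide and drive the target maximally.
    """
    arrivals = [onsets_ms[m] + abs(target - m) * dt0_ms
                for m in range(target + 1)]
    return max(arrivals) - min(arrivals)

dt0 = 2.0                                # hypothetical delay step (ms)
up_onsets = [0.0, 2.0, 4.0, 6.0, 8.0]    # channels activated low to high
down_onsets = [8.0, 6.0, 4.0, 2.0, 0.0]  # channels activated high to low
delays = delay_matrix(5, dt0)
spread_up = arrival_spread(up_onsets, target=4, dt0_ms=dt0)
spread_down = arrival_spread(down_onsets, target=4, dt0_ms=dt0)
```

When the sweep advances one channel per `dt0_ms`, the delays cancel the onset differences for the preferred direction, whereas inputs for the opposite direction arrive dispersed in time.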
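A common way to write such an index is the normalised difference between responses to the two directions. The text above describes the DSI only verbally, so the exact normalisation used below is an assumption.

```python
def direction_selectivity_index(resp_up, resp_down):
    """Normalised difference of a network's responses to up and down sweeps.

    Returns +1 for a network that responds only to up sweeps and -1 for a
    network that responds only to down sweeps.
    """
    total = resp_up + resp_down
    if total == 0:
        raise ValueError("at least one response must be non-zero")
    return (resp_up - resp_down) / total

dsi_ideal_up = direction_selectivity_index(1.0, 0.0)
dsi_ideal_down = direction_selectivity_index(0.0, 1.0)
dsi_partial = direction_selectivity_index(3.0, 1.0)
```

Responses here stand for any non-negative summary of network activity, for example a time-integrated firing rate over matched up and down sweeps.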
Although we did not attempt to model FM rate selectivity, the DSIs monotonically increased with |Δf | (Fig 5), a property that could be exploited in further developments of the model to encode modulation rate [3,7,9].
Predictive mechanisms. Once neurons in the sweep layer encoded the sweep direction, feedback connections targeting the spectral layer applied currents that facilitated the encoding of expected frequencies. We will call them facilitation currents in the following. Let i be the population in the up-sweep network receiving inputs from a population in the spectral layer encoding a certain frequency f_0. Due to delayed excitation, the population i becomes active when it detects an up sweep occurring in the neighbourhood of frequencies f ≤ f_0. Although the following frequencies will not activate next when f_0 is the ending frequency of the sweep, most often f_0 will be just an intermediate frequency within the sweep. Thus, activation of i would imply that populations in the spectral layer with best frequencies immediately higher than f_0 are likely to activate next. The facilitation currents, encoded in the feedback projections stemming from the sweep layer and targeting the spectral layer, reduce the reaction time of the populations in the spectral layer that are expected to activate next using low-current feedback excitatory signals. Similarly, feedback connections stemming from a population j in the down network that received timely inputs from a spectral population with best frequency f_0 will target populations in the spectral network with best frequencies immediately lower than f_0.
NMDA receptors are typically responsible for conveying feedback excitatory information in the cerebral cortex [38,39]; specifically, NMDA deactivation results in a reduced feedback control in the auditory pathway [40]. Thus, while bottom-up drive was modelled using AMPA dynamics, feedback connections were modelled according to NMDA-like synaptic gating dynamics with a finite rising time constant [41]. Feedback current intensity was kept low in comparison to the bottom-up drive by enforcing NMDA conductivity to be much smaller than the AMPA conductivity (i.e., J_NMDA ≪ J_AMPA).
Fig 6 (caption fragment). The dashed purple line shows a trajectory followed by the population when the forward synaptic input from the peripheral layer is plugged in without feedback modulation. In this case, the population reacts slowly to the strong synaptic input, and eventually converges to equilibrium. The dotted lines (orange and red) show the trajectory of the same population in the presence of feedback modulation (i.e., the facilitation currents). The low-current feedback excitatory signals drive the population to a state with a low effective time constant without substantially increasing its firing rate (orange section of the trajectory). When the strong synaptic input from the auditory periphery is switched on (red section of the trajectory) the population reacts quickly to the synaptic input, reaching equilibrium much faster than in the non-modulated case. https://doi.org/10.1371/journal.pcbi.1008787.g006
The facilitation currents modulated the spectral population that is expected to fire next so that it subtly increased its firing rate with respect to a non-modulated population. Due to network effects captured in the mean-field model [42], this subtle activation driven by a low current effectively reduces the neural population's decaying time constant τ_pop (Fig 6), equivalent to a smaller integration time window of a leaky integrator. Endowed with a smaller effective integration time constant, the population integrates the sensory input faster and spends more time in the high-firing-rate regime than a population that has not been facilitated. Since facilitated populations spend more time in the high-firing-rate regime, frequencies expressed in the last part of the sweep have stronger contributions to the probability distribution of pitch. Stimuli with constant frequency (e.g., pure tones) do not drive any of the sweep populations and thus do not activate any feedback mechanisms. Therefore, in the absence of FM, the model reduces to a purely bottom-up spectral model.
Reproduction of the sweep pitch shift by the FM-feedback spectral model. The FM-feedback spectral model explains R^2 = 0.97 of the variance of the experimental data (Fig 7A). Moreover, there was a significant correlation between the variance of the model responses and the standard error of the experimental data (r_p = 0.63, p < 10^−10), indicating that the larger variability in the sweep pitch shift observed for the larger Δf can be understood as a consequence of a wider spread of activation across the spectral populations.
Up sweeps partially compensate for the differential delay in the basilar membrane responses to low frequencies with respect to high frequencies, provoking higher synchronisation in the auditory nerve [43]. Stronger peak activities result in slightly higher facilitation currents for up than for down sweeps, causing a noticeably stronger absolute mean pitch shift for up than for down sweeps, as observed in the experimental data (Fig 7B). Note that this is not the result of the model overfitting the data, since the average error of the model fit (E[error] ≃ 1 channel ≃ 50 Hz) is of the same order of magnitude as the up-down asymmetry E[asymm↑↓] ≃ 100 Hz.
To study the dependence of the model fit on the model's parameters we recomputed the explained variance R^2 across the parameter space of the model (Fig 8). The model explained the experimental data in a wide section of the parameter space, with an average R^2 across a […] To show that the fit of the model was not simply caused by an overall stronger activation provoked by the facilitation currents, but by a decrease in the effective time constant of the populations, we also computed the dependence of R^2 on the conductivity of the feedback J_NMDA while keeping the population time constant τ fixed to τ = τ_memb (see Methods). Even considering lower τ_memb than the physiologically valid nominal value τ_memb = 20 ms, without an adaptive τ, much stronger NMDA currents (J_NMDA ∼ J_AMPA) are necessary to shift the peak of the distribution of the responses across the spectral layer towards the experimental results.
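The consequence of a shorter effective time constant can be illustrated with a plain leaky integrator. This is a deliberately simplified stand-in for the mean-field populations of the model: the linear dynamics, the unit drive, and the two time constants compared below are assumptions made for illustration only.

```python
def time_to_fraction(tau_ms, drive=1.0, fraction=0.9, dt_ms=0.01,
                     t_max_ms=500.0):
    """Time for a leaky integrator to reach a fraction of its equilibrium.

    Integrates tau * dr/dt = -r + drive with forward Euler, starting at r = 0.
    Returns None if the target is not reached within t_max_ms.
    """
    r = 0.0
    t = 0.0
    while t < t_max_ms:
        if r >= fraction * drive:
            return t
        r += dt_ms * (drive - r) / tau_ms
        t += dt_ms
    return None

t_unfacilitated = time_to_fraction(tau_ms=20.0)  # nominal time constant
t_facilitated = time_to_fraction(tau_ms=5.0)     # hypothetical reduced value
```

For this linear system the rise time scales with τ (about τ ln 10 for the 90% criterion), so a facilitated population with a smaller effective τ reaches its high-rate regime sooner within a short sweep.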
Reproduction of previous experimental results. We tested whether the FM-feedback spectral model was able to predict the pitch shift of additional data. For this, we chose the stimuli of Brady and colleagues [17], because this was the only study that investigated the dependence of the sweep pitch shift with properties different than Δf. Specifically, in the experiment II they considered FM-sweeps with a fixed 20 ms transition between 1000 Hz and 1500 Hz that was located at six different positions within a 90 ms stimulus (see schematics in Fig 9A, left). In the experiment III, they used FM-sweeps in the same Δf but with transitions of six different durations (see schematics in Fig 9A, right). All stimuli had the same duration (90 ms) and frequency span (1000-1500 Hz); in each of the two experiments there was a total of 12 stimuli (six up, six down).
We compared the predictions of the FM-feedback spectral model with the experimental results reported in the original paper ( Fig 9B). The experimental trend is well reproduced by the model (R 2 = 0.49).

Testing the FM-feedback spectral model with a new class of stimuli
The results described so far are in favour of the hypothesis that there is a feedback system between populations of the spectral and sweep representations that has strong repercussions on perceptual behaviour. Next, to validate these findings, we introduced a new set of stimuli specifically designed to challenge the main hypothesis of the model. The new stimuli consist of concatenations of several single sweeps with the same properties as the stimuli used in the first experiment. We call them sweep trains in the following. Sweep trains present the same acoustical properties as the single sweeps used in the first behavioural experiment and should nominally elicit the same pitch percept as their single-sweep subcomponents. However, the FM-feedback spectral model predicts that the feedback system will only reduce the time constant of the spectral populations during the processing of the first sweep in the train, because they will already have an elevated firing rate (and thus a low effective time constant) during the processing of the subsequent sweeps in the train. Consequently, the model predicts that the sweep trains will elicit a much more subtle pitch shift than their single-sweep counterparts. We tested this prediction in a perceptual experiment analogous to the first experiment.
Fig 8 (caption fragment). […] Table 1. The two leftmost plots show the dependence of R^2 on the conductivity of the feedback connections and the dynamics of the excitatory population time constants. Different values of the nominal population's time constant were used to illustrate that the dynamic effect (rather than the resulting shorter time constant) is crucial to explain the experimental results; however, during the parameter tuning the time constant was constrained to τ_memb = 20 ms based on physiological observations [44]. The rightmost plot shows the dependence of R^2 on the width and reach (w_ωs and Δ_ωs, respectively; see Methods) of the feedback connections. Black crosses in the parameter space signal the final parametrisation. https://doi.org/10.1371/journal.pcbi.1008787.g008

Sweep trains show minimal sweep pitch shift
Sweep trains were constructed using the sweeps from Experiment 1. To ensure that each train was perceived as a single auditory object, we only used sweeps with |Δf| ≤ 333 Hz, resulting in a total of 3 × 6 = 18 stimuli. As in the results from Experiment 1 (Fig 1), the magnitude of the pitch shift in sweep trains depended on Δf (Fig 10 and S1 Table, bottom). However, as qualitatively predicted by the FM-feedback spectral model, the effect sizes of the correlation were lower than in the single-sweep experiment (cf., S1 Table, top). Data also showed higher variability than in Experiment 1 (S1 Fig). After completing the experiment, some participants reported in informal conversation that the sweep train stimuli were harder to match than the single-sweep counterparts. Although trains with small Δf were generally perceived as continuous tones, subjects reported that a few trains (putatively those with the largest Δf) elicited a ringing-phone-like percept. Stimuli are available in the supporting information (S1 Sounds).
Sweep-train stimuli showed only a subtle up/down asymmetry that did not reach statistical significance even for the larger Δf (p = 0.67, p = 0.96, p = 0.36 for |Δf| = 333, |Δf| = 200, |Δf| = 66, respectively; according to two-sided Wilcoxon signed rank tests with 24 samples per condition).

The FM-feedback spectral model explains the diminished pitch shift in the sweep trains
Next, we assessed the ability of the FM-feedback spectral model to quantitatively explain the effect size of the pitch shift observed in the sweep trains. The fit with the experimental data was comparable to that of the single sweep stimuli: the model explained R^2 = 0.99 of the variance of the data (Fig 11A). As in the first experiment, the standard deviations of the experimental data were strongly correlated with the width of the model responses (r_p = 0.75, p < 10^−10; Fig 11B).
Last, we tested whether the different up/down asymmetry (asymm↑↓) observed in the single sweeps and sweep train data could be quantitatively explained by the FM-feedback spectral model. In the single-sweep data, the model predicts a stronger sweep pitch shift magnitude |Δp| for up sweeps (Fig 7C) because these elicit a stronger peak activation in the auditory nerve [45], resulting in stronger facilitation currents. Qualitatively, a much weaker asymmetry was expected in the sweep-train data, since the spectral populations already have high firing rates (and thus low effective integration time constants) during the processing of the ending four fifths of the stimuli. In sweep trains, then, only the first sweep contributes to the sweep pitch shift, whereas the remaining sweeps provide equal contributions to the range of spanned frequencies, diluting the shift magnitude by four fifths. Modelling results on the up/down asymmetry closely reproduced the empirical data (Fig 11C), fully explaining the observed differences between the two classes of stimuli.

Discussion
In this work we have built a novel model that describes how feedback projections between the two different known representations of FM (i.e., spectral and sweep) could be used in the brain to facilitate encoding. This contrasts with the classical view of FM encoding as a bottom-up process [14]. The feedback mechanism proposed in this work uses predictions generated by populations encoding FM direction to aid encoding in populations encoding instantaneous frequency, enhancing direction selectivity and shortening FM processing time. Since this predictive facilitation is not intrinsically restricted to the fast FM characteristic of formant transitions in speech, similar facilitation mechanisms could also boost encoding efficiency for the slower FM underlying the perception of prosody and melody.
In this work we have used the model to encode sinusoidal (pure-tone) FM stimuli that are far from the complexity of natural speech sounds. In speech, phonemes are characterised by concurrent formant transitions that span complementary frequency ranges. Moreover, each of these formant transitions is carried by harmonic complex tones rather than pure tones. It is currently unclear how the mechanisms introduced here could be used in natural settings to encode phonemes. One possibility is that sinusoidal sweeps are first decoded in the sweep layer and integrated in a later step of the ascending hierarchy. Since the populations encoding direction selectivity in the FM-feedback spectral model are tuned to specific best frequencies, the sweep layer is potentially able to encode simultaneous sweeps spanning complementary frequency ranges and represented by parallel harmonic series. Neural populations in the sweep layer could output to a third level of abstraction where combinations of concurrent sweeps across harmonic series are mapped into phonemes. Therefore, the FM-feedback spectral model could be the first basic building block towards a more comprehensive understanding of speech processing.

Bottom-up pitch models and pitch codes
The bottom-up integration of the spectral representation, the cornerstone of the classical spectral or place theories of pitch [46], predicted a null sweep pitch shift. Other bottom-up models have also failed to parsimoniously explain the pitch shift. A previous phenomenological model suggested that a leaky integration of the instantaneous frequency could result in the ending segments of the sweeps having a stronger weight in the pitch decision, which would qualitatively explain the direction of the sweep pitch shift [21]. However, this model predicts the same pitch shift in single sweeps as in sweep trains and the same absolute pitch shift in up and down sweeps, in direct contradiction with the empirical data.
Our simulations showed that current bottom-up modelling approaches based on a temporal code cannot even extract robust pitch information from most FM-sweeps used in the experiments. This is most likely a consequence of the fast change rate in the periodicities of fast FM stimuli. Typically, pitch decisions based on the auditory nerve temporal code are made after integrating over four cycles of the period of the stimuli [32,47], coinciding with the duration threshold for accurate pitch discrimination [48]. However, our stimuli presented an average change of approximately 25 Hz across four repetitions of their average frequency, making this integration virtually impossible.
Another possibility is that a combination of the temporal and spectral codes is used to process the pitch of FM-sweeps, and that the sweep pitch shift emerges from this integration. Spectral and temporal representations of pitch play different roles in pitch processing, and it has been previously suggested that both codes could be added to perform pitch decisions (see [25] for a review). However, the temporal code had no usable pitch information at |Δp| > 200 Hz that could be integrated with the spectral code.
Although it is methodologically impossible to explore and anticipate the space of all possible models of pitch processing and more sophisticated bottom-up mechanisms might theoretically suffice to explain the sweep pitch shift in sweeps and sweep trains, the feedback modulation mechanism introduced in this study is, to date, the only available account of the experimental data.

Relation to predictive coding and hierarchical processing strategies
The presence of predictive feedback modulation in the subcortical sensory pathway has been shown before in humans [49,50] and non-human mammals [51]. Previous studies often interpreted it in the context of the predictive coding framework [27,28,52], a theory of sensory processing that postulates that sensory information is encoded as prediction error; i.e., that neural activity at a given level of the processing hierarchy encodes the residuals of the sensory input with respect to a generative model encoded higher in the hierarchy.
The FM-feedback spectral model can also be understood in the light of this formalism. It presents three hierarchical layers of abstraction: the inputs from the peripheral system, the spectral layer, and the sweep layer. The top layer performs predictions on the sensory input incoming at the immediately lower representation of the hierarchy. However, unlike the classical predictive coding microcircuit, where the generative model used to perform predictions and the prediction error are kept in separate neural ensembles [53], the sweep network holds a single representation that is both descriptive of its own level and predictive for the immediately lower level of the hierarchy.
Combining the generative model and stimulus representations in the same neural code solves some of the open questions of classical predictive coding architectures recently summarised by Denham and Winkler [54]: i) "what precisely is meant by prediction?", ii) "which generative models [within the hierarchy] make the predictions?", and iii) "what within the predictive framework is proposed to correlate with perceptual experience?". In the FM-feedback spectral model, the predictions can be summarised as the probability distribution of patterns of activation expected to come next in the lower level given what has been encoded so far in the higher level. These conditional probability distributions are encoded in the feedback connections stemming from the neurons holding the high-level representation and targeting the neurons holding the lower level representations. Such connectivity patterns would represent the statistics between the representations in the two levels, and could have arisen naturally during development after sufficient exposure to the stimuli. The perceptual experience in the FM-feedback spectral model is encoded in the activation along the two hierarchical stages, which encode different aspects of the stimuli.
Another key difference between the FM-feedback spectral model's architecture and the classical predictive coding microcircuit is that, rather than encoding the residuals of the spectral representation with respect to the FM-sweep representation, neurons in the spectral layer simply encode the spectral content of the stimulus. However, since the decoding of the predictable parts of the stimuli is faster, predictability potentially entails a significant decrease in the amount of signal produced during the encoding. Such a mechanism would explain why even expected stimuli, for which the residual should theoretically be zero, still evoke measurable responses (e.g., [51]).

Comparison with previous measurements of the sweep pitch shift
Our experimental findings qualitatively replicated the sweep pitch shift found in previous studies; namely, we found that the pitch elicited by FM-sweeps was biased towards the frequencies spanned in the ending part of the sweeps [17], and that the perceptual bias is monotonically related to the frequency span Δf [18,19]. On average, we estimated a putative linear relation between the pitch shift Δp and Δf with a slope of around m ≈ 0.38, slightly higher than Brady's [17] (m ≈ 0.34 with transitions of 50 ms) and Nabelek's [18] (m ≈ 0.32 with transitions of 40 ms) reports, and significantly higher than Rossi's [19] estimation (m = 1/6 ≈ 0.17 with transitions of 200 ms). Since Rossi's transitions were 5 times longer than ours, the estimations are difficult to compare. However, the disagreement seems to indicate that the pitch shift is stronger with shorter durations. This observation would be fully compatible with the mechanism of predictive facilitation described in the FM-feedback spectral model: since the time to decode FM direction is independent of sweep duration, only the most posterior part of the stimulus is facilitated in short sweeps, whereas in a long sweep the facilitation currents would affect a much larger portion of the sound, potentially including frequencies occurring even before f̄.
In their original study, Brady and colleagues [17] found that some sweeps of 20 ms duration elicited pitch values that coincided with and even exceeded the frequency span of the stimuli, especially in up sweeps. This perceptual extrapolation or overshoot has been replicated in two later studies [55,56] and seems to further confirm the idea that the sweep pitch shift is driven by feedback between the sweep and spectral layers. However, the facilitation currents present in our model could not provoke activation of frequencies that are not initially present in the stimulus. One possibility is that the facilitation currents induce activation in neurons encoding frequencies that, although beyond the spectral range spanned by the sweep, are present in the stimulus spectrum. This scenario is unlikely in our data because we used frequency-modulated sinusoids that elicit a unimodal distribution of responses in the auditory periphery. However, the three studies that showed the overshoot effect used spectrally rich stimuli. This would also explain why our model predicts smaller effect sizes than Brady's data in Fig 9. To clarify this point, further work could test both whether, as predicted by our model, sinusoidal sweeps do not produce an overshoot, and whether an extension of our model able to handle spectrally rich stimuli reproduces the overshoot effect shown in previous experimental data.
The FM-feedback spectral model also provides a mechanistic interpretation of the two additional experiments reported in Brady's original study [17]. In Brady's experiment II, the transient duration of the sweep is kept constant but its onset is varied across the stimulus duration. When the transient is located near the beginning of the stimulus, the greatest part of the sound excites neurons encoding frequencies close to the posterior parts of the transient, pushing the distribution of the responses away from the average frequency towards the ending frequency of the sweep f_1. This shift is larger than expected for a sound without a transient because of the feedback modulation of the later frequencies exerted by the sweep network. When the transient is located at the very end of the stimulus, the longer portion of the stimulus exciting f_0 compensates for the shift in the frequency distribution, bringing the perceived pitch closer to the starting frequencies of the stimulus.
In Brady's experiment III, the transient's onset is kept constant and it is the duration of the transient that is varied. The decreased sweep pitch shift observed for shorter in comparison to longer transient durations can be explained by the FM-feedback spectral model as a consequence of the stimuli presenting a larger segment with the initial frequency, thus shifting the distribution of the responses towards f_0.

FM encoding and physiological location of the sweep and spectral layers
The earliest neural centre within the auditory pathway showing FM direction selectivity in mammals is the inferior colliculus [3][4][5][6], although thalamic nuclei (medial geniculate body) [6,7] and auditory cortex [4][8][9][10][11] show generally stronger DSIs. Thus, the sweep layer postulated in the FM-feedback spectral model could be implemented even at early stages of the auditory hierarchy. Similarly, since all the nodes in the ascending auditory pathway contain tonotopically arranged nuclei, the spectral layer could be putatively located as early as in the cochlear nucleus. The physiological location of the mechanisms described here remains an open question.

Conclusion
In this work we have harnessed a well-established perceptual phenomenon to inform a model of FM direction encoding. We have shown that representative bottom-up models of pitch processing do not explain the pitch elicited by fast FM sweeps. Based on neurobiological considerations, we hypothesised that FM direction-selective neurons alter the way that spectral information is encoded via a feedback mechanism. We used the hypothesis to develop a model that proposes how this feedback modulation might be exerted and how it might affect the pitch percept elicited by FM sweeps. Although we cannot logically exclude other potential explanations for the effect, we provide evidence that our hypothesis is a likely and plausible mechanism underlying the encoding of formant transitions. These mechanisms could be part of a larger hierarchical network that transforms formant transitions into phonemes, phonemes into syllables, and syllables into words. Unravelling the fundamental building blocks of this hierarchy is a necessary prerequisite for a comprehensive understanding of the computational mechanisms underlying speech perception in the human brain.

Ethics statement
The study was approved by the ethics committee of the University of Leipzig (ethics approval number 273/14-ff). All participants provided informed verbal consent.

Measuring the sweep pitch shift in single sweeps
Participants. Eight participants (4 female), aged 22 to 31 (average 26.9) years old, were included in the study. They all had normal hearing thresholds between 250 Hz and 8 kHz (<25 dB HL) according to pure tone audiometry (Micromate 304, Madsen Electronics). All reported at least five years of musical experience, but none of them was a professional musician. See the section Inclusion criteria below.
We considered a sample of N = 8 as sufficient for two reasons. First, the sweep pitch shift has been independently demonstrated in several previous studies [17][18][19]. The first experiment in our study is a replication of these previous studies that allowed us to quantify the sweep pitch shift. Second, low-level psychoacoustic phenomena are typically characterised by small inter-subject variabilities, so not many participants are necessary to demonstrate their generalisability [57]. We selected 8 participants to ensure that this was the case, but the literature is populated with highly reproducible results that are inferred from experiments performed on populations as small as N = 4 [58]. Our data and previous data confirm that the sweep pitch shift is indeed present and shows the same direction and order of magnitude at the single-subject level (S1 Fig; see also Tableau IV in [19] showing the effect in 18 subjects). Both experiments carried out in our study were taxing and extremely long, lasting for up to three hours per subject. Thus, while increasing the sample size would not have resulted in a stronger demonstration of the effect, it would have incurred an unjustified waste of resources.
Stimuli. Stimuli were 50 ms long frequency-modulated sweeps. Frequency was kept constant during the first and final 5 ms of the sweeps. The modulation was asymptotic (i.e., linear in the period T = 1/f space) and carried out in 40 ms. Stimuli were ramped-in and ramped-out with 5 ms Hanning windows overlapping the sections with constant frequency.
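The stimulus construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' MATLAB script: the sampling rate, the function names, and the convention that the sweep runs from f̄ − Δf/2 to f̄ + Δf/2 are assumptions made for the example.

```python
import math

def sweep_instantaneous_frequency(f_mean, delta_f, fs=44100,
                                  plateau=0.005, transition=0.040):
    """Instantaneous frequency (Hz) for every sample of one sweep.

    Assumes f0 = f_mean - delta_f/2 and f1 = f_mean + delta_f/2, and a
    modulation that is linear in the period T = 1/f.
    """
    f0, f1 = f_mean - delta_f / 2, f_mean + delta_f / 2
    n_plat = round(plateau * fs)
    n_tran = round(transition * fs)
    freqs = [f0] * n_plat                      # constant-frequency onset
    for i in range(n_tran):
        x = i / (n_tran - 1)                   # goes from 0 to 1
        period = (1 - x) / f0 + x / f1         # linear interpolation of the period
        freqs.append(1 / period)
    freqs += [f1] * n_plat                     # constant-frequency offset
    return freqs

def synthesise(freqs, fs=44100, ramp=0.005):
    """Integrate the instantaneous frequency into a phase and apply ramps."""
    n_ramp = round(ramp * fs)
    n = len(freqs)
    phase, wave = 0.0, []
    for i, f in enumerate(freqs):
        gain = 1.0
        if i < n_ramp:                         # Hann-shaped onset ramp
            gain = 0.5 * (1 - math.cos(math.pi * i / n_ramp))
        elif i >= n - n_ramp:                  # Hann-shaped offset ramp
            gain = 0.5 * (1 - math.cos(math.pi * (n - 1 - i) / n_ramp))
        wave.append(gain * math.sin(phase))
        phase += 2 * math.pi * f / fs
    return wave

freqs = sweep_instantaneous_frequency(f_mean=1200, delta_f=200)
wave = synthesise(freqs)
```

Because the interpolation is linear in the period, the frequency at the temporal midpoint of the transition is the harmonic mean of the two end frequencies (about 1192 Hz in this example) rather than their arithmetic mean.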
There were 30 single sweeps with 10 linearly distributed frequency spans Δf ∈ [−600, 600] Hz and 3 average frequencies f̄ ∈ {900, 1200, 1500} Hz. For each sweep with a given Δf and f̄, the initial and final frequencies were
$$ f_0 = \bar{f} - \frac{\Delta f}{2}, \qquad f_1 = \bar{f} + \frac{\Delta f}{2}. $$
Sounds were delivered through Sennheiser HD201 over-ear headphones (Sennheiser electronic GmbH & Co. KG; Germany) connected to a Realtek ALC1150 soundcard (Realtek Semiconductor Corp.; Taiwan). Participants were required to adjust the loudness of the stimuli to a comfortable level during the pure-tone-test (see Inclusion criteria), so that they had a wide range of pure tones to use as reference. The experiment was carried out in a quiet room. Stimuli were produced and delivered by a custom-made MATLAB (MathWorks, Natick, USA) script. Scripts running the experiment and generating the sounds are freely available at github.com/qtabs/fmPitch/experiment.
Experimental design. Each trial consisted of a sequential presentation of a target sweep and a probe pure tone. After the presentation, the participant was asked whether the second sound evoked a higher, equal, or lower pitch percept than the first sound. Participants were allowed to replay the sounds as many times as needed in case of doubt. After the response, the software adjusted the frequency of the probe tone in steps of ±25 Hz, bringing the pitch of the probe closer to the participant's percept (e.g., if the participant judged the target sweep as having a lower pitch than the probe tone, the frequency of the probe tone was reduced by 25 Hz). This procedure was repeated until the participant reported that the two sounds evoked the same pitch percept. Then, the frequency of the matched pure tone was stored as the perceived pitch of the sweep reported in that trial, and a new trial with a new target sweep began. The initial frequency of the probe tone was sampled from a Gaussian distribution centred on the average frequency f̄ of the target sweep.
Each of the 30 sweeps was matched four times, so that there were 120 trials in total in the experiment. The relative order of the probe tone and the target sweep was reversed in half of the trials to assess if presentation order affects the sweep pitch shift. Thus, the experiment can be described as a 10 (10 different frequency spans) × 3 (3 average frequencies) × 2 (probe played first or last) factorial design.
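The matching procedure can be illustrated with a short simulation. The sketch below replaces the participant with a hypothetical listener who answers "equal" whenever the probe lies within half a step of an internal percept; this listener model, the standard deviation of the starting frequency, and all function names are assumptions introduced for illustration only.

```python
import random

STEP_HZ = 25.0  # frequency step of the probe tone, as in the text

def simulated_listener(probe_hz, percept_hz, tolerance_hz=STEP_HZ / 2):
    """Hypothetical listener: compares the probe with an internal percept."""
    if abs(probe_hz - percept_hz) <= tolerance_hz:
        return "equal"
    return "higher" if probe_hz > percept_hz else "lower"

def match_pitch(percept_hz, f_mean, rng, start_sd_hz=50.0, max_steps=1000):
    """Adjust the probe in 25 Hz steps until the listener reports 'equal'."""
    probe = rng.gauss(f_mean, start_sd_hz)   # start_sd_hz is an assumption
    for _ in range(max_steps):
        answer = simulated_listener(probe, percept_hz)
        if answer == "equal":
            return probe                     # stored as the perceived pitch
        # Probe judged higher than the target: lower it, and vice versa.
        probe += -STEP_HZ if answer == "higher" else STEP_HZ
    raise RuntimeError("no match reached")

rng = random.Random(0)
matches = [match_pitch(percept_hz=1240.0, f_mean=1200.0, rng=rng)
           for _ in range(4)]
```

With a 25 Hz step, the matched frequency of this idealised listener can deviate from the underlying percept by up to 12.5 Hz, which gives a sense of the resolution of the procedure.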
Inclusion criteria. We initially recruited 22 candidate participants, who were screened by a first behavioural test assessing their capacity to match pure tones against pure tones (pure-tone-test), and then by a second behavioural test measuring their consistency when matching sweeps against pure tones (sweep-test). 14 of those 22 participants did not meet the inclusion criteria: one was unable to match pure tones of the same frequency, and 13 were unable to match sweeps against pure tones consistently. Consistency was assessed independently for each subject: i.e., the test did not evaluate whether the subject conformed to the results of other participants or if the subject showed a sweep pitch shift in any direction. The test only served to evaluate whether the subject assigned approximately the same pure tones to the same sweep, and was meant to exclude subjects that, due to poor pitch discrimination abilities or lack of motivation, were unable to perform the task.
The pure-tone-test was designed to ensure that participants had understood the task. We used the same experimental design as in the main experiment, including the same frequency step of 25 Hz, but both probe and target consisted of pure tones. During the pure-tone-test, the software provided feedback after each trial informing the participant whether the response was correct or incorrect. The pure-tone-test was divided in batches of six trials, and it concluded when the participant correctly matched the pitch of every trial in one batch. Most participants completed the pure-tone-test in the first batch; the participant excluded during the pure-tone-test failed to provide correct estimates for as many as six batches.
The sweep-test was used to evaluate whether participants could perform self-consistent judgements on the pitch of FM-sweeps. During the sweep-test, participants undertook a block of 12 trials consisting of 4 repetitions of the same 3 sweeps: {Δf = 67 Hz, f̄ = 900 Hz}, {Δf = −200 Hz, f̄ = 1200 Hz}, and {Δf = −67 Hz, f̄ = 1500 Hz}. We chose these sweeps to ensure diversity of the sweep properties while keeping |Δf| small enough to ensure that the sweeps would elicit an unequivocal pitch percept according to Hart's law [59]. After the completion of this block, we scored the participant's pitch matching consistency as the inverse of the average of the absolute differences between the reported pitch in each sweep. Participants with an average standard deviation larger than twice the frequency increment step (50 Hz) were excluded from the experiment.
Since Hart's law [59] establishes that the sweeps used during the sweep-test elicit an unequivocal pitch percept, excluded participants were either unable or unwilling to perform consistent pitch judgements on sweeps. The inclusion of participants with inconsistent judgements would have contaminated the data with random guesses that could bias our estimations of the sweep pitch shift towards Δp = 0. Six of the excluded participants reported no previous musical experience; the remaining 8 had at least five years of musical training.
Experimental procedure. The 8 included participants matched the remaining 27 sweeps in four additional blocks. No sweep type was repeated within a single block, and all sweeps were presented 4 times across the entire experiment, resulting in 27 trials per block. The order of the sweeps within each block was randomised and the relative position of the probe tone with respect to the target stimulus was pseudorandomised so that in half of the trials in each block the probe tone was presented before the target sweep. Participants were instructed to take rests between blocks and were allowed to take as many shorter rests between trials as needed. To encourage precision, a €5 reward was offered to participants who kept their self-consistency throughout the main experiment, using the same criterion as in the evaluation: a standard error smaller than twice the frequency step (50 Hz) within each sweep type. Only sweeps expected to yield the most unequivocal pitch sensation according to Hart's law [59] (i.e., |Δf| ≤ 200 Hz) were used to compute the overall self-consistency; participants were however unaware of this. Participants typically completed the experiment within 3 hours.
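One way to operationalise the exclusion rule is sketched below. The text mentions both an average of absolute differences and an average standard deviation; this sketch uses the latter, with the sample standard deviation as an assumed choice, and the example responses are invented purely for illustration.

```python
from statistics import mean, stdev

def is_consistent(matches_by_sweep, threshold_hz=50.0):
    """True if the average within-sweep spread stays below the threshold.

    matches_by_sweep maps a sweep label to the list of matched probe
    frequencies (Hz) obtained in its repetitions. Using the sample
    standard deviation is an assumption of this sketch.
    """
    spreads = [stdev(values) for values in matches_by_sweep.values()]
    return mean(spreads) <= threshold_hz

# Invented example responses (Hz), four repetitions per sweep.
steady = {"sweep_a": [900, 925, 900, 925], "sweep_b": [1175, 1200, 1200, 1175]}
erratic = {"sweep_a": [900, 1050, 950, 1100], "sweep_b": [1100, 1300, 1150, 1250]}
```

Here `is_consistent(steady)` holds, whereas `is_consistent(erratic)` does not, since the within-sweep spread of the second listener is well above 50 Hz.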

Measuring the sweep pitch shift in sweep trains
Participants. The same 8 participants who completed the first experiment were invited to repeat the measurements with the new stimuli.
Stimuli. Stimuli were concatenations of 5 sweeps adding up to a total of 250 ms (sweep trains; see Fig 12). The sweeps were taken from a subset of 18 elements from the first experiment with 6 different frequency spans Δf ∈ [−333, 333] Hz. To ensure continuity of the stimulus waveform, the sweeps were concatenated in the frequency domain (i.e., we computed the waveform of the stimuli by performing a Fourier transform over the concatenation of the time courses of the instantaneous frequencies). 5 ms Hanning windows were applied only at the very beginning and very end of the sweep trains.
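The idea of concatenating the instantaneous frequencies rather than the waveforms can be illustrated as follows. This sketch swaps in a different technique from the one described above: it obtains the waveform by accumulating the phase of the concatenated frequency trajectory, which is one simple way to keep the waveform continuous at the joins. It also assumes that a train repeats the same sweep five times, that f_0 = f̄ − Δf/2 and f_1 = f̄ + Δf/2, and a sampling rate of 44.1 kHz; onset and offset ramps are omitted.

```python
import math

def sweep_freqs(f_mean, delta_f, fs=44100, plateau=0.005, transition=0.040):
    """Instantaneous frequency of one sweep, linear in the period."""
    f0, f1 = f_mean - delta_f / 2, f_mean + delta_f / 2   # assumed definition
    n_plat, n_tran = round(plateau * fs), round(transition * fs)
    ramp = [1 / ((1 - i / (n_tran - 1)) / f0 + (i / (n_tran - 1)) / f1)
            for i in range(n_tran)]
    return [f0] * n_plat + ramp + [f1] * n_plat

def sweep_train(f_mean, delta_f, n_sweeps=5, fs=44100):
    """Concatenate frequency trajectories, then integrate a single phase."""
    freqs = sweep_freqs(f_mean, delta_f, fs) * n_sweeps
    phase, wave = 0.0, []
    for f in freqs:
        wave.append(math.sin(phase))
        phase += 2 * math.pi * f / fs      # phase never jumps at the joins
    return freqs, wave

train_freqs, train_wave = sweep_train(1200, -333)
```

Although the instantaneous frequency jumps from f_1 back to f_0 at each join, the phase is accumulated across the whole train, so consecutive samples of the waveform never differ by more than one phase increment.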
Experimental design. The matching procedure was the same as in the first experiment: the participants matched the pitch of the sweep trains to probe pure tones whose frequency they could adjust with the aid of a computer software. To ensure that there were no effects of stimulus duration, the probe tones had the same duration as the sweep trains (i.e., 250 ms). As in the first experiment, each of the 18 sweep trains was matched four times, so that there were 72 trials in the second experiment. The relative order of the probe tone and the target sweep train was also reversed in half of the trials. Thus, the second experiment can be described as a 6 (different frequency spans) × 3 (average frequencies) × 2 (probe played first or last) factorial design.
Experimental procedure. Since the participants were already familiar with the task and had proved able to match the pitch of FM-sweeps consistently, the experiment contained no pure-tone-test or sweep-test. Four repetitions of the 18 sweep trains were distributed across 5 blocks following the same principles as described in the first experiment. Participants typically completed the second experiment within 2 hours.

Bottom-up models of pitch
Spectral models of pitch processing. The responses at the auditory nerve were computed with a model of the peripheral auditory system [22,60]. The model's output represents the expected firing rate p_n(t) in a fibre of the auditory nerve associated with the nth cochlear channel (n = 1, 2, ..., N) at an instant t. The frequency range of the cochlear model was discretised in N = 100 channels, spanning frequencies from f_min = 125 Hz to f_max = 10 kHz.
The perceived pitch corresponded to the expected cochlear channel k, E[k], according to a probability distribution ρ derived from the integral of p_n(t) over the duration of the stimulus L:
$$ E[k] = \sum_{k=1}^{N} k \, \rho_k, \qquad \rho_k = \frac{\int_0^L p_k(t)\, dt}{\sum_{n=1}^{N} \int_0^L p_n(t)\, dt} \tag{2} $$
To compare the predictions of the model with the experimental data, we also computed the expected channels E[k] associated with pure tones with the frequency of the average perceived pitch of each sweep.
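The spectral pitch estimate amounts to a centre of mass along the tonotopic axis. The sketch below assumes that the probability of each channel is its time-integrated firing rate divided by the total across channels; the function name and the toy rates are illustrative only.

```python
def expected_channel(rates, dt):
    """Expected cochlear channel E[k] from firing rates.

    rates[n][t] is the firing rate of channel n + 1 at time step t and
    dt is the duration of a time step in seconds.
    """
    integrals = [sum(r) * dt for r in rates]       # integral over the stimulus
    total = sum(integrals)
    rho = [x / total for x in integrals]           # probability of each channel
    return sum((n + 1) * p for n, p in enumerate(rho))

# Toy example: three channels, two time steps; channel 3 is the most active.
toy_rates = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]
e_k = expected_channel(toy_rates, dt=0.001)        # 2 * 0.25 + 3 * 0.75 = 2.75
```

The resulting E[k] is generally not an integer, which is why the comparison with the behavioural data is made against the E[k] evoked by pure tones rather than against channel centre frequencies.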
Temporal models of pitch processing. The summary autocorrelation function (SACF) used in this work follows the original formulation by Meddis and O'Mard [23,24]. Essentially, this model posits the existence of an array of M periodicity detectors responding more saliently to a preferred period δt_m. The instantaneous firing rate A_m(t) of the mth periodicity detector (m = 1, 2, ..., M) follows:
$$ \tau^{\mathrm{SACF}}_m \, \dot{A}_m(t) = -A_m(t) + \sum_{n=1}^{N} p_n(t)\, p_n(t - \delta t_m) $$
where the auditory nerve activity p_n(t) in the cochlear channel n at an instant t is computed as in the previous section. The characteristic periods δt_m are uniformly distributed between δt_1 = 0.5 ms and δt_M = 30 ms, which allows the model to capture periodicities corresponding to frequencies between 2 kHz and 135 Hz up to four lower harmonics. We kept a fixed integration constant τ_m^SACF = 2.5 ms; using variable τ_m^SACF that depend linearly on δt_m (see details in [31,61]) did not result in substantial changes in our results.
Stimuli presenting periodicities at a certain frequency f typically elicit peaks of activation in the detectors tuned to the preferred period δt_m = 1/f = T_0 and to the periods corresponding to all subsequent lower harmonics δt_m = 2T_0 = T_1, δt_m = 3T_0 = T_2, etc. Thus, the evidence B_T(t) for the period T at an instant t is derived from the activation of the detectors tuned to T and to the periods of its lower harmonics. The perceived pitch corresponded to the expected period T, E[T], according to a probability distribution ρ derived from the integral of B_T(t) over the duration of the stimulus L:
$$ E[T] = \sum_{T} T \, \rho_T, \qquad \rho_T = \frac{\int_0^L B_T(t)\, dt}{\sum_{T'} \int_0^L B_{T'}(t)\, dt} $$
To compare the predictions of the model with the experimental data, we computed the expected period E[T] associated with pure tones with the frequency of the average perceived pitch of each sweep.
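A toy version of the periodicity detectors helps to see why they respond to the period of the stimulus. The sketch below assumes that each detector leakily integrates, with a 2.5 ms time constant, the product of the auditory nerve activity with a delayed copy of itself, summed over channels; this is a common way of writing a summary autocorrelation and is not taken verbatim from the model. The single-channel input, the sampling rate, and the function name are illustrative.

```python
import math

def sacf(channels, lags, dt, tau=0.0025):
    """Leaky-integrated summary autocorrelation.

    channels[n][t] is the firing rate of channel n at time step t and
    lags are expressed in samples. Returns one activation trace per lag.
    """
    n_steps = len(channels[0])
    out = []
    for lag in lags:
        a, trace = 0.0, []
        for t in range(n_steps):
            drive = sum(p[t] * p[t - lag] for p in channels) if t >= lag else 0.0
            a += dt / tau * (-a + drive)   # Euler step of tau * dA/dt = -A + drive
            trace.append(a)
        out.append(trace)
    return out

fs = 20000                                  # illustrative sampling rate (Hz)
dt = 1 / fs
# One channel carrying a half-wave rectified 200 Hz tone (period 5 ms), 50 ms long.
p = [max(0.0, math.sin(2 * math.pi * 200 * t * dt)) for t in range(1000)]
half_period, full_period = sacf([p], lags=[50, 100], dt=dt)
```

The detector tuned to the full period (5 ms) accumulates activation, whereas the detector tuned to half the period stays silent because the signal and its delayed copy are never active at the same time. With fast FM the period changes within the integration window, which is consistent with the difficulty of the temporal code described in the Discussion.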

Details on the predictive model of FM encoding
Spectral layer and pitch estimations. The spectral layer consists of an array of N = 100 neural populations that integrate the output of the peripheral model. Neural populations are modelled according to a mean-field derivation [62] on linear integrate-and-fire neurons that, although first formulated to describe dynamics in cortical regions dedicated to visual decision making, has shown great versatility in approximating the dynamics of many different cortical areas (e.g., [63]). We decided to use a simple point model without physiological detail because we do not know exactly the location in the brain of the system we are modelling.
The firing rate h_n(t) of the nth ensemble follows the dynamics of a leaky integrator:
$$ \tau^{\mathrm{pop}} \, \dot{h}_n(t) = -h_n(t) + \phi\big(I_n(t)\big) \tag{5} $$
where ϕ(x) = (cx − I_0)/(1 − e^{−g(cx − I_0)}) is the transfer function of the mean-field model, I_n(t) is the synaptic input to the population, and τ_pop = τ_pop(h, I) are adaptive time constants that depend on the firing rate and the synaptic input of the population. The analytic formulation of τ_pop(h, I) stems from a theoretical study of networks of exponential integrate-and-fire neurons [42] and is parameterised by Δ_T = 1 mV, the size of the spike initialisation of the neural model, and by the neural membrane time constants τ_e^memb = 20 ms and τ_i^memb = 10 ms [44] for excitatory and inhibitory populations, respectively. Using adaptive integration time constants makes the populations react faster to changes when they are marginally active and have weak synaptic inputs, a behaviour often reported in tightly connected populations of neurons [42]. This component is the key of the feedback mechanism used to increase the responsiveness of the populations encoding the expected parts of the sweeps (Fig 6).
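The population dynamics can be sketched with a forward-Euler integration of a leaky integrator driven through the transfer function ϕ. For simplicity this sketch uses a fixed time constant of 20 ms instead of the adaptive τ_pop, and the values c = 270 Hz/nA, I_0 = 108 Hz and g = 0.154 s are the ones commonly quoted for excitatory populations in this family of mean-field models; both choices are assumptions of the example, not the parameters of Table 1.

```python
import math

def phi(x, c=270.0, i0=108.0, g=0.154):
    """Transfer function phi(x) = (c*x - I0) / (1 - exp(-g*(c*x - I0)))."""
    u = c * x - i0
    if abs(u) < 1e-9:                  # removable singularity at u = 0
        return 1.0 / g
    return u / (1.0 - math.exp(-g * u))

def simulate_population(inputs, dt=1e-4, tau=0.020):
    """Euler integration of tau * dh/dt = -h + phi(I) with a fixed tau."""
    h, trace = 0.0, []
    for current in inputs:
        h += dt / tau * (-h + phi(current))
        trace.append(h)
    return trace

# Constant input of 0.5 nA applied for 500 ms.
trace = simulate_population([0.5] * 5000)
```

With a constant input the firing rate relaxes exponentially towards ϕ(I), about 27.4 Hz in this example. In the model, the adaptive time constant changes how quickly this relaxation happens, which is the property exploited by the feedback mechanism.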
Inputs I_n^f(t) were modelled with AMPA synaptic dynamics [41], with time constant τ_AMPA and an effective synaptic efficacy J_in^AMPA for the peripheral input. AMPA synapses present short time constants that are able to preserve the fine temporal structure of auditory input, and thus are the major receptor type conveying bottom-up communication in the auditory pathway (e.g., [64]).
We allowed some dispersion in the propagation from the peripheral model to the spectral layer by using a Gaussian-shaped connectivity matrix. This ensured that the bandwidth of the self-excitation in the spectral representation is independent of the number of cochlear channels. Note that we used the index f to denote variables in the spectral layer.
The perceived pitch corresponded to the expected cochlear channel k, E[k], according to a probability distribution ρ derived from the integral of h_n^f(t) over the duration of the stimulus L (cf. Eq (2)):
$$ E[k] = \sum_{k=1}^{N} k \, \rho_k, \qquad \rho_k = \frac{\int_0^L h^f_k(t)\, dt}{\sum_{n=1}^{N} \int_0^L h^f_n(t)\, dt} \tag{10} $$
The time constant τ_AMPA = 2 ms was taken from the literature [41]. The effective conductivity J_in^AMPA = 0.38 nA was manually tuned within the realistic range such that the peripheral system would elicit firing rates in the range 5 Hz ≤ h_n(t) ≤ 100 Hz in the integrator ensembles. The transfer function and its parameters, empirically derived for networks of integrate-and-fire neurons, were taken from [62].
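A Gaussian-shaped connectivity matrix of the kind mentioned above can be sketched as follows. The row normalisation and the function name are assumptions of this example, and the dispersion of 0.1N corresponds to the initial, untuned value of the dispersion constants rather than to the tuned values of Table 1.

```python
import math

def gaussian_connectivity(n, sigma):
    """n x n matrix whose rows are Gaussians centred on the diagonal."""
    rows = []
    for i in range(n):
        row = [math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) for j in range(n)]
        norm = sum(row)
        rows.append([w / norm for w in row])   # row normalisation is an assumption
    return rows

N = 100
omega_in = gaussian_connectivity(N, sigma=0.1 * N)
```

Expressing the dispersion as a fraction of N, rather than as a fixed number of channels, keeps the spread constant in units of the tonotopic axis when the number of channels changes.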

Sweep layer and direction selectivity. We used delayed excitation [10,26,35] to model FM-direction selectivity. Two additional mechanisms for FM direction selectivity have been identified in IC, MGB and auditory cortex in the animal electrophysiology literature: asymmetric sideband inhibition [3,35,36], and duration sensitivity [6,36,37]. Although both delayed excitation and sideband inhibition contribute to direction selectivity in the mammalian auditory pathway [3,35,36], the two mechanisms are often redundant and yield equivalent results when embedded in a neuronal model [14]. We chose to use delayed excitation alone for simplicity but, given that all models show similar direction and rate selectivity to FM-sweeps, replacing it by or adding any extra mechanism is unlikely to affect the model predictions.
The sweep layer consists of four arrays of N = 100 neural populations following the same dynamics described in the previous section (i.e., Eq (5)). From the four arrays, two (one excitatory, one inhibitory) are tuned to up sweeps, and two (again, one excitatory and one inhibitory) are tuned to down sweeps (Fig 3). The neural populations are characterised by the instantaneous firing rates h_n^{↑e}(t), h_n^{↑i}(t) (up) and h_n^{↓e}(t), h_n^{↓i}(t) (down), and receive synaptic inputs I_n^{↑e}(t), I_n^{↑i}(t) (up) and I_n^{↓e}(t), I_n^{↓i}(t) (down), respectively for excitatory and inhibitory populations. Although the transfer functions ϕ(x) are the same for all the ensembles, the parameters c, I_0, and g are different for excitatory and inhibitory populations [62].
Excitatory and inhibitory inputs to populations in the sweep layer are modelled according to AMPA-like and GABA-like synaptic gating dynamics [41]. The excitatory-to-inhibitory and inhibitory-to-excitatory connectivity matrices ω^{ei} and ω^{ie} are Gaussian-shaped and centred on the identity matrix. The remaining connectivity matrices ω^{f↑} and ω^{f↓} are defined to constrain the up (down) network to inputs from lower (higher) frequencies and to limit the range of the connection to a small number of populations Δ_ωf of the spectral representation.
The free parameters were initialised to standard values (the effective conductivities J_f^AMPA, J^GABA, and J_s^AMPA, according to [62]; the baseline delay δt_0 to 2 ms/channel; and the dispersion constants σ_in, σ_ei, σ_ie, and Δ_ωf, to 0.1N) and manually tuned so that the networks showed direction selectivity for the FM-sweep characteristics (duration, rates, Δf) of the stimuli used in the first part of the study. Unless stated otherwise, all simulations reported in this work use the parameters listed in Table 1.
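The principle behind delayed excitation can be illustrated with a deliberately simplified coincidence detector. In this sketch an up-selective unit listens to a few channels below its own centre channel, and the input from each channel is delayed in proportion to its tonotopic distance; the event-based representation, the number of inputs, and all names are assumptions made for illustration and do not reproduce the synaptic dynamics of the model.

```python
def delayed_sum(onset_times, target, n_inputs, delay_per_channel):
    """Histogram of arrival times at an up-selective unit centred on `target`.

    onset_times maps a channel index to the time (ms) at which the sweep
    crosses that channel. The input from channel target - d is delayed by
    d * delay_per_channel.
    """
    arrivals = {}
    for d in range(1, n_inputs + 1):
        source = target - d
        if source in onset_times:
            t = onset_times[source] + d * delay_per_channel
            arrivals[t] = arrivals.get(t, 0) + 1
    return arrivals

n_channels, step = 10, 2            # the sweep crosses one channel every 2 ms
up = {k: k * step for k in range(n_channels)}                       # low to high
down = {k: (n_channels - 1 - k) * step for k in range(n_channels)}  # high to low

peak_up = max(delayed_sum(up, target=6, n_inputs=4, delay_per_channel=2).values())
peak_down = max(delayed_sum(down, target=6, n_inputs=4, delay_per_channel=2).values())
```

For the up sweep all four delayed inputs arrive at the same instant, whereas for the down sweep they arrive spread out in time, so a unit that requires coincident input responds preferentially to up sweeps whose rate matches the delay per channel.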
The direction selectivity index (DSI; e.g., [11]) described in the Results section was computed as the difference between the activity elicited in a network by an up sweep and the activity elicited in the same network by a down sweep with the same duration and frequency span, normalised by their sum:
$$ \mathrm{DSI}^{\alpha} = \frac{\sum_n \int dt \, \left( [h^{\alpha e}_n(t)]_{+\Delta f} - [h^{\alpha e}_n(t)]_{-\Delta f} \right)}{\sum_n \int dt \, \left( [h^{\alpha e}_n(t)]_{+\Delta f} + [h^{\alpha e}_n(t)]_{-\Delta f} \right)} $$
where [h_n^{αe}(t)]_{Δf} is the firing rate h_n^{αe}(t) elicited in the network by a sweep with a frequency span Δf.
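The index is straightforward to compute from simulated firing rates. The sketch below assumes that activity is summarised as the firing rate integrated over time and summed across the populations of one network; the function name and the toy rates are illustrative.

```python
def direction_selectivity_index(rates_up, rates_down, dt):
    """DSI of one network from its responses to an up and a down sweep.

    rates_up[n][t] and rates_down[n][t] are the firing rates of population n
    at time step t in response to sweeps with spans +delta_f and -delta_f.
    """
    act_up = sum(sum(r) for r in rates_up) * dt
    act_down = sum(sum(r) for r in rates_down) * dt
    return (act_up - act_down) / (act_up + act_down)

# Toy responses of a two-population network that prefers up sweeps.
toy_up = [[1.0, 2.0], [0.0, 0.0]]
toy_down = [[0.5, 0.5], [0.0, 0.0]]
dsi = direction_selectivity_index(toy_up, toy_down, dt=0.001)
```

The index ranges from −1 (responses only to down sweeps) to 1 (responses only to up sweeps), with 0 indicating no direction preference; the toy network above yields 0.5.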
Feedback connections. Feedback connections from the sweep layers to the spectral layer were modelled according to NMDA-like synaptic gating dynamics with a finite rising time constant [41]. The gap w_ωs > 0 is enforced to avoid resonances between sweep-selective and spectral populations with the same centre frequency during the encoding of pure tones. The free parameters were initialised to standard values (the NMDA conductivity J_NMDA to the value recommended by [62], and the connectivity parameters w_ωs and Δ_ωs to 0.1N) and manually tuned so that the pitch predictions of the model (as computed in Eq (10)) matched the empirical data.
Table 1. Model parameters. Most parameters were taken from the original studies that derived the mean field approximations used in the model and are cited accordingly. Other free parameters, like the number of bins of the tonotopic axis N, were fixed to reasonable but arbitrary values at the beginning of the model construction and were not adjusted during the analyses (ad-hoc). Free parameters that were manually tuned are labelled as tuned (x), where x is: 1, for parameters tuned so that the spectral layer integrates the peripheral representation correctly; 2, for parameters tuned to achieve FM-direction selectivity; and 3, for parameters tuned so that the feedback signalling resulted in a fair fit between the model's pitch predictions and the experimental observations. The short descriptions of the parameters in the last column are further explained in the Methods section. Connectivity parameters are described in more detail in Fig 13.

Supporting information
Kernel density estimations of the difference between the perceived pitch evaluated when the sweep was presented before the probe tone, f^←_perceived, and the perceived pitch evaluated when the probe tone was presented before the sweep, f^→_perceived; no systematic effect of the presentation order was found for any of the conditions. Each sample of the distributions corresponds to the difference of the average perceived pitch between presentation orders of the same Δf for a given subject and centre frequency (N = 8 × 3 = 24). Error bars show the average and the standard error of the groups. Average difference across Δf was 145 Hz ± 227 Hz, largely overlapping 0. (EPS)
S1 Table. Summary statistics on the relationship between the perceived pitch and the frequency span for single sweeps and sweep trains. The slope of the linear fit, Pearson's correlation r_p, and Spearman's correlation r_s for the relationship between f_perceived and Δf are presented for each centre frequency f̄ and direction of the presentation (probe before sweep, →; and sweep before probe, ←). Spearman's correlation is systematically larger than Pearson's, indicating that the elicited pitch is related to Δf in a non-linear monotonic way. (TEX)
S1 Sounds. Stimuli used in the experiments. Each waveform corresponds to one of the single sweeps and sweep trains used in the first and second experiment, respectively. File names indicate the properties of the stimulus as follows: [sweep/train]_fbar<f̄>_delta<Δf>.wav; e.g., train_fbar1200Hz_delta-333Hz.wav is the sweep train with f̄ = 1200 Hz and Δf = −333 Hz used in the second experiment. (ZIP)