Time-Warp–Invariant Neuronal Processing

A biophysical mechanism acting in auditory neurons allows the brain to process the high variability of speaking rates in natural speech in a time-warp-invariant manner.


Introduction
Robustness of neuronal information processing to temporal warping of natural stimuli poses a difficult computational challenge to the brain [1][2][3][4][5][6][7][8][9]. This is particularly true for auditory stimuli, which often carry perceptually relevant information in fine differences between temporal cues [10,11]. For instance, in speech, perceptual discriminations between consonants often rely on differences in voice onset times, burst durations, or durations of spectral transitions [12,13]. A striking feature of human performance on such tasks is that it is resilient to a large temporal variability in the absolute timing of these cues. Specifically, changes in speaking rate in ongoing natural speech introduce temporal warping of the acoustic signal on a scale of hundreds of milliseconds, encompassing temporal distortions of acoustic cues that range from 2-fold compression to 2-fold dilation [14,15]. Figure 1 shows examples of time warp in natural speech. The utterance of the word "one" in (A) is compressed by nearly a factor of one-half relative to the utterance shown in (B), causing a concomitant compression in the duration of prominent spectral features, such as the transitions of the peaks in the frequency spectra. Notably, the pattern of temporal warping in speech can vary within a single utterance on a scale of hundreds of milliseconds. For example, the local time warp of the word "eight" in (C) relative to (D) reverses from compression in the initial and final segments to strong dilation of the gap between them. Although it has long been demonstrated that speech perception in humans normalizes durations of temporal cues to the rate of speech [2][16][17][18], the neural mechanisms underlying this perceptual constancy have remained mysterious.
A general solution of the time-warp problem is to undo stimulus rate variations by comodulating the internal "perceptual" clock of a sensory processing system. This clock should run slowly when the rate of the incoming signal is low and embedded temporal cues are dilated, but accelerate when the rate is fast and the temporal cues are compressed. Here, we propose a neural implementation of this solution, exploiting a basic biophysical property of synaptic inputs, namely, that in addition to charging the postsynaptic neuronal membrane, synaptic conductances modulate its effective time constant. To utilize this mechanism for time-warp-robust information processing in the context of a particular perceptual task, synaptic peak conductances at the site of temporal cue integration need to be adjusted to match the range of incoming spike rates. We show that such adjustments can be achieved by a novel conductance-based supervised learning rule. We first demonstrate the computational power of the proposed mechanism by testing our neuron model on a synthetic instantiation of a generic time-warp-invariant neuronal computation, namely, time-warp-invariant classification of random spike latency patterns. We then present a novel neuronal network model for word recognition and show that it yields excellent performance on a benchmark speech-recognition task, comparable to that achieved by highly elaborate, biologically implausible state-of-the-art speech-recognition algorithms.

Time Rescaling in Neuronal Circuits
Whereas the net current flow into a neuron is determined by the balance between excitatory and inhibitory synaptic inputs, both types of inputs increase the total synaptic conductance, which in turn modulates the effective integration time of the postsynaptic cell [19][20][21] (an effect known as synaptic shunting). Specifically, when the total synaptic conductance of a neuron is large relative to the resting conductance (leak) and is generated by linear summation of incoming synaptic events, the neuron's effective integration time scales inversely with the rate of input spikes. Hence, the shunting action of synaptic conductances can counter variations in afferent spike rates by automatically rescaling the effective integration time of the postsynaptic neuron.
We implement this mechanism in a leaky integrate-and-fire model neuron driven by $N$ exponentially decaying synaptic conductances $g_i(t) = g_i^{\max} \exp(-t/t_s)$ for $t \ge 0$ $(i = 1, \ldots, N)$. Here, $g_i^{\max}$ denotes the peak conductance of the $i$th synapse in units of sec$^{-1}$, and $t_s$ is the synaptic time constant. The total synaptic current, measured at rest, is given by
$$I_{\mathrm{syn}}(t,b) = \sum_{i=1}^{N} V_i^{\mathrm{rev}} \sum_{t_i} g_i(t - b\,t_i),$$
where $V_i^{\mathrm{rev}}$ denotes the reversal potential of the $i$th synapse relative to resting potential and $t_i$ denote the arrival times of the spikes of the $i$th afferent. The factor $b$ denotes a global scaling of all incoming spike times; $b = 1$ corresponds to the unwarped inputs. The total synaptic conductance, $G_{\mathrm{syn}}(t,b)$, is
$$G_{\mathrm{syn}}(t,b) = \sum_{i=1}^{N} \sum_{t_i} g_i(t - b\,t_i).$$
For fast synapses, the total synaptic current is essentially a train of pulses, each of which occurs at the time of an incoming spike and delivers a total charge of $g_i^{\max} t_s V_i^{\mathrm{rev}}$. Changing the rate of the incoming spikes will induce a corresponding change in the timing of these pulses but not their charge. Therefore, ignoring the effect of time warp on the time scale of $t_s$, which is short relative to the time scale of voltage modulations, the total synaptic current obeys the following time-warp scaling relation: $I_{\mathrm{syn}}(bt,b) = b^{-1} I_{\mathrm{syn}}(t,1)$. A similar scaling relation holds for the total synaptic conductance. The evolution in time of the subthreshold voltage is given by
$$\frac{dV}{dt} = -V(t)\,\bigl[g_{\mathrm{leak}} + G_{\mathrm{syn}}(t,b)\bigr] + I_{\mathrm{syn}}(t,b). \tag{1}$$
Thus, $V$ integrates the synaptic current with an effective time constant whose inverse is $1/t_{\mathrm{eff}} = g_{\mathrm{leak}} + G_{\mathrm{syn}}(t,b)$. If the contribution of $G_{\mathrm{syn}}$ is significantly larger than the leak conductance, then $1/t_{\mathrm{eff}}$ is rescaled by time warp in the same way as $G_{\mathrm{syn}}$ and $I_{\mathrm{syn}}$, and, hence, the solution of Equation 1 is approximately time-warp invariant, namely, $V(bt,b) = V(t,1)$. This result is illustrated in Figure 2, which compares the voltage traces induced by a random spike pattern for $b = 1$ and $b = 0.5$.
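The approximate invariance described above can be explored numerically. The following is a minimal sketch, not the authors' simulation code: it integrates the subthreshold voltage equation with an exponential-Euler step and compares V(bt,b) with V(t,1) in a high-conductance and in a leak-dominated regime. The function names (`simulate_voltage`, `warp_mismatch`), the parameter values (t_s = 1 ms, g_leak = 10 sec^-1, reversal potentials of +1 or -1 in arbitrary units, identical peak conductances) and the mismatch measure are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_voltage(spike_times, g_max, v_rev, b=1.0, t_s=1e-3,
                     g_leak=10.0, t_end=0.5, dt=1e-4):
    """Subthreshold voltage for dV/dt = -V (g_leak + G_syn) + I_syn.

    spike_times are unwarped arrival times (s), one per afferent; they are
    scaled by the warp factor b. Conductances are in 1/s, voltages are
    measured relative to rest (arbitrary units). All values are assumed.
    """
    n_steps = int(round(b * t_end / dt))
    # Bin the warped spike times onto the simulation grid.
    idx = np.clip(np.round(b * np.asarray(spike_times) / dt).astype(int),
                  0, n_steps - 1)
    g_in = np.zeros(n_steps)  # conductance jump arriving in each step
    i_in = np.zeros(n_steps)  # jump of g * V_rev arriving in each step
    np.add.at(g_in, idx, g_max)
    np.add.at(i_in, idx, g_max * v_rev)
    decay = np.exp(-dt / t_s)  # all synapses share the time constant t_s
    g_syn = i_syn = v = 0.0
    trace = np.empty(n_steps)
    for n in range(n_steps):
        g_syn = g_syn * decay + g_in[n]
        i_syn = i_syn * decay + i_in[n]
        g_tot = g_leak + g_syn
        v_inf = i_syn / g_tot
        # Exponential-Euler step: G and I are held constant within the step.
        v = v_inf + (v - v_inf) * np.exp(-g_tot * dt)
        trace[n] = v
    return trace

def warp_mismatch(g_max_value, b=0.5, n_afferents=500, t_end=0.5,
                  dt=1e-4, seed=0):
    """Relative RMS difference between V(bt, b) and V(t, 1)."""
    rng = np.random.default_rng(seed)
    spike_times = rng.uniform(0.0, t_end, n_afferents)
    v_rev = rng.choice([-1.0, 1.0], n_afferents)  # assumed exc./inh. mix
    g_max = np.full(n_afferents, float(g_max_value))
    v_ref = simulate_voltage(spike_times, g_max, v_rev, b=1.0,
                             t_end=t_end, dt=dt)
    v_warp = simulate_voltage(spike_times, g_max, v_rev, b=b,
                              t_end=t_end, dt=dt)
    # Map warped sample times back to unwarped time before comparing.
    t_ref = np.arange(v_ref.size) * dt
    t_unwarped = np.arange(v_warp.size) * dt / b
    v_ref_resampled = np.interp(t_unwarped, t_ref, v_ref)
    diff = v_warp - v_ref_resampled
    return np.sqrt(np.mean(diff ** 2)) / np.sqrt(np.mean(v_ref ** 2))
```

Under these assumptions, calling `warp_mismatch(100.0)` places the neuron in a regime where the mean synaptic conductance is roughly ten times the leak, whereas `warp_mismatch(0.1)` leaves the leak dominant; the first call is expected to return the smaller mismatch. The residual mismatch in the high-conductance case reflects the finite synaptic time constant and the remaining leak, both of which the scaling argument neglects.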
To perform time-warp-invariant tasks, peak synaptic conductances must be in the range of values appropriate for the statistics of the stimulus ensemble of the given task. To achieve this, we have devised a novel spike-based learning rule for synaptic conductances, the conductance-based tempotron. This model neuron learns to discriminate between two classes of spatiotemporal input spike patterns. The tempotron's classification rule requires it to fire at least one spike in response to each of its target stimuli but to remain silent when driven by a stimulus from the null class. Spike patterns from both classes are iteratively presented to the neuron, and peak synaptic conductances are modified after each error trial by an amount proportional to their contribution to the maximum value of the postsynaptic potential over time (see Materials and Methods). This contribution is sensitive to the time courses of the total conductance and voltage of the postsynaptic neuron. Therefore, the conductance-based tempotron learns to adjust, not only the magnitude of the synaptic inputs, but also its effective integration time to the statistics of the task at hand.

Learning to Classify Time-Warped Latency Patterns
We first quantified the time-warp robustness of the conductance-based tempotron on a synthetic discrimination task. We randomly assigned 1,250 spike pattern templates to target and null classes. The templates consisted of 500 afferents, each firing once at a fixed time chosen randomly from a uniform distribution between 0 and 500 ms. Upon each presentation during training and testing, the templates underwent global temporal warping by a random factor b ranging from compression by 1/b max to dilation by b max (see Materials and Methods). Consistent with the psychophysical range, b max was varied between 1 and 2.5. Remarkably, with physiologically plausible parameters, the error frequency remained almost zero up to b max <2 (Figure 3A, blue curve). Importantly, the performance of the conductance-based tempotron showed little change when the temporal warping applied to the spike templates was dynamic (see Materials and Methods) (Figure 3A). The time-warp robustness of the neural classification depends on the resting membrane time constant t m and the synaptic time constant t s. Increases in t m or decreases in t s both enhance the dominance of shunting in governing the cell's effective time constant. As a result, the performance for b max = 2.5 improved with increasing t m (Figure 3B, left) and decreasing t s (Figure 3B, right). The time-warp robustness of the conductance-based tempotron was also reflected in the shape of its subthreshold voltage traces (Figure 3C, top row) and generalized to novel spike templates with the same input statistics that were not used during training (Figure 3C, second row).
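The stimulus ensemble of this task can be sketched in a few lines. The example below generates random latency templates and applies a global warp on each presentation. It is an illustration under stated assumptions: the function names are hypothetical, class labels are drawn by independent coin flips, and the warp factor is sampled log-uniformly between 1/b max and b max; the distribution actually used is specified in Materials and Methods and may differ.

```python
import numpy as np

def make_templates(n_patterns=1250, n_afferents=500, t_max=0.5, seed=0):
    """Random latency templates: each afferent fires once in [0, t_max] s."""
    rng = np.random.default_rng(seed)
    times = rng.uniform(0.0, t_max, size=(n_patterns, n_afferents))
    # Assumed: each template is assigned to target (True) or null (False)
    # independently with probability 1/2.
    is_target = rng.integers(0, 2, size=n_patterns).astype(bool)
    return times, is_target

def sample_warp(b_max, rng):
    """Warp factor in [1/b_max, b_max]; log-uniform sampling is assumed."""
    log_b = np.log(b_max)
    return float(np.exp(rng.uniform(-log_b, log_b)))

def warp_pattern(template, b):
    """Global time warp: every spike time is scaled by the same factor b."""
    return b * np.asarray(template)
```

A single training presentation would then draw `b = sample_warp(2.5, rng)` and pass `warp_pattern(times[k], b)` to the neuron. Log-uniform sampling treats compression and dilation symmetrically, which is the reason it was chosen for this sketch.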

Author Summary
The brain has a robust ability to process sensory stimuli, even when those stimuli are warped in time. The most prominent example of such perceptual robustness occurs in speech communication. Rates of speech can be highly variable both within and across speakers, yet our perceptions of words remain stable. The neuronal mechanisms that subserve invariance to time warping without compromising our ability to discriminate between fine temporal cues have puzzled neuroscientists for several decades. Here, we describe a cellular process whereby auditory neurons recalibrate their perceptual clocks on the fly, allowing them to correct effectively for temporal fluctuations in the rate of incoming sensory events. We demonstrate that this basic biophysical mechanism allows simple neural architectures to solve a standard benchmark speech-recognition task with near perfect performance. This proposed mechanism for time-warp-invariant neural processing leads to novel hypotheses about the origin of speech perception pathologies.
Synaptic conductances were crucial in generating the neuron's robustness to temporal warping. Although an analogous neuron model with a fixed integration time, the current-based tempotron [22] (see Materials and Methods), also performed the task perfectly in the absence of time warp (b max = 1), its error frequency was sensitive even to modest temporal warping and deteriorated further when the applied time warp was dynamic (Figure 3A, red curve). Similarly, the voltage traces of this current-based neuron showed strong dependence on the degree of temporal warping applied to an input spike train (Figure 3C, bottom trace pair). Finally, the error frequency of the current-based neuron at b max = 2.5 showed only negligible improvement upon varying the values of the membrane and synaptic time constants (Figure 3B), highlighting the limited capabilities of fixed neural kinetics to subserve time-warp-invariant spike-pattern classification.
Note that in the present classification task, the degree of time-warp robustness depends also on the learning load, i.e., the number of patterns that have to be classified by a neuron (unpublished data). A given degree of time warp translates into a finite range of distortions of the intracellular voltage traces. If these distortions remain smaller than the margins separating the neuronal firing threshold and the intracellular peak voltages, a neuron's classification will be time-warp invariant. Since the maximal possible margins increase with decreasing learning load, time-warp invariance can be traded for storage capacity. This tradeoff is governed by the susceptibility of the voltage traces to time warp. If the susceptibility is high, as in the current-based tempotron, robustness to time warp comes at the expense of a substantial reduction in storage capacity. If it is low, as in the conductance-based tempotron, time-warp invariance can be achieved even when operating close to the neuron's maximal storage capacity for unwarped patterns.

Adaptive Plasticity Window
In the conductance-based tempotron, synaptic conductances controlled not only the effective integration time of the neuron, but also the temporal selectivity of the synaptic update during learning. The tempotron learning rule modifies only the efficacies of the synapses that were activated in a temporal window prior to the peak in the postsynaptic voltage trace. However, the width of this temporal plasticity window is not fixed but depends on the effective integration time of the postsynaptic neuron at the time of each synaptic update trial, which in turn varies with the input firing rate at each trial and the strength of the peak synaptic conductances at this stage of learning (Figure 4). During epochs of high conductance (warm colors), only synapses that fired shortly before the voltage maximum were appreciably modified. In contrast, when the membrane conductance was low (cool colors), the plasticity window was broad. The ability of the plasticity window to adjust to the effective time constant of the postsynaptic voltage is crucial for the success of the learning. As is evident from Figure 4, the membrane's effective time constant varies considerably during the learning epochs; hence, a plasticity rule that does not take this into account fails to credit appropriately the different synapses.

Task Dependence of Learned Synaptic Conductance
The evolution of synaptic peak conductances during learning was driven by task requirements. When we replaced the temporal warping of the spike templates by random Gaussian jitter [22] (see Materials and Methods), conductance-based tempotrons that had acquired high synaptic peak conductances during initial training on the time-warp task readjusted their synaptic peak conductances to low values ( Figure 5, inset). The concomitant increase in their effective integration time constants from roughly 10 ms to 50 ms improved the neurons' ability to average out the temporal spike jitter and substantially enhanced their task performance ( Figure 5).

Neuronal Model of Word Recognition
To address time-warp-invariant speech processing, we studied a neuronal module that learns to perform word-recognition tasks. Our model consists of two auditory processing stages. The first stage (Figure 6) consists of an afferent population of neurons that convert incoming acoustic signals into spike patterns by encoding the occurrences of elementary spectrotemporal events. This layer forms a 2-dimensional tonotopy-intensity auditory map. Each of its afferents generates spikes by performing an onset or offset threshold operation on the power of the acoustic signal in a given frequency band. Whereas an onset afferent elicits a spike whenever the log signal power crosses its threshold level from below, offset afferents encode the occurrences of downward crossings (see Materials and Methods) (cf. also [6,23]). Different on and off neurons coding for the same frequency band differ in their threshold value, reflecting a systematic variation in their intensity tuning. The second, downstream, layer consists of neurons with plastic synaptic peak conductances that are governed by the conductance-based tempotron plasticity rule. These neurons are trained to perform word discrimination tasks. We tested this model on a digit-recognition benchmark task with the TI46 database [24]. We trained each of the 20 conductance-based tempotrons of the second layer to perform a distinct gender-specific binary classification, requiring it to fire in response to utterances of one digit and speaker gender, and to remain quiescent for all other stimuli.
Figure 5. Task dependence of the learned total synaptic conductance. Error frequency of the conductance-based tempotron versus its effective integration time t eff. After switching from time-warp to Gaussian spike jitter, t eff increased as the mean time-averaged total synaptic conductance G decreased with learning time (inset). doi:10.1371/journal.pbio.1000141.g005
After training, the majority of these digit detector neurons (70%) achieved perfect classification of the test set, and the remaining ones performed their task with a low error (Table 1). Based on the spiking activity of this small population of digit detector neurons, a full digit classifier (see Materials and Methods) that weighted spikes according to each detector's individual performance achieved an overall word error rate of 0.0017. This performance is on par with the error rates of state-of-the-art artificial speech-recognition systems such as the Hidden Markov model-based Sphinx-4 and HTK, which yield error rates of 0.0017 [25] and 0.0012 [26], respectively, on the same benchmark.
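The onset and offset threshold operation of the first processing stage can be illustrated with a short sketch. It assumes that the log signal power of one frequency band is already available as a sampled array; the function name, the sampling step `dt`, and the choice to time a spike at the first sample past the crossing are hypothetical details of this example, not taken from Materials and Methods.

```python
import numpy as np

def threshold_crossing_spikes(log_power, thresholds, dt):
    """Onset and offset spike times for one frequency band.

    log_power  : 1-D array, log signal power sampled every dt seconds
    thresholds : one threshold per pair of onset/offset afferents
    Returns two lists (onset, offset), each holding one array of spike
    times per threshold.
    """
    log_power = np.asarray(log_power, dtype=float)
    onset_spikes, offset_spikes = [], []
    for threshold in thresholds:
        above = log_power >= threshold
        # Upward crossing: below at sample n, at or above at sample n + 1.
        up = np.flatnonzero(~above[:-1] & above[1:]) + 1
        # Downward crossing: at or above at sample n, below at sample n + 1.
        down = np.flatnonzero(above[:-1] & ~above[1:]) + 1
        onset_spikes.append(up * dt)
        offset_spikes.append(down * dt)
    return onset_spikes, offset_spikes
```

Applying this function to every frequency band with a graded set of thresholds gives a two-dimensional frequency-by-threshold array of afferents, in the spirit of the tonotopy-intensity map described above. Because each afferent fires only at crossings, a faster utterance produces the same spikes at a proportionally higher population rate.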

Learned Spectrotemporal Target Features
To reveal qualitatively some of the mechanisms used by our digit detector neurons to selectively detect their target word, we compared the learned synaptic distributions (Figure 7A) of two digit detector neurons ("one" and "four") to the average spectrograms of each neuron's target stimuli aligned to the times of its output spikes (Figure 7B; see Materials and Methods). The spectrotemporal features that preceded the output spikes (time zero, grey vertical lines) corresponded to the frequency-specific onset and offset selectivity of the excitatory afferents (Figure 7A, warm colors). These examples demonstrate how the patterned excitatory and inhibitory inputs from both onset and offset neurons are tuned to features of the speech signal. For instance, a prominent feature in the averaged spectrogram of the word "one" (male speakers) was the increase in onset time of the power in the low-frequency channels with the frequency of the channel (Figure 7B, left, channels 1-16). This gradual onset was encoded by a diagonal band of excitatory onset afferents whose thresholds decreased with increasing frequency (Figure 7A, left). By compensating for the temporal lag between the different lower-frequency channels, this arrangement ensured a strong excitatory drive when a target stimulus was presented to the neuron. The spectrotemporal feature learned by the word "four" (male speakers) detector neuron combined decreasing power in the low-frequency range with rising power in the mid-frequency range (Figure 7B, right). This feature was encoded by synaptic efficacies through a combination of excitatory offset afferents in the low-frequency range (Figure 7A, right, channels 1-11) and excitatory onset afferents in the mid-frequency range (channels 12-19).
Excitatory synaptic populations were complemented by inhibitory inputs (Figure 7A, blue patches) that prevented spiking in response to null stimuli and also increased the total synaptic conductance. The substantial differences between the mean spike-triggered voltage traces for target stimuli (Figure 7C, blue) and the mean maximum-triggered voltage traces for null stimuli (red) underline the high target word selectivity of the learned synaptic distributions as well as the relatively short temporal extent of the learned target features.
In the examples shown, the average position of the neural decision relative to the target stimuli varied from early to late ( Figure 7B, left vs. right). This important degree of freedom stems from the fact that the tempotron decision rule does not constrain the time of the neural decision. As a result, the learning process in each neuron can select the spectrotemporal target features from any time window within the word. The selection of the target feature by the learning takes into account both the requirement of triggering output spikes in response to target stimuli as well as the demand to remain silent during null stimuli. Thus, for each target neuron, the selected features reflect the statistics of both the target and the null stimuli.

Generalization Abilities of Word Detector Neurons
We have performed several tests designed to assess the ability of the model word detector neurons to perform well on new input sets that differ in statistics from the training database. First, we assessed the ability of the neurons to generalize to unfamiliar speakers and dialects. After training the model with the TI46 database, as described above, we measured its digit-recognition performance on utterances from another database, the TIDIGITS database [27], which includes speech samples from a variety of English dialects (see Materials and Methods). This test was done without any retraining of the network synapses. The resulting word error rate of 0.0949 compares favorably to the performance of the HTK system, which resulted in an error rate of 0.2156 when subjected to the same generalization test (see Materials and Methods). Across all dialects, our model performed perfectly for roughly one-quarter of all speakers and with at most one error for half of them. Within the best dialect group, an error of at most one word was achieved for as many as 80% of the speakers (Table S1). These results underline the ability of our neuronal word-recognition model to generalize to unfamiliar speakers across a wide range of unfamiliar dialects.
An interesting question is whether our model neurons are able to generalize their performance to novel time-warped versions of the trained inputs. To address this question, we have tested their performance on randomly generated time-warped versions of the input spikes corresponding to the trained word utterances, without retraining. As shown in Figure 8, the neurons exhibited considerable time-warp-robust performance on the digit-recognition task. [...] Figure 7) were insensitive to a 2-fold time warp of the input spike trains. The "seven" detector neuron (male, red line) showed higher sensitivity to such warping; nevertheless, its error rate remained low. Consistent with the proposed role of synaptic conductances, the degree of time-warp robustness was correlated with the total synaptic conductance, here quantified through the mean effective integration time t eff (Figure 8B). Additionally, the mean voltage traces induced by the target stimuli (Figure 8C, lower traces) showed a substantially smaller sensitivity to temporal warping than their current-based analogs (see Materials and Methods) (Figure 8C, upper traces).
We also found that our model word detector neurons are robust to the introduction of spike failures in their input patterns. For each neuron, we have measured its performance on inputs which were corrupted by randomly deleting a fraction of the incoming spikes, again without retraining. For the majority of neurons, the error percentage increased by less than 0.01% for each percent increase in spike failures (Figure 9). This high robustness reflects the fact that each classification is based on integrating information from many presynaptic sources.
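The spike-failure test lends itself to a compact sketch: delete each input spike independently with a given probability, measure the error at several failure levels, and fit a line to the result. The helper names below are hypothetical, and the classifier that would produce the error values is not shown; only the corruption step and the regression are illustrated.

```python
import numpy as np

def delete_spikes(spike_times, p_fail, rng):
    """Remove each spike independently with probability p_fail."""
    spike_times = np.asarray(spike_times, dtype=float)
    keep = rng.random(spike_times.size) >= p_fail
    return spike_times[keep]

def fit_line(x, y):
    """Least-squares line through (x, y); returns slope, intercept, R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residual = y - (slope * x + intercept)
    r_squared = 1.0 - np.sum(residual ** 2) / np.sum((y - y.mean()) ** 2)
    return slope, intercept, r_squared
```

In use, one would evaluate a trained detector on `delete_spikes(pattern, p, rng)` for a range of failure probabilities `p` and pass the measured error values to `fit_line`; the slope then plays the role of the sensitivity to spike failures reported per neuron.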

Automatic Rescaling of Effective Integration Time by Synaptic Conductances
The proposed conductance-based time-rescaling mechanism is based on the biophysical property of neurons that their effective integration time is shaped by synaptic conductances and therefore can be modulated by the firing rate of their afferents. To utilize these modulations for time-warp-invariant processing, a central requirement is a large evoked total synaptic conductance that dominates the effective integration time constant of the postsynaptic cell through shunting. In our speech-processing model, large synaptic conductances, with a median value of three times the leak conductance across all digit detector neurons (cf. Figure 8B), result from a combination of excitatory and inhibitory inputs. This is consistent with high total synaptic conductances, comprising excitation and inhibition, that have been observed in several regions of cortex [28] including auditory [29,30], visual [31,32], and also prefrontal [33,34] (but see ref. [35]). Our model predicts that in cortical sensory areas, the time-rescaled intracellular voltage traces (cf. Figure 3C), and consequently, also the rescaled spiking responses of neurons that operate in the proposed fashion, remain invariant under temporal warping of the neurons' input spike patterns. These predictions can be tested by intra- and extracellular recordings of neuronal responses to temporally warped sensory stimuli.
A large total synaptic conductance is associated with a substantial reduction in a neuron's effective integration time relative to its resting value. Therefore, the resting membrane time constant of a neuron that implements the automatic time-rescaling mechanism must substantially exceed the temporal resolution that is required by a given processing task. Because the word-recognition benchmark task used here comprises whole-word stimuli that favored effective time constants on the order of several tens of milliseconds, we used a resting membrane time constant of t m = 100 ms. Whereas values of this order have been reported in hippocampus [36] and cerebellum [21,37], it exceeds current estimates for neocortical neurons, which range between 10 and 30 ms [35,38,39]. Note, however, that the correspondence between our passive membrane model and the experimental values, which typically include contributions from various voltage-dependent conductances, is not straightforward. Our model predicts that neurons specialized for time-warp-invariant processing at the whole-word level have relatively long resting membrane time constants. It is likely that the auditory system solves the problem of time-warp-invariant processing of the sound signal primarily at the level of shorter speech segments such as phonemes. This is supported by evidence that primary auditory cortex has a special role in speech processing at a resolution of milliseconds to tens of milliseconds [11][12][13]. Our mechanism would enable time-warp-invariant processing of phonetic segments with resting membrane time constants in the range of tens of milliseconds, and much shorter effective integration times.
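As a rough worked example of these orders of magnitude, one can insert a time-averaged synaptic conductance into the relation 1/t eff = g leak + G syn. The calculation below assumes that the time average may stand in for the fluctuating conductance and combines the median conductance of three times the leak with the resting time constant t m = 1/g leak = 100 ms; the symbol \bar{G}_{\mathrm{syn}} for the time average is introduced here for illustration only.

```latex
\[
t_{\mathrm{eff}}
  = \frac{1}{g_{\mathrm{leak}} + \bar{G}_{\mathrm{syn}}}
  = \frac{1}{g_{\mathrm{leak}}\,(1 + 3)}
  = \frac{t_m}{4}
  = 25\ \mathrm{ms}
  \qquad \text{for } \bar{G}_{\mathrm{syn}} = 3\,g_{\mathrm{leak}},\ t_m = 100\ \mathrm{ms}.
\]
```

The result falls in the range of several tens of milliseconds mentioned above. The same relation indicates that a resting time constant of a few tens of milliseconds would give an effective integration time of only several milliseconds at the same conductance ratio.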
The proposed neuronal time-rescaling mechanism assumes linear summation of synaptic conductances. This assumption is challenged by the presence of voltage-dependent conductances in neuronal membranes. Since the potential implications for our model depend on the specific nonlinearity induced by a cell-type-specific composition of different ionic channels, it is hard to evaluate the overall effect on our model in general terms. Nevertheless, because the mechanism is inherent to conductance-based synaptic integration, we expect it to cope gracefully with moderate levels of nonlinearity. As an example, we tested its behavior in the presence of an h-like conductance (see Materials and Methods) that opposes conductance changes induced by depolarizing excitatory synaptic inputs and is active at the resting potential. As expected, we found that physiological levels of h-conductances resulted in only moderate impairment of the automatic time-rescaling mechanism (Figure S1).
For the sake of simplicity as well as numerical efficiency, we have assumed symmetric roles of excitation and inhibition in our model architecture. We have checked that this assumption is not crucial for the operation of the automatic time-rescaling mechanism and the learning of time-warped random latency patterns. Specifically, we have implemented the random latency classification task for a control architecture in which all synapses were constrained to be excitatory except a single global inhibitory input that, mimicking a global inhibitory network, received a separate copy of all incoming spikes. In this architecture, all spike patterns have to be encoded by the excitatory synaptic population, and the role of inhibition is reduced to a global signal that has equal strength for all input patterns. Due to the limitations of this architecture, this model showed some reduction of storage capacity relative to the symmetric case, but the automatic time-rescaling mechanism remained intact. For a time-warp scale of b max = 2.5 (cf. Figure 3), the global inhibition model roughly matched the performance of the symmetric model when the learning load was lowered to 1.5 spike patterns per synapse, with an error fraction of 0.18%.

Supervised Learning of Synaptic Conductances
To utilize synaptic conductances as efficient controls of the neuron's clock, the peak synaptic conductances must be plastic so that they adjust to the range of integration times relevant for a given perceptual task. This was achieved in our model by our novel supervised spike-based learning rule. This plasticity rule posits that the temporal window during which pre- and postsynaptic activity interact continuously adapts to the effective integration time of the postsynaptic cell (Figure 4). The polarity of synaptic changes is determined by a supervisory signal that we hypothesize to be realized through neuromodulatory control [22]. Because present experimental measurements of spike-timing-dependent synaptic plasticity rules have assumed an unsupervised setting, i.e., have not controlled for neuromodulatory signals (but see [40]), existing results do not directly apply to our model. Nevertheless, recent data have revealed complex interactions between the statistics of pre- and postsynaptic spiking activity and the expression of synaptic changes [41][42][43][44]. Our model offers a novel computational rationale for such interactions, predicting that for fixed supervisory signaling, the temporal window of plasticity shrinks with growing levels of postsynaptic shunting. One challenge for the biological implementation of the tempotron learning rule is the need to compute the time of the maximum of the postsynaptic voltage. We have previously shown for a current-based neuron model that this temporally global operation can be approximated by temporally local computations that are based on the postsynaptic voltage traces following input spikes [22]. We have extended this approach to plastic synaptic conductances and checked that the resulting biologically plausible implementation of conductance-based tempotron learning can readily subserve time-warp-invariant classification of spike patterns.
Specifically, in this implementation, the induction of synaptic plasticity is controlled by the correlation of the postsynaptic voltage and a synaptic learning kernel (see Materials and Methods) whose temporal extent is controlled by the average conductance throughout a given error trial. A synaptic peak conductance is changed by a uniform amount whenever this correlation exceeds a fixed plasticity induction threshold. When tested on the time-warped latency patterns with b max = 2.5 (cf. Figure 3), the correlation-based tempotron roughly matched the voltage maximum-based version at a reduced learning load of 1.5 patterns per synapse, with an error fraction of 0.35%.

Time-Warp Invariance Is Task Dependent
In our model, dynamic time-warp-invariant capabilities become available through a conductance-based learning rule that tunes the shunting action of synaptic conductances. This learning rule enables neurons to adjust the degree of synaptic shunting to the requirements of a given processing task. As a result, our model can naturally encompass a continuum of functional specializations ranging from neurons that are sensitive to absolute stimulus durations by employing low total synaptic conductances, to time-warp-invariant feature detectors that operate in a high-conductance regime. In the context of auditory processing, such a functional segregation into neurons with slower and faster effective integration times is reminiscent of reports suggesting that rapid temporal processing in time frames of tens of milliseconds is localized in left lateralized language areas, whereas processing of slower temporal features is attributed to right hemispheric areas [45][46][47]. Although anatomical and morphological asymmetries between left and right human auditory cortices are well documented [48], it remains to be seen whether these differences form the physiological substrate for a left lateralized implementation of the proposed time-rescaling mechanism. Consistent with this picture, the general tradeoff between high temporal resolution and robustness to temporal jitter that is predicted by our model (Figure 5) parallels reports of the vulnerability of the lateralization of language processing with respect to background acoustic noise [49] as well as to abnormal timing of auditory brainstem responses [50].
Figure 9. Robustness to spike failures. The error fraction of each digit detector neuron was measured as a function of the spike failure probability over the range from 0% to 10% and fitted by linear regression. For each neuron, the resulting slope (median 0.0069) is plotted versus the intercept (median 0.0061) with symbols and colors as in Figure 8B. The median R^2 of the linear regression fits was 0.94. The inset shows the median error fraction of the population as a function of the spike failure probability in the range of 1% to 50%, with the robust regime breaking down at approximately 20%. doi:10.1371/journal.pbio.1000141.g009

Neuronal Circuitry for Time-Warp-Invariant Feature Detection
The architecture of our speech-processing model encompasses two auditory processing stages. The first stage transforms acoustic signals into spatiotemporal patterns of spikes. To engage the proposed automatic time-rescaling mechanism, the population rate of spikes elicited in this afferent layer must track variations in the rate of incoming speech. Such behavior emerges naturally in a sparse coding scheme in which each neuron responds transiently to the occurrences of a specific acoustic event within the auditory input. As a result, variations in the rate of acoustic events are directly translated into concomitant variations in the population rate of elicited spikes. In our model, the elementary acoustic events correspond to onset and offset threshold crossings of signal power within specific frequency channels. Such frequency-tuned onset and offset responses featuring a wide range of dynamic thresholds have been observed in the inferior colliculus (IC) of the auditory midbrain [51]. This nucleus is the site of convergence of projections from the majority of lower auditory nuclei and is often referred to as the interface between the lower brain stem auditory pathways and the auditory cortex. Correspondingly, we hypothesize that the layer of time-warp-invariant feature detector neurons in our model implements neurons located downstream of the IC, most probably in primary auditory cortex. Current studies on the functional role of the auditory periphery in speech perception and its pathologies have been limited by the lack of biologically plausible neuronal readout architectures; a limitation overcome by our model, which allows evaluation of specific components of the auditory pathway in a functional context.

Implications for Speech Processing
Psychoacoustic studies have indicated that the neural mechanism underlying the perceptual normalization of temporal speech cues is involuntary, i.e., it is cognitively impenetrable [16], controlled by physical rather than perceived speaking rate [17], confined to a temporally local context [2,18], not specific to speech sounds [52], and already operational in prearticulate infants [53]. The proposed conductance-based time-rescaling mechanism is consistent with these constraints. Moreover, our model posits a direct functional relation between high synaptic conductances and the time-warp robustness of human speech perception. This relation gives rise to a novel mechanistic hypothesis explaining the impaired capabilities of elderly listeners to process time-compressed speech [54,55]. We hypothesize that the down-regulation of inhibitory neurotransmitter systems in aging mammalian auditory pathways [56,57] limits the total synaptic conductance and therefore prevents the time-rescaling mechanism from generating short, effective time constants through synaptic shunting. Furthermore, our model implies that comprehension deficits in older adults should be linked specifically to the processing of phonetic segments that contain fast time-compressed temporal cues. Our hypothesis is consistent with two interrelated lines of evidence. First, comprehension difficulties of time-compressed speech in older adults are more likely a consequence of an age-related decline in central auditory processing than attributes of a general cognitive slowing [56,58]. Second, recent reports have indicated that recognition differences between young and elderly listeners originate mainly from the temporal compression of consonants, which often feature rapid spectral transitions, but not from steady-state segments [54,55,58] of speech. Finally, our hypothesis posits that speaking rate-induced shifts in perceptual category boundaries [2,16,17] should be age-dependent, i.e., their magnitude should decrease with increasing listener age. This prediction is straightforwardly testable within established psychoacoustic paradigms.

Connections to Other Models of Time-Warp-Invariant Processing
In a previous neuronal model of time-warp-invariant speech processing [5,6], sequences of acoustic events are converted into patterns of transiently matching firing rates in subsets of neurons within a population, which trigger synchronous firing in a downstream readout circuit. The identity of neurons whose firing rates converge to an identical value during an input pattern, and hence also the pattern of synchrony emerging in the readout layer, depends only on the relative timing of the events, not on the absolute duration of the auditory signal. However, for this model to recognize multiple input patterns, the convergence of firing rates is only approximate. Therefore, the resulting time-warp robustness is limited and, as in our model, dependent on the learning load. Testing this model on our synthetic classification task (cf. Figure 3) indicated a substantially smaller storage capacity than is realizable by the conductance-based tempotron (Text S1). An additional disadvantage of this approach is that it copes only with global (uniform) temporal warping. Invariant processing of dynamic time warp as is exhibited by natural speech (cf. Figures 1C and 1D) is more challenging since it requires robustness to local temporal distortions of a certain statistical character. Established algorithms that can cope with dynamically time-warped signals are typically based on minimizing the deviation between an observed signal and a stored reference template [59][60][61]. These algorithms are computationally expensive and lack biologically plausible neuronal implementations. By contrast, our conductance-based time-rescaling mechanism results naturally from the biophysical properties of input integration at the neuronal membrane and does not require dedicated computational resources. Importantly, our model does not rely on a comparison between the incoming signal and a stored reference template. Rather, after synaptic conductances have adjusted to the statistics of a given stimulus ensemble, the mechanism generalizes and automatically stabilizes neuronal voltage responses against dynamic time warp even when processing novel stimuli (cf. Figure 3C). The architecture of our neuronal model also fundamentally departs from the decades-old layout of Hidden Markov Model-based artificial speech-recognition systems, which employ probabilistic models of state sequences. These systems are hard to reconcile with the biological reality of neuronal system architecture, dynamics, and plasticity. The similarity in performance between our model and such state-of-the-art systems on a small vocabulary task highlights the powerful processing capabilities of spike-based neural representations and computation.

Generality of Mechanism
Although the present work focuses on the concrete and well-documented example of time-warp robustness in the context of neural speech processing, the proposed mechanism of automatic rescaling of integration time is general and applies also to other problems of neuronal temporal processing such as birdsong recognition [3], insect communication [9], and other ethologically important natural auditory signals. Moreover, robustness of neuronal processing to temporal distortions of spike patterns is not only important for the processing of stimulus time dependencies, but also in the context of spike-timing-based neuronal codes in which the precise temporal structure of spiking activity encodes information about nontemporal physical stimulus dimensions [62]. Evidence for such temporal neural codes has been reported in the visual [63][64][65], auditory [66], and somatosensory [67], as well as the olfactory [68] pathways. As a result, we expect mechanisms of time-warp-invariant processing to also play a role in generating perceptual constancies along nontemporal stimulus dimensions such as contrast invariance in vision or concentration invariance in olfaction [4]. Finally, time warp has also been described in intrinsically generated brain signals. Specifically, the replay of hippocampal and cortical spiking activity at variable temporal warping [69,70] suggests that our model has applicability beyond sensory processing, possibly also encompassing memory storage and retrieval.

Conductance-Based Neuron Model
Numerical simulations of the conductance-based tempotron were based on exact integration [71] of the conductance-based voltage dynamics defined in Equation 1. With the membrane capacitance set to 1, the resting membrane time constant in this model is $\tau_m = 1/g_{\mathrm{leak}}$. Implementing an integrate-and-fire neuron model, an output spike was elicited when $V(t)$ crossed the firing threshold $V_{\mathrm{thr}}$. After a spike at $t_{\mathrm{spike}}$, the voltage was smoothly reset to the resting value by shunting all synaptic inputs that arrived after $t_{\mathrm{spike}}$ (cf. [22]). We used $V_{\mathrm{thr}} = 1$, $V_{\mathrm{rest}} = 0$, and reversal potentials $V_{\mathrm{ex}}^{\mathrm{rev}} = 5$ and $V_{\mathrm{in}}^{\mathrm{rev}} = -1$ for excitatory and inhibitory conductances, respectively. Unless stated otherwise, the resting membrane time constant was set to $\tau_m = 100$ ms throughout our work [20]. For the synaptic time constant, we used $\tau_s = 1$ ms in the random latency task, which minimized the error of the current-based neuron, and $\tau_s = 5$ ms in the speech-recognition tasks.
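The following Python sketch illustrates how synaptic conductances shorten the effective integration time in a model of this kind. It is not the exact integration scheme of [71]: it uses forward-Euler steps, assumes the standard conductance-based membrane equation (leak plus synaptic conductances, each driving the voltage toward its reversal potential), assumes exponentially decaying synaptic conductances, and omits the spike threshold and reset. The function names and the default step size are illustrative choices, not taken from the original work.

```python
import math

def effective_time_constant(g_leak, g_syn_mean):
    """Effective membrane time constant for unit capacitance."""
    return 1.0 / (g_leak + g_syn_mean)

def simulate_voltage(spikes, g_max, v_rev, g_leak=0.01, tau_s=1.0,
                     v_rest=0.0, t_end=300.0, dt=0.05):
    """Forward-Euler sketch of subthreshold conductance-based dynamics.

    spikes: list of (time_ms, synapse_index); g_max, v_rev: per-synapse lists.
    Returns (voltage trace, time-averaged total synaptic conductance).
    Assumption: each spike adds g_max[i] to an exponentially decaying conductance.
    """
    arrivals = {}
    for t_spk, i in spikes:
        arrivals.setdefault(int(round(t_spk / dt)), []).append(i)
    g = [0.0] * len(g_max)
    decay = math.exp(-dt / tau_s)
    v, trace, g_sum = v_rest, [], 0.0
    n_steps = int(round(t_end / dt))
    for k in range(n_steps):
        for i in arrivals.get(k, []):
            g[i] += g_max[i]
        g_tot = sum(g)
        drive = sum(gi * vr for gi, vr in zip(g, v_rev))
        # Assumed form: dV/dt = -g_leak (V - V_rest) - sum_i g_i(t) (V - V_rev_i), C = 1
        v += dt * (-g_leak * (v - v_rest) - g_tot * v + drive)
        trace.append(v)
        g_sum += g_tot
        g = [gi * decay for gi in g]
    return trace, g_sum / n_steps
```

With `g_leak = 0.01` per ms the resting time constant is 100 ms, as in the text; any positive mean synaptic conductance returned by `simulate_voltage` gives a shorter value from `effective_time_constant`.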

H-Current
To check the effect of nonsynaptic voltage-dependent conductances on the automatic time-rescaling mechanism, we implemented an h-like current $I_h$ after [72] as a noninactivating current with HH-type dynamics. Its maximal conductance is $g_h^{\max}$, its reversal potential is $V_h^{\mathrm{rev}} = -20$ mV, and $m$ is its voltage-dependent activation variable, whose kinetics are set by the voltage-dependent rate constants a and b. In Figure S1, values of $\Lambda(g_h^{\max}, \beta)$ are normalized by $\Lambda(0, \beta)$. The voltage traces were generated by random latency patterns and uniform synaptic peak conductances as used in Figure 2. As increasing values of $g_h^{\max}$ depolarized the neuron's resting potential, excitatory and inhibitory synaptic conductances were balanced separately for each value of $g_h^{\max}$.

Current-Based Neuron Model
In the current-based tempotron that was implemented as described in [22], each input spike evoked an exponentially decaying synaptic current that gave rise to a postsynaptic potential with a fixed temporal profile. In Figure 8C (upper row), voltage traces of a current-based analog of a conductance-based tempotron with learned synaptic conductances $g_i^{\max}$, reversal potentials $V_i^{\mathrm{rev}}$, and effective membrane integration time $\tau_{\mathrm{eff}}$ (cf. Figure 8B) were computed by setting the synaptic efficacies $\omega_i$ of the current-based neuron to $\omega_i = g_i^{\max} V_i^{\mathrm{rev}}$ and its membrane time constant to $\tau_m = \tau_{\mathrm{eff}}$. The resulting current-based voltage traces were scaled such that for each pair of models, the mean voltage maxima for unwarped stimuli ($\beta = 1$) were equal.

Tempotron Learning
Following [22], changes in the synaptic peak conductance $g_i^{\max}$ of the $i$th synapse after an error trial were given by the gradient of the postsynaptic potential, $\Delta g_i^{\max} \propto \partial V(t_{\max})/\partial g_i^{\max}$, at the time of its maximal value $t_{\max}$. To compute the synaptic update for a given error trial, the exact solution of Equation 1 was differentiated with respect to $g_i^{\max}$ and evaluated at $t_{\max}$, which was determined numerically for each error trial. Whenever a synaptic peak conductance attempted to cross to a negative value, its reversal potential was switched.

Voltage Correlation-Based Learning
A voltage correlation-based approximation of tempotron learning was implemented by extending the approach in [22] such that the change in the synaptic peak conductance $g_i^{\max}$ of the $i$th synapse due to a spike at time $t_i$ was governed by the correlation $\nu_i = \int_{t_i}^{\infty} dt\, V(t)\, K_{\mathrm{learn}}(t - t_i)$ of the postsynaptic potential $V(t)$ with a synaptic learning kernel $K_{\mathrm{learn}}(t) = (\exp(-t/\tau_{\mathrm{learn}}) - \exp(-t/\tau_s))/(\tau_{\mathrm{learn}} - \tau_s)$. The two time constants of the synaptic learning kernel were the synaptic time constant $\tau_s$ and the learning time constant $\tau_{\mathrm{learn}} = 1/(g_{\mathrm{leak}} + G_{\mathrm{syn}})$, which was determined by the time-averaged synaptic conductance $G_{\mathrm{syn}}$ of the current error trial and approximated the effective membrane time constant during that trial. The voltage maximum operation was approximated by thresholding $\nu_i$ at the plasticity induction threshold $\kappa$, yielding changes of $\pm 1$ for excitatory conductances on target and null patterns, respectively, and changes with the reversed polarity, $\mp 1$, for inhibitory conductances. The plasticity induction threshold was set to $\kappa = 0.45$.
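As a rough illustration of this rule, the sketch below evaluates the correlation of a sampled voltage trace with the double-exponential learning kernel and applies the threshold. The Riemann-sum integration, the function names, and the Boolean flags are assumptions made for the example; only the kernel shape, the two time constants, and the threshold value of 0.45 come from the text. The returned unit change would still be scaled by the learning rate.

```python
import math

def learning_kernel(t, tau_learn, tau_s):
    """Double-exponential learning kernel; zero for negative lags."""
    if t < 0.0:
        return 0.0
    return (math.exp(-t / tau_learn) - math.exp(-t / tau_s)) / (tau_learn - tau_s)

def voltage_correlation(volt, dt, t_spike, tau_learn, tau_s):
    """Approximate the integral of V(t) K(t - t_spike) from t_spike onward
    by a Riemann sum over a voltage trace sampled every dt."""
    total = 0.0
    for k, v in enumerate(volt):
        t = k * dt
        if t >= t_spike:
            total += v * learning_kernel(t - t_spike, tau_learn, tau_s) * dt
    return total

def unit_conductance_change(nu, on_target_pattern, excitatory, kappa=0.45):
    """Thresholded update: +1/-1 for excitatory synapses on target/null
    patterns, reversed polarity for inhibitory synapses, 0 below threshold."""
    if nu <= kappa:
        return 0.0
    sign = 1.0 if on_target_pattern else -1.0
    return sign if excitatory else -sign
```

Because the kernel integrates to one, a synapse whose spike is followed by a sustained voltage near the firing threshold of 1 yields a correlation close to 1 and therefore crosses the induction threshold.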

Learning Rate and Momentum Term
As in [22], we employed a momentum heuristic to accelerate learning in all learning rules. In this scheme, synaptic updates $(\Delta g_i^{\max})_{\mathrm{current}}$ consisted not only of the correction $\lambda \Delta g_i^{\max}$, which was given by the learning rule and the learning rate $\lambda$, but also incorporated a fraction $\mu$ of the previous synaptic change, $(\Delta g_i^{\max})_{\mathrm{current}} = \lambda \Delta g_i^{\max} + \mu (\Delta g_i^{\max})_{\mathrm{previous}}$. We used an adaptive learning rate that decreased from its initial value $\lambda_{\mathrm{ini}}$ as the number of learning cycles $l$ grew, $\lambda = \lambda_{\mathrm{ini}}/(1 + 10^{-4}(l - 1))$. A learning cycle corresponded to one iteration through the batch of templates in the random latency task or the training set in the speech task.
Random latency task training. To ensure a fair comparison between the conductance-based and the current-based tempotrons (cf. Figure 3A), the learning rule parameters $\lambda_{\mathrm{ini}}$ and $\mu$ were optimized for each model. Specifically, for each value of $\beta_{\max}$, optimal values over a 2-dimensional grid were determined by the minimal error frequency achieved during runs over $10^5$ cycles, with synaptic efficacies starting from Gaussian distributions with zero mean and standard deviations of 0.001. The optimization was performed over five realizations.
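The momentum update and the decaying learning rate can be written compactly. The sketch below is a minimal illustration with hypothetical function names: the correction term is assumed to be supplied by whichever learning rule is in use, and the decay constant of 10^-4 per learning cycle is the value given in the text.

```python
def learning_rate(cycle, lam_ini):
    """Adaptive learning rate for learning cycle l = 1, 2, ..."""
    return lam_ini / (1.0 + 1e-4 * (cycle - 1))

def momentum_update(correction, previous_change, lam, mu):
    """Current synaptic change: scaled correction plus a fraction mu
    of the change applied on the previous update."""
    return lam * correction + mu * previous_change
```

For example, the learning rate equals its initial value in the first cycle and has fallen to half of it by cycle 10,001.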

Global Time Warp
Global time warp was implemented by multiplying all firing times of a spike template by a constant scaling factor $\beta$. In Figure 3A, random global time warp between compression by $1/\beta_{\max}$ and dilation by $\beta_{\max}$ was generated by setting $\beta = \exp(q \ln \beta_{\max})$ with $q$ drawn from a uniform distribution between $-1$ and 1 for each presentation.
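A minimal Python sketch of this sampling scheme is given below; the function names and the use of the standard `random` module are illustrative assumptions. Drawing the exponent uniformly makes compression and dilation by the same factor equally likely.

```python
import math
import random

def sample_global_warp(beta_max, rng=random):
    """Draw a warp factor between 1/beta_max and beta_max, uniform in log space."""
    q = rng.uniform(-1.0, 1.0)
    return math.exp(q * math.log(beta_max))

def warp_globally(spike_times, beta):
    """Multiply every spike time of a template by the same factor."""
    return [beta * t for t in spike_times]
```

A factor of 1 leaves the template unchanged, whereas factors below and above 1 compress and dilate it, respectively.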

Dynamic Time Warp
Dynamic time warp was implemented by scaling successive interspike intervals $t_j - t_{j-1}$ of a given template with a time-dependent warping factor $\beta(t)$, such that each warped spike time followed its predecessor by the rescaled interval. The time-dependent factor $q(t) = \mathrm{erfc}(\xi(t)) - 1$ resulted from an equilibrated Ornstein-Uhlenbeck process $\xi(t)$ with a relaxation time of $\tau = 200$ ms that was rescaled by the complementary error function erfc to transform the normal distribution of $\xi(t)$ into a uniform distribution over $[-1, 1]$ at each $t$.
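The procedure can be sketched in Python as follows. Several details are assumptions made for the example rather than statements from the text: the stationary variance of the Ornstein-Uhlenbeck process is set to 1/2, which is the value for which the erfc transform yields a uniform distribution; the warping factor is taken as the exponential of q(t) times the logarithm of the maximal warp, by analogy with the global case; the factor is evaluated at the end of each interval; and the first interval is measured from pattern onset at time zero. The process is sampled directly at the spike times with its exact transition rule.

```python
import math
import random

def dynamic_warp(spike_times, beta_max, tau=200.0, rng=random):
    """Rescale successive interspike intervals by a slowly fluctuating factor."""
    sd = math.sqrt(0.5)                  # assumed stationary SD of the OU process
    xi = rng.gauss(0.0, sd)              # start from the equilibrium distribution
    warped = []
    t_prev, t_prev_warped = 0.0, 0.0
    for t in sorted(spike_times):
        dt = t - t_prev
        a = math.exp(-dt / tau)
        # Exact OU transition over an interval of length dt
        xi = a * xi + sd * math.sqrt(1.0 - a * a) * rng.gauss(0.0, 1.0)
        q = math.erfc(xi) - 1.0          # lies between -1 and 1
        beta = math.exp(q * math.log(beta_max))
        t_prev_warped += beta * dt       # accumulate the rescaled interval
        warped.append(t_prev_warped)
        t_prev = t
    return warped
```

Because the relaxation time is 200 ms, spikes that lie close together receive similar warp factors, while distant segments of a pattern can be compressed and dilated independently.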

Global Inhibition Model
To ensure that the symmetry of excitation and inhibition in our model architecture was not crucial for the time-warp-invariant processing of spike patterns, we implemented a control architecture in which all afferents were confined to be excitatory, except one additional inhibitory synapse, which mimicked the effect of a global inhibitory network. In the time-warped random latency task, spike patterns were fed into the excitatory population as before; however, in addition, the inhibitory synapse received a copy of each incoming spike. All synaptic peak conductances were plastic and controlled by the conductance-based tempotron rule. In this model, synaptic sign changes were prohibited.

Gaussian Spike Time Jitter
Spike time jitter [22] was implemented by adding independent Gaussian noise with zero mean and a standard deviation of 5 ms to each spike of a template before each presentation.

Acoustic Front-End
Sound signals were normalized to unit peak amplitude and converted into spectrograms over $N_{\mathrm{FFT}} = 129$ linearly spaced frequencies $f_j = f_{\min} + j(f_{\max} - f_{\min})/(N_{\mathrm{FFT}} + 1)$ ($j = 1 \ldots N_{\mathrm{FFT}}$) between $f_{\min} = 130$ Hz and $f_{\max} = 5{,}400$ Hz by a sliding fast Fourier transform with a window size of 256 samples and a temporal step size of 1 ms. The resulting spectrograms were filtered into $N_f = 32$ logarithmically spaced Mel frequency channels by overlapping triangular frequency kernels. Specifically, the kernels were constructed from $N_f + 2$ linearly spaced frequencies given by $h_j = h_{\min} + j(h_{\max} - h_{\min})/(N_f + 1)$. After normalization of the resulting Mel spectrogram $S_{\mathrm{Mel}}$ to unit peak amplitude, the logarithm was taken through $\log(S_{\mathrm{Mel}} + \varepsilon) - \log(\varepsilon)$ with $\varepsilon = 10^{-5}$, and the signal in each frequency channel was smoothed in time by a Gaussian kernel with a time constant of 10 ms. Spikes were generated by thresholding of the resulting signals by a total of 31 onset and offset threshold-crossing detector units. Whereas each onset afferent emitted a spike whenever the signal crossed its threshold in the upward direction, offset afferents fired when the signal dropped below the threshold from above. For each frequency channel and each utterance, threshold levels for onset and offset afferents were set relative to the maximum signal over time to 0.01 and to $\theta_j = j/15$ ($j = 1 \ldots 15$). For $\theta_{15} = 1$, onset and offset afferents were reduced to a single afferent whose spikes encoded the time of the maximum signal for a given frequency channel.
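The spike-generation step for a single frequency channel can be illustrated with the short Python sketch below. It assumes that the smoothed log-Mel signal is already available as a list sampled every 1 ms, treats a crossing as occurring at the first sample on the far side of the threshold, and uses hypothetical function names and dictionary keys; the threshold levels follow the text.

```python
def threshold_crossings(signal, level, dt=1.0):
    """Return (onset_times, offset_times) for upward and downward crossings."""
    onsets, offsets = [], []
    for k in range(1, len(signal)):
        if signal[k - 1] < level <= signal[k]:
            onsets.append(k * dt)
        elif signal[k - 1] >= level > signal[k]:
            offsets.append(k * dt)
    return onsets, offsets

def encode_channel(signal, dt=1.0):
    """Map one channel to its onset, offset, and peak afferents.

    Relative thresholds: 0.01 and j/15 for j = 1..14 (onset and offset each),
    plus one afferent at relative threshold 1 marking the time of the maximum.
    """
    peak = max(signal)
    afferents = {}
    for theta in [0.01] + [j / 15.0 for j in range(1, 15)]:
        on, off = threshold_crossings(signal, theta * peak, dt)
        afferents[("onset", theta)] = on
        afferents[("offset", theta)] = off
    afferents[("peak", 1.0)] = [signal.index(peak) * dt]
    return afferents
```

With 15 threshold levels below the maximum, each served by an onset and an offset unit, plus the single peak unit, this yields the 31 afferents per channel mentioned in the text.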

Speech Databases
We used the digit subset of the TI46 Word speech database [24]. This clear speech dataset comprises 26 isolated utterances of each English digit from zero to nine spoken by 16 adult speakers (eight male and eight female). The data is partitioned into a fixed training set, comprising 10 utterances per digit and speaker, and a fixed test set holding the remaining 16 utterances per digit and speaker. We also tested our neuronal word-recognition model on the adult speaker, isolated-digit subset of the TIDIGITS database [27]. This subset comprises two utterances per digit and speaker, i.e., a total of 20 utterances from 225 adult speakers (111 male and 114 female), that are dialectically balanced across 21 dialectical regions (tiling the continental United States). Because the TI46 database only provides utterances of the word ''zero'' for the digit 0, we excluded the utterances of ''oh'' from our TIDIGITS sample.

Digit Classification
Based on the spiking activity of all binary digit detector neurons, a full digit classifier was implemented by ranking the digit detectors according to their individual task performances. As a result, a given stimulus was classified as the target digit of the most reliable of all responding digit detector neurons. If all neurons remained silent, a stimulus was classified as the target digit of the least reliable neuron.
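This ranking scheme amounts to a simple lookup, sketched below in Python. The dictionary-based interface and the function name are illustrative assumptions; the reliability scores stand for whatever measure of individual task performance is used to rank the detectors, with higher values meaning more reliable.

```python
def classify_digit(fired, reliability):
    """Pick the target digit of the most reliable responding detector.

    fired: dict mapping digit -> True if that detector spiked.
    reliability: dict mapping digit -> performance score (higher is better).
    If no detector responds, fall back to the least reliable detector's digit.
    """
    responders = [digit for digit, spiked in fired.items() if spiked]
    if responders:
        return max(responders, key=lambda digit: reliability[digit])
    return min(reliability, key=lambda digit: reliability[digit])
```

The fallback reflects the reasoning that silence from the reliable detectors is informative, so the least reliable detector is the one most likely to have missed its target.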

Spike-Triggered Target Features
To preserve the timing relations between the learned spectrotemporal features and the target words, we refrained from correcting the spike-triggered stimuli for stimulus autocorrelations [73].

Speech Task Training
Test errors in the speech tasks were substantially reduced by training with a Gaussian spike jitter with a standard deviation of $\sigma$ added to the input spikes as well as a symmetric threshold margin $v$ that required the maximum postsynaptic voltage on target stimuli to exceed $V_{\mathrm{thr}} + v$ and to remain below $V_{\mathrm{thr}} - v$ during null stimuli. Values of $\lambda_{\mathrm{ini}}$, $\mu$, $\sigma$, and $v$ were optimized on a 4-dimensional grid. Because for each grid point, only short runs over maximally 200 cycles were performed, we also varied the mean values of initial Gaussian distributions of the excitatory and inhibitory synaptic peak conductances, keeping their standard deviations fixed at 0.001. The reported performances are based on the solutions that had the smallest error fractions over the test set. If not unique, we selected the solution with the highest robustness to time warp (cf. Figure 8B). Note that this naive optimization of the training parameters did not maintain a separate holdout test set for cross-validation and might therefore overestimate the true generalization performance. We adopted this optimization scheme from [25,26] to ensure comparability of the resulting performance measures.
Comparison to the HTK. HTK generalization performance was tested with the HTK package version 3.4.1 [74] with front-end and HMM model parameters following [26]. Specifically, speech data from the TI46 and TIDIGITS databases were converted to 13 Mel-cepstral coefficients (including the 0th order coefficient) along with their first and second derivatives at a frame rate of 5 ms. Mel-coefficients were computed over 30 channels in 25-ms windows with zero mean normalization enabled (TARGETKIND = MFCC_D_A_Z_0). In addition, the following parameters were set: USEHAMMING = T; PREEMPCOEF = 0.97; and CEPLIFTER = 22. Ten HMM models, one for each digit, plus one HMM model for silence were used. Each model consisted of five states (including the two terminal states) with eight Gaussian mixtures per state and left-to-right (no skip) transition topology.

Figure S1. Effect of h-conductance on time rescaling. Time-warp distortion index computed for random latency patterns (see Materials and Methods) versus the maximal h-conductance for different values of the mean synaptic conductance $G_{\mathrm{syn}}/g_{\mathrm{leak}}$: 7.2 (triangles), 10.8 (squares), and 14.4 (circles). Curves were averaged over 2,000 spike-pattern realizations.