Similarity of Cortical Activity Patterns Predicts Generalization Behavior

Humans and animals readily generalize previously learned knowledge to new situations. Determining similarity is critical for assigning category membership to a novel stimulus. We tested the hypothesis that category membership is initially encoded by the similarity of the activity pattern evoked by a novel stimulus to the patterns from known categories. We provide behavioral and neurophysiological evidence that activity patterns in primary auditory cortex contain sufficient information to explain behavioral categorization of novel speech sounds by rats. Our results suggest that category membership might be encoded by the similarity of the activity pattern evoked by a novel speech sound to the patterns evoked by known sounds. Categorization based on featureless pattern matching may represent a general neural mechanism for ensuring accurate generalization across sensory and cognitive systems.


Introduction
When faced with a sensory stimulus that could indicate a predator, prey, or a mate, accurate generalization is critical for survival [1]. For example, vervet monkeys learn to emit different warning calls for each class of predator in their environment, and monkeys who hear these calls exhibit distinct behaviors that indicate they understand the category that each type of call represents [2]. Humans and animals possess the remarkable ability to quickly and accurately determine how similar any image, sound, or smell is to previously learned stimuli.
The first step in categorizing a novel stimulus appears to be quantifying its similarity to known category members [3,4]. Many studies have documented the presence of a generalization gradient for stimuli varying along a single dimension. Pigeons trained to peck a colored light also respond to colors of similar wavelength [5]. Following conditioning to a tone, both humans and animals respond to tones of similar frequency and respond less as similarity decreases [6,7]. However, physical similarity does not always predict perceptual similarity even for stimuli that vary along a single dimension [8][9][10].
The similarity of real-world stimuli is notoriously difficult to predict. The consonants /d/ and /t/ (as in "dad" and "tad") have different voice onset times, pitch contours, formant transition durations, formant onset frequencies, F1 cutbacks, and burst intensities [11]. Male and female voices have different pitches, levels of breathiness, formant frequencies, and formant amplitudes [12]. Any of these features is sufficient to distinguish between ambiguous sounds, but none of these features is necessary to identify the phoneme or gender [13][14][15]. The so-called "lack of invariance" problem in speech perception also occurs in face perception [16,17]. Dozens of physical differences, including pupil-to-pupil distance, chin shape, and nose length, can be used to distinguish between faces, but no single feature or set of features is required. Modern face recognition algorithms use template matching because feature-based approaches failed to support robust recognition [18].
Commercial systems for speech and music recognition have also abandoned the use of feature-based approaches [19,20], but psychophysical and neurophysiological studies continue to focus on the representation of a small set of speech features [21][22][23][24]. In this study, we test the hypothesis that the similarity of activity patterns in sensory cortex supports effective speech sound categorization without the need to compute a set of particular acoustic features. Our study provides a direct demonstration that, like face recognition, featureless template matching accounts for speech categorization performance.

Materials and Methods
Twenty-one rats were trained to categorize speech sounds by voicing or gender. We trained rats to press a lever in response to a single speech sound and refrain from lever pressing to a second speech sound. We then tested their ability to generalize to novel speech sounds. Half of the rats in our study were trained to categorize sounds based on speaker gender (female vs. male, Gender Task group), while the other half were trained to categorize speech sounds based on voicing ('dad' vs. 'tad', Voicing Task group). Behavioral performance on four generalization tasks was compared to multiunit activity recorded at 441 primary auditory cortex (A1) sites from eleven experimentally naïve rats and 903 A1 sites from twenty-one speech trained rats (female Sprague Dawley rats were obtained from Charles River Laboratories). Our datasets are freely available upon request. Behavioral training and A1 recording procedures are identical to our previous studies [25,26].

Ethics statement
This study was performed in strict accordance with the recommendations in the Guide for the Care and Use of Laboratory Animals of the National Institutes of Health. The protocol was approved by The University of Texas at Dallas Institutional Animal Care and Use Committee (Protocol Number: 99-06). All surgery was performed under sodium pentobarbital anesthesia, and every effort was made to minimize suffering.

Speech stimuli
The stimulus set for these experiments was designed so that each sound can be categorized based on (1) the gender of the speaker or (2) the voicing of the initial dental consonants (/d/ vs. /t/). We used the voiced word 'dad' and the voiceless word 'tad' spoken in isolation by 3 male and 3 female native English speakers (n = 12 sounds, Figure 1). The sound names were shortened in the figures; for example, 'DM3' refers to the sound 'dad' spoken by the 3rd male speaker, 'TF2' refers to the sound 'tad' spoken by the 2nd female speaker, and 'D90' refers to the sound 'dad' temporally compressed to 90% of the original stimulus length. As in our earlier studies using the same sounds, the speech sounds were shifted up by one octave using the STRAIGHT vocoder [27] in order to better match the rat hearing range [25,28-33]. The intensity of these sounds was adjusted so that the loudest 100 ms of the vowel was 60 dB SPL. Nine temporally compressed versions of 'dad' and 'tad' spoken by a single female speaker (female 1) were generated using the STRAIGHT vocoder (n = 18 sounds). These stimuli were compressed in increments of 10% down to 10% of the original stimulus length. A version of the female 1 'dad' was also created using STRAIGHT with the pitch one octave lower for use during the discrimination training task prior to gender categorization.

Behavioral training
Our previous study demonstrated that rats can rapidly learn to discriminate English consonant pairs that differ only in their voicing, place, or manner of articulation [25]. In this study, we tested the ability of rats to categorize sets of 5, 6, 10, or 18 novel sounds based on voicing or gender.
The Voicing Task group (n = 6 rats) was trained for two weeks to press a lever in response to 'dad' and not to 'tad' spoken by female 1. After training, the rats were tested for their ability to correctly categorize eighteen temporally compressed versions of 'dad' and 'tad'. For this go/no-go task, rats were rewarded for responding to any version of 'dad' and received a brief time out for false alarming to any version of 'tad'. Following two weeks of testing on the temporal compression voicing task, the rats were tested for their ability to correctly generalize to 'dad' and 'tad' produced by five new talkers (2 female and 3 male).
The Gender Task group (n = 5 rats) was trained to lever press in response to 'dad' spoken by female 1, but not to the same word when the pitch was shifted down by one octave (F0 of 225 Hz) using the STRAIGHT vocoder. After two weeks of pitch discrimination training, the rats were tested for their ability to categorize gender using the novel 'dad' stimuli from three male and two female speakers. Rats were rewarded for pressing in response to 'dad' spoken by a female, but received a time out for pressing in response to 'dad' spoken by a male. Following two weeks of testing on the 'dad' gender task, the rats were tested for their ability to correctly categorize gender using the three male and three female 'tad' stimuli. Rats were rewarded for lever pressing in response to 'tad' spoken by a female, but received a time out for lever pressing in response to 'tad' spoken by a male.
Training took place in double-walled booths that each contained a speaker (Optimus Bullet Horn Tweeter, Radio Shack), house light, and cage (8" length x 8" width x 8" height) with a lever and pellet dish. The pellet dispenser was mounted outside of the booth to minimize sound contamination. Rats received a 45 mg sugar pellet reward for pressing the lever in response to the target sounds, and received a time out where the house light was extinguished for a period of approximately 6 seconds for pressing the lever in response to the non-target sounds. Rats were food deprived to provide task motivation. Additional food was provided as needed to keep rats between 80% and 90% of their full feed weights.
Rats were first trained to press the lever to receive a sugar pellet reward. Each time the rat was near the lever, the rat heard the target sound and received a 45 mg sugar pellet. Pellets were then given only if the rat was touching the lever, and eventually the rat began to press the lever independently. After each lever press, the rat heard the target sound and received a pellet reward. Once they reached the criterion of independently pressing the lever 100 times per session for two sessions, they advanced to the detection phase of training. During this phase, rats from all groups learned to press the lever after hearing the 'dad' speech stimulus spoken by female 1. Rats started with an 8 second lever press window (hit window) after each sound presentation, and the hit window was decreased in 0.5 second increments every few sessions as performance increased, down to a hit window of 3 seconds. When rats reached the criterion of d' ≥ 1.5 for 10 sessions (average of 26 ± 2 sessions), they advanced to the discrimination task. d' is a measure based on signal detection theory of the discriminability of two sets of samples.
From this phase on, rats performed each task for 20 sessions over 2 weeks (2 one-hour training sessions per day). Six rats in the Voicing Task group trained on a 'dad' vs. 'tad' discrimination task for two weeks, followed by a 'dad' vs. 'tad' temporal compression categorization task for two weeks, followed by a 'dad' vs. 'tad' multiple speaker categorization task for two weeks. Five rats in the Gender Task group trained on a 'dad' pitch discrimination task for two weeks, followed by a 'dad' gender categorization task for two weeks, followed by a 'dad' and 'tad' gender categorization task for two weeks.

Neural Similarity Predicts Generalization Behavior PLOS ONE | www.plosone.org 2 October 2013 | Volume 8 | Issue 10 | e78607

The final categorization task in each group used the exact same stimuli, 'dad' and 'tad' spoken by multiple male and female speakers. The Voicing Task group was trained to categorize these stimuli based on voicing, while the Gender Task group was trained to categorize these stimuli based on gender.
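The d' criterion used during training can be illustrated with a short calculation. The text does not give the exact formula used, so the sketch below assumes the standard signal detection definition, d' = z(hit rate) - z(false alarm rate), where z is the inverse of the standard normal cumulative distribution. The function name `d_prime`, the clipping value `floor`, and the example rates are hypothetical choices made for illustration.

```python
from statistics import NormalDist


def d_prime(hit_rate, false_alarm_rate, floor=0.01):
    """Return d' = z(hit rate) - z(false alarm rate).

    Rates of exactly 0 or 1 are clipped to [floor, 1 - floor] so that the
    inverse normal CDF stays finite. The clipping value is an assumption,
    not something specified in the text.
    """
    z = NormalDist().inv_cdf
    hit = min(max(hit_rate, floor), 1 - floor)
    fa = min(max(false_alarm_rate, floor), 1 - floor)
    return z(hit) - z(fa)


# Hypothetical session: lever press on 85% of target trials
# and on 30% of non-target trials.
print(round(d_prime(0.85, 0.30), 2))
```

Under this definition, equal hit and false alarm rates give d' = 0 (no discrimination), and larger values indicate better separation of target and non-target responses.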

Anesthetized recordings
We recorded multi-unit activity (n = 441 sites) in the right primary auditory cortex of eleven experimentally naïve female Sprague-Dawley rats in response to each of the 15 'dad' and 15 'tad' stimuli tested behaviorally. Multi-unit recordings were also collected in the right primary auditory cortex of five gender trained rats (n = 280 sites) and four voicing trained rats (n = 168 sites). Rats were initially anesthetized with pentobarbital (50 mg/kg), and received dilute pentobarbital (8 mg/ml) as needed. Four Parylene-coated tungsten microelectrodes (1-2 MΩ, FHC, Bowdoin, ME, United States) were used to record action potentials ~600 μm below the cortical surface. Recording sites were selected to evenly sample A1 without damaging the cortical surface vasculature.
Each speech sound was presented 20 times (randomly interleaved with a 2 second interstimulus interval). To determine the characteristic frequency of each site, 25 ms tones were presented at 81 frequencies (1 to 32 kHz) and 16

Awake recordings
We recorded multi-unit A1 responses (n = 65 sites) in seventeen experimentally naïve awake rats using chronically implanted microwire arrays, which were described in detail in previous publications [25,34]. Fourteen-channel microwire electrodes were implanted in the right primary auditory cortex using a custom-built mechanical insertion device to rapidly insert electrodes in layers 4/5 (depth ~600 µm) [34]. Recordings were made in response to the 12 'dad' and 'tad' stimuli spoken by 3 male and 3 female speakers, the 18 temporally compressed versions of 'dad' and 'tad', and the sound 'dad' spoken by female 1 with a low pitch (the non-target sound used for discrimination training prior to the gender categorization task). Awake rats were passively exposed to these speech sounds, and were not performing the categorization tasks.

Data analysis
Neurograms were constructed by arranging the responses from each of the A1 recording sites on the y axis from low characteristic frequency to high characteristic frequency sites. The neurogram response for each sound at each site is the average of 20 repeats of that sound played at that site. Neural similarity was computed using Euclidean distance. The Euclidean distance between any two activity patterns is the square root of the sum of the squared differences between the firing rates for each recording site. The onset response to each sound was defined as the 50 ms interval beginning when average neural activity across all sites exceeded the spontaneous firing rate by three standard deviations. Neurograms were temporally binned into a single 50 ms bin. The Euclidean distance was calculated between the activity pattern for a novel sound and both the activity pattern for the target sounds that the rats had previously trained on, and the activity pattern for the non-target sounds that the rats had previously trained on. For the initial gender 'dad' categorization task, the previously trained target and non-target template patterns were the response to the high pitch 'dad' and low pitch 'dad', respectively. For the initial voicing compression task, the template patterns were the response to 'dad' and 'tad' spoken by female 1. For the second gender task, gender 'tad', the template target pattern was the average of the response to the target sounds heard on the previous task (3 female exemplars of 'dad'), while the template non-target pattern was the average of the response to the non-target sounds heard on the previous task (3 male exemplars of 'dad'). 
For the second voicing task, the template target pattern was the average of the response to the target sounds heard on the previous task (10 compressed versions of 'dad'), while the template non-target pattern was the average of the response to the non-target sounds heard on the previous task (10 compressed versions of 'tad'). For each novel sound, the distance to the target pattern was subtracted from the distance to the non-target pattern, so that responses with positive values are more similar to the target pattern, while responses with negative values are more similar to the non-target pattern. Pearson's correlation coefficient was used to examine the relationship between neural similarity and generalization performance on the first day of each of the categorization tasks. Our measure of neural similarity is not dependent on the Euclidean distance measure. Neural similarity quantified using City Block distance and Minkowski distance also significantly predicts generalization behavior on all four tasks. To test the importance of spectral precision, neural recordings from 441 A1 sites were binned into subsets containing 1, 2, 3, 4, 5, 7, 9, 10, 15, 20, 25, 55, 110, 220, or 441 sites (441, 220, 147, 110, 88, 63, 49, 44, 29, 22, 17, 8, 4, 2, or 1 bins, respectively). Each bin contained sites tuned to a specific range of frequencies. For example, when the data were divided into four bins, the frequency ranges were 1-6, 6-10, 10-15, and 15-31 kHz.
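The similarity measure described above can be sketched in a few lines. This is a minimal illustration, not the analysis code used in the study: each activity pattern is assumed to be a plain list holding one onset firing rate per recording site, ordered by characteristic frequency, and all function names are hypothetical. The `p` parameter switches between Euclidean (p = 2) and city block (p = 1) distance. The binning helper assumes that sites within a bin are averaged and that leftover sites are folded into the last bin, neither of which is stated in the text.

```python
def minkowski_distance(a, b, p=2):
    """Distance between two activity patterns (one firing rate per site).

    p=2 gives Euclidean distance; p=1 gives city block distance.
    """
    if len(a) != len(b):
        raise ValueError("patterns must cover the same recording sites")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def mean_pattern(patterns):
    """Average several activity patterns site by site to form a template."""
    return [sum(site) / len(site) for site in zip(*patterns)]


def similarity_index(novel, target_template, nontarget_template, p=2):
    """Distance to the non-target template minus distance to the target.

    Positive values mean the novel pattern is closer to the target template.
    """
    return (minkowski_distance(novel, nontarget_template, p)
            - minkowski_distance(novel, target_template, p))


def bin_sites(pattern, sites_per_bin):
    """Average consecutive CF-sorted sites into len(pattern) // sites_per_bin bins.

    Leftover sites are folded into the last bin (an assumption).
    """
    if sites_per_bin < 1:
        raise ValueError("sites_per_bin must be at least 1")
    n_bins = len(pattern) // sites_per_bin
    bins = []
    for i in range(n_bins):
        start = i * sites_per_bin
        stop = len(pattern) if i == n_bins - 1 else start + sites_per_bin
        chunk = pattern[start:stop]
        bins.append(sum(chunk) / len(chunk))
    return bins


# Hypothetical three-site example (firing rates are made up).
target = mean_pattern([[10, 40, 5], [12, 38, 7]])
nontarget = mean_pattern([[30, 10, 20], [28, 12, 22]])
novel = [12, 35, 8]
print(similarity_index(novel, target, nontarget) > 0)
```

With integer division, the bin counts listed above follow directly from the subset sizes: for example, 441 sites in subsets of 4 give 110 bins, and subsets of 25 give 17 bins.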

Rats categorize novel speech sounds by speaker gender and voicing
Rats accurately generalized to novel sounds after training to discriminate a single sound from each of two distinct categories. The Gender Task group of rats (n = 5) was first trained to discriminate the word 'dad' with a high pitch from the word 'dad' with a low pitch. Following pitch training, the rats were tested on their ability to categorize the gender of novel 'dad' sounds spoken by different male and female speakers. Rats were able to perform the task well above chance on the first day of testing (d' = 1.32 ± 0.3 mean ± se, 83 ± 4% lever press to female vs. 37 ± 10% lever press to male, p = 0.008, Figure 2a). Following two weeks of training on the 'dad' categorization task, the rats were then tested for their ability to generalize to novel 'tad' stimuli spoken by the same three male and three female speakers. The Gender Task rats were able to categorize the novel sounds by gender on the first day of testing (80 ± 8% lever press to female sounds vs. 22 ± 3% lever press to male sounds, d' = 1.76 ± 0.2, p = 0.001, Figure 2b). These results demonstrate that pitch trained rats were able to accurately categorize speech sounds by gender while ignoring differences in speaker and voicing.
Another group of rats (Voicing Task, n = 6) was trained to categorize the same sounds but was required to categorize by voicing while ignoring gender (Figure 1). The Voicing Task group initially learned to discriminate the word 'dad' from the word 'tad', spoken by a single female speaker. These rats were then tested for their ability to categorize these sounds when temporally compressed to create a set of 9 novel 'dad' sounds and 9 novel 'tad' sounds (with durations 10-90% of the original length). The Voicing Task rats were able to generalize to these new stimuli, and accurately categorized 16 of the 18 novel temporally compressed sounds on the first day of training (86 ± 2% lever press to 'dad' vs. 34 ± 7% lever press to 'tad', d' = 1.53 ± 0.2, p = 0.00002, Figure 2c). This same group of rats was next tested for their ability to generalize to novel 'dad' and 'tad' stimuli spoken by three male and two female speakers. This set of sounds was identical to the sounds presented to the Gender Task rats for their second generalization task. Voicing Task rats accurately categorized the novel sounds by voicing on the first day of testing (83 ± 6% lever press to 'dad' vs. 38 ± 6% lever press to 'tad', d' = 1.38 ± 0.1, p = 0.00008, Figure 2d). These results demonstrate that voicing trained rats were able to generalize to novel stimuli while ignoring significant variation in stimulus duration, speaker, or gender. We analyzed the first trial behavioral response to each new sound for each of the four tasks to confirm that categorization behavior recorded on the first day was indeed due to generalization rather than rapid learning. For the first presentation of each sound, rats pressed the lever consistently more often in response to sounds in the target category compared to sounds in the non-target category (average of 72 ± 4% target lever press vs. 44 ± 6% non-target lever press, p = 0.0005). These results confirm that rats are able to accurately categorize novel sounds based on experience with as few as one member of each category.

Figure 2. (a) Gender Task rats successfully generalized from the pitch discrimination task, and accurately pressed the lever more often in response to novel female 'dad' sounds than novel male 'dad' sounds on the first day of testing. Red symbols represent target sounds, blue symbols represent non-target sounds, and black symbols represent target or non-target sounds from the previous task. Circle symbols indicate 'dad' stimuli, while triangle symbols indicate 'tad' stimuli. Error bars indicate s.e.m. across rats. The solid line indicates average percent lever press to silent catch trials, with s.e.m. indicated by the dotted lines. (b) Gender Task rats successfully generalized from the gender 'dad' categorization task, and accurately pressed the lever more often in response to novel female 'tad' sounds than novel male 'tad' sounds on the first day of testing. The sounds presented in subplot d are identical. (c) Voicing Task rats successfully generalized from the voicing discrimination task, and accurately pressed the lever more often in response to novel temporally compressed 'dad' than novel temporally compressed 'tad'. (d) Voicing Task rats successfully generalized from the voicing temporal compression categorization task, and accurately pressed the lever more often in response to 'dad' spoken by multiple novel speakers than 'tad' spoken by multiple novel speakers.

Simple acoustic features cannot fully explain gender and voicing categorization by rats
Historically, speech scientists concluded that each speech category is defined by a set of acoustic features such as pitch, formant frequencies, or voice onset time [35,36]. We measured multiple acoustic features for each of the trained sounds (Table 1), and our results confirm that these features are correlated with generalization performance in rats. The pitch (fundamental frequency, F0), first formant peak and second formant peak of each sound are positively correlated with categorization as a female sound by rats (F0: R² = 0.73, p = 0.0004; F1: R² = 0.41, p = 0.03; F2: R² = 0.35, p = 0.04, for both gender tasks). These cues are also correlated with gender judgments by humans [36,37].
Multiple acoustic features are correlated with generalization performance in the Voicing Task rats. Voice onset time (VOT) and burst duration (the duration of the stop consonant release burst) are both correlated with categorization as an unvoiced consonant by rats (VOT R² = 0.6, p = 0.0001 voicing compression task; VOT R² = 0.75, p = 0.0002 voicing multiple speaker task; Burst duration R² = 0.46, p = 0.001 voicing compression task; Burst duration R² = 0.67, p = 0.001 voicing multiple speaker task; Table 1). These acoustic cues also predict voicing categorization in humans [35,36]. Previous literature, however, demonstrates that simple acoustic parameters, such as VOT and pitch, cannot explain speech perception, especially in difficult listening conditions. Studies in humans and rats have clearly demonstrated that behavioral performance is preserved when background noise or degradation by noise vocoder is used to eliminate VOT, formant, and pitch cues [15,30,31,38,39].
Our behavioral results suggest that the rats do not use a single acoustic feature to accurately categorize sounds by voicing or gender. The behavioral results were inconsistent with the prediction that rats use pitch (fundamental frequency, F0) to discriminate female from male speakers. Rats reliably categorized 'tad' spoken by one of the male speakers (TM1) as a sound spoken by a female even though the pitch was 117 Hz. If the rats categorized the sound based on pitch, they would have been expected to respond as if it was one of the male sounds.
The behavioral results were also inconsistent with the prediction that rats use VOT to discriminate 'dad' from 'tad'. Previous studies have shown that humans and rodents categorize sounds as voiced when they have a VOT of less than 35 ms [9]. After our rats were trained to lever press to 'dad' (VOT = 19 ms) and not to 'tad' (VOT = 79 ms), the rats were tested on versions of these sounds that were temporally compressed such that their VOTs were shortened to 10 to 90% of their initial durations (i.e. 2 to 71 ms, see Table 1). Rats accurately rejected compressed forms of 'tad' even when the VOT was much lower than 35 ms (T20 - T40 in Table 1). Importantly, the rats continued to accurately reject compressed 'tad' sounds even when the VOT was below the value for the trained target 'dad' sound (19 ms). These behavioral responses occurred on the first presentation of these novel sounds, which indicates that the categorical boundary was not shifted by experience. If the rats categorized the sounds based on a single acoustic feature, it would be expected that they would respond (i.e. lever press) to any stimulus with a VOT less than 20 ms. However, the rats failed to respond to the 'tad' with a VOT of 16 ms (because it was compressed to 20% of the original duration). The fact that the rats continued to reliably press to the 19 ms 'dad' demonstrates that they do not categorize the sounds based on a simple measure of acoustic similarity, such as VOT.

Table 1. The acoustic cues pitch (F0), formant frequencies F1 and F2, VOT, and burst duration were quantified for each sound using Praat [96] and WaveSurfer [97] software. Please note that the values of F0, F1, and F2 that the rats heard were twice the values listed in the table.

The single acoustic features analyzed here (for example, F0 or VOT) cannot fully explain the response errors; however, combinations of acoustic features may have the potential of accounting for the behavioral results [40]. Based on our previous findings that the similarity of speech evoked spatiotemporal activity patterns was correlated with discrimination ability [25,26,30,31], we predicted that neural similarity would be able to explain generalization behavior. Neural similarity provides a single, biologically plausible method to explain speech sound categorization without the need to propose multiple specialized features for each speech contrast.
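The VOT argument in this section rests on simple arithmetic, which can be made explicit. The sketch below takes the VOT values stated in the text (19 ms for 'dad', 79 ms for 'tad') and the 35 ms voiced boundary from [9], and assumes that VOT scales in direct proportion to the temporal compression, as the text describes. The function names and the single-feature decision rule are hypothetical illustrations of the hypothesis being tested, not part of the original analysis.

```python
TRAINED_VOT_MS = {"dad": 19, "tad": 79}  # values stated in the text
VOICED_BOUNDARY_MS = 35                  # boundary reported in [9]


def compressed_vot(vot_ms, percent_of_original):
    """VOT after compressing a sound to the given percent of its length.

    Assumes VOT shrinks in proportion to the overall duration.
    """
    return vot_ms * percent_of_original / 100


def predicted_press_by_vot(vot_ms, boundary_ms=VOICED_BOUNDARY_MS):
    """Single-feature rule: press the lever if the VOT is below the boundary."""
    return vot_ms < boundary_ms


# List the compressed 'tad' sounds and what a pure VOT rule would predict.
for percent in range(10, 100, 10):
    vot = compressed_vot(TRAINED_VOT_MS["tad"], percent)
    print(f"T{percent}: VOT = {vot:.1f} ms, "
          f"VOT rule predicts press = {predicted_press_by_vot(vot)}")
```

A pure VOT rule would predict lever presses to every compressed 'tad' whose VOT falls below the boundary, including the version compressed to 20% (about 16 ms, shorter than the 19 ms VOT of the trained 'dad'). The rats rejected that sound, which is why a VOT-only account does not fit the behavior.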

Neural activity pattern similarity explains gender and voicing categorization
We predicted that rats would compare the neural pattern of activity evoked by each novel sound with stored templates of the target and non-target sounds. Rats made generalization errors more often for some sounds than others, and these errors were well explained by comparing the activity pattern for those sounds with the template patterns. For each task, neural similarity was calculated between the activity pattern for a novel sound and the average activity patterns for the target and non-target sounds from the previous task (Figure 3, Figure S1, and Methods). For the first gender generalization task, rats had previously been trained to discriminate the word 'dad' with a high pitch from the word 'dad' with a low pitch. Since the rats only had experience with the two 'dad' sounds, the stored target template for the first gender generalization task was the activity pattern in response to the word 'dad' with a high pitch, and the stored non-target template was the activity pattern in response to the word 'dad' with a low pitch. For the second gender generalization task, the rats had experience with 'dad' spoken by 3 female speakers and 3 male speakers. The stored target template for the second gender generalization task was the average activity pattern in response to 'dad' spoken by the 3 female speakers, while the stored non-target template was the average activity pattern in response to 'dad' spoken by the 3 male speakers.
The pattern of generalization errors on the gender tasks was well explained by the similarity of the activity patterns evoked by each of the novel sounds to the patterns evoked by each of the trained sounds. As we predicted, rats were most likely to make generalization errors in response to the novel sounds which evoked neural activity patterns that were intermediate between the patterns evoked by the target and non-target sounds. We used a Euclidean distance metric to quantify the similarity of primary auditory cortex responses. Response patterns consisted of the onset response from 441 multiunit A1 sites from 11 anesthetized experimentally naïve rats. As predicted, neural similarity between the novel sound and the trained sounds was strongly correlated with generalization performance for both gender generalization tasks (R² = 0.92, p = 0.009 novel 'dad' sounds; R² = 0.94, p = 0.001 novel 'tad' sounds, Figures 3 & 4a,b). These findings support our hypothesis that neural similarity provides a biologically plausible metric of perceptual similarity.
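The R² values reported here come from Pearson's correlation between a neural similarity value and the percent lever press for each novel sound. A minimal sketch of that calculation follows. The function name `pearson_r` is a hypothetical helper, and the five data points are invented purely to show the call; they are not values from this study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical data: one neural similarity value and one percent lever press
# per novel sound (positive similarity = closer to the target template).
neural_similarity = [-4.0, -1.5, 0.5, 2.0, 5.0]
percent_lever_press = [20, 35, 55, 70, 90]

r = pearson_r(neural_similarity, percent_lever_press)
print(f"R^2 = {r ** 2:.2f}")
```

Squaring r gives the fraction of variance in lever pressing that is accounted for by a linear relationship with neural similarity.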
Generalization errors were well explained by comparing the neural response pattern evoked by each of the novel sounds to the patterns evoked by the trained sounds. For example, during the second gender task, rats frequently incorrectly pressed the lever for the 'tad' spoken by male 1 (TM1, Figure 4b). Based solely on the acoustic feature pitch, the rats should have responded as though the sound was male instead of female (see Acoustic features section above). This error is well explained using neural similarity, where the sound more closely resembles the female template compared to the male template (Figure 4b). By examining the neurogram for this sound (Figure 3), it is clear that the sound evokes a strong high frequency response, which makes the response more closely resemble the female sounds (which also evoke a strong high frequency response) compared to the other male sounds (which evoke a weak high frequency response).
The pattern of generalization errors on the voicing tasks was well explained by the similarity of the activity patterns evoked by each of the novel sounds to the patterns evoked by each of the trained sounds. As we predicted, neural similarity between the novel sound and the trained sounds was correlated with generalization performance for both voicing generalization tasks (R² = 0.82, p < 0.0001 voicing compression task; R² = 0.76, p = 0.0009 voicing multiple speaker task, Figures 3 & 4c,d). As seen for the gender tasks, generalization errors on the voicing tasks were well explained by the neural responses. Rats frequently incorrectly responded to the most compressed versions of 'tad' as though they were 'dad' (Figure 4c). The generalization errors to the most compressed versions of 'tad' can be explained by neural responses but are not well explained by the acoustic feature voice onset time. Results from the Voicing Task group confirm our hypothesis that novel speech sounds are assigned to the speech category whose members generate an average activity pattern that most closely resembles the activity pattern evoked by the novel sound. This finding is consistent with earlier predictions that have never been tested. In the natural world, humans and animals generally have experience with more than one exemplar per category. The similarity-based prototype model proposes that a category prototype is the most typical member of the category [41]. An extension of this model proposes that instead of category prototypes being the best examples from their categories, prototypes are an abstraction composed of the average category member [42]. As we predicted, rats with previous categorization experience appear to store templates of the target and non-target sounds based on the average neural responses evoked by the sounds they have experienced, and compare the pattern of activity evoked by each novel sound in the new task to these stored average templates.

Generalization performance is not well correlated with spectrogram similarity
For each task, spectrogram similarity was calculated as the Euclidean distance between the onset power spectrum of a novel sound and the average power spectra for the target and non-target sounds from the previous task (Figure 5). The first 45 ms of the spectrograms were used to match the 50 ms neural analysis window, excluding 5 ms to account for minimum neural delay. A similar pattern of correlation was observed across a wide range of analysis windows. Analysis of the onset power spectrum alone is not able to accurately predict generalization behavior for the four tasks because spectral analysis is only influenced by spectral energy and does not take into account the temporal characteristics of the acoustic energy or the neural response properties. Thus, it is perhaps not surprising that neural analysis more accurately predicts behavior.

Responses from trained rats are correlated with generalization performance
Previous studies have documented primary sensory cortex plasticity following categorization training [43]. We tested
Figure 3. Neurograms depicting the onset response of rat A1 neurons to speech sounds. Multi-unit data was collected from 441 recording sites in eleven anesthetized experimentally naïve adult rats. Average post-stimulus time histograms (PSTH) derived from twenty repeats were ordered by the characteristic frequency (kHz) of each recording site (y axis). Time is represented on the x axis (-5 to 50 ms). The firing rate of each site is represented in grayscale, where black indicates 450 spikes/s. For comparison, the mean population PSTH evoked by each sound is plotted above the corresponding neurogram. To facilitate comparison between the naïve and trained responses, the mean PSTH y axis is set to 450 Hz for all neurogram figures. For naïve rats, 'tad' female #3 evokes the maximum peak firing rate (351 Hz) across the twelve sounds. As in Figure 1, rows differ in voicing (top row is 'dad', bottom row is 'tad'), while columns differ in gender (left three columns are female, right three columns are male).
(b) The neural similarity between each novel sound and the template sounds is correlated with generalization performance on the gender 'tad' task. (c) The neural similarity between the response pattern for each novel sound and the response pattern for each of the two template sounds is correlated with generalization performance on the voicing temporal compression categorization task. (d) The neural similarity between each novel sound and the template sounds is correlated with generalization performance on the voicing multiple speaker task.
also not increased in trained rats compared to naïve control rats (1.9 ± 0.3 spikes in trained rats vs. 1.7 ± 0.2 spikes in naïve rats, p = 0.53), but the response strength to tones was decreased in trained rats (2.1 ± 0.1 spikes in trained rats vs. 3 ± 0.3 spikes in naïve rats, p = 0.005).
The onset latency to the trained sounds in both trained groups did not change compared to naïve controls (11.1 ± 0.7 ms in trained rats vs. 11.5 ± 0.6 ms in naïve rats, p = 0.64). While these results show that categorization training does not enhance auditory cortex response strength, they do not rule out that plasticity plays a role in generalization performance. To determine if auditory cortex plasticity enhanced the distinction between sounds from different categories, we compared the correlation between neural similarity and generalization performance using neural responses collected from voicing and gender trained rats (Figure S2). If auditory cortex plasticity were required in order to accurately predict performance, we would have expected a stronger correlation between neural similarity and generalization performance using neural responses collected from trained compared to naïve rats. The neural
Euclidean distance (spectral similarity) between the spectrogram for each novel sound and the spectrogram for each of the two template sounds is weakly correlated with generalization performance on the gender 'dad' task. Positive values are more similar to the target template, while negative values are more similar to the non-target template. The sound name abbreviation is printed next to each data point, see Methods. Solid lines indicate the best linear fit. (b) The spectral similarity between each novel sound and the template sounds is weakly correlated with generalization performance on the gender 'tad' task. (c) The spectral similarity between the spectrogram for each novel sound and the spectrogram for each of the two template sounds is weakly correlated with generalization performance on the voicing temporal compression categorization task. (d) The spectral similarity between each novel sound and the template sounds is weakly correlated with generalization performance on the voicing multiple speaker task.
Figure S3a,b) and both voicing categorization tasks (R² = 0.71, p < 0.0001, voicing compression task; R² = 0.58, p = 0.01, voicing multiple speaker task; Figure S3c,d). Neural similarity was highly correlated with generalization performance on each of the four tasks whether the neural responses were recorded in naïve or trained rats (naïve average R² = 0.86, p < 0.01; trained average R² = 0.79, p < 0.02). This result is consistent with earlier reports that speech sounds evoke distinct neural patterns before training begins [25,26,30,31,46]. The average Euclidean distance between stimuli from different categories was not increased in trained rats compared to naïve rats (p > 0.05). Our observation suggests that changes in A1 are not responsible for improved performance (see Text S1). Previous studies have detailed the complexity of training-induced plasticity, which is dependent on both the auditory field and the time course of training. Birds trained to discriminate songs have shown either an increase or a decrease in the response strength to familiar songs compared to unfamiliar songs depending on the auditory field [44,47]. Earlier studies have also reported improved categorization in the absence of plasticity in primary sensory cortex [48][49][50][51][52]. Training-induced map plasticity in A1 can later return to a normal topography without negatively impacting behavioral performance [49,53]. Improved performance may result from changes in higher cortical fields, such as the superior temporal gyrus or prefrontal cortex, that exhibit categorical responses to speech sounds [54][55][56][57].
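The R² values reported in this section describe how much of the variance in generalization performance is explained by a linear relationship with neural similarity. A minimal sketch of that computation, as the squared Pearson correlation, is given here. The function name and the example values are hypothetical and are not data from the study.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical values for five novel sounds (illustration only).
neural_similarity = [-0.4, -0.2, 0.0, 0.2, 0.4]
percent_target_responses = [10.0, 30.0, 45.0, 70.0, 95.0]
r2 = r_squared(neural_similarity, percent_target_responses)
print(round(r2, 3))
```

Comparing this statistic for responses recorded in naïve and in trained animals is one way to ask whether training-induced plasticity improves the prediction of behavior.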

Analysis of categorization by different neural subpopulations
The patterns of neural activity evoked by each of the sounds suggest that gender differences are encoded in the onset response of high frequency neurons, while voicing differences are encoded in the onset response of low frequency neurons (Figures 3, 6 and S4). For the gender tasks, sounds spoken by a female evoked 207% more spikes than sounds spoken by a male in high frequency neurons between 16 and 32 kHz (p < 0.0001, Figure 6a,b), but there was no significant difference in the firing rate in low frequency neurons between 1 and 2 kHz (p = 0.66). In contrast to gender firing differences, 'dad' sounds evoked 302% more spikes than 'tad' sounds in low frequency neurons between 1 and 2 kHz (p < 0.0001, Figure 6c,d), but there was a much smaller difference in the firing rate in high frequency neurons between 16 and 32 kHz (16% fewer spikes, p = 0.05). This finding contrasts with earlier reports suggesting that voicing is encoded in the temporal interval between two activity peaks [22], and pitch is encoded in low frequency neurons [21]. Our results suggest that the spatial activity pattern can be used to accurately categorize these speech sounds.
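The frequency-band comparison above can be sketched by grouping recording sites into one-octave bins by characteristic frequency and comparing the mean response in each bin. The bin edges (1 to 32 kHz in five octaves), the function names, and the spike counts are assumptions for illustration.

```python
import math


def octave_bin(cf_khz, low_khz=1.0, n_bins=5):
    """Index of the one-octave bin containing cf_khz, or None if outside.

    With the defaults the bins are 1-2, 2-4, 4-8, 8-16, and 16-32 kHz
    (lower edge inclusive, upper edge exclusive).
    """
    if cf_khz < low_khz:
        return None
    index = int(math.floor(math.log2(cf_khz / low_khz)))
    return index if index < n_bins else None


def mean_spikes_per_bin(cfs_khz, spikes, n_bins=5):
    """Mean spike count in each octave bin (None for empty bins)."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for cf, count in zip(cfs_khz, spikes):
        b = octave_bin(cf, n_bins=n_bins)
        if b is not None:
            sums[b] += count
            counts[b] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]


def percent_more(a, b):
    """By what percentage response a exceeds response b."""
    return 100.0 * (a - b) / b


# Hypothetical sites and spike counts (not data from the study).
cfs = [1.5, 1.8, 5.0, 20.0, 24.0]
female = mean_spikes_per_bin(cfs, [2.0, 2.0, 3.0, 6.0, 6.0])
male = mean_spikes_per_bin(cfs, [2.0, 2.0, 3.0, 2.0, 2.0])
print(percent_more(female[4], male[4]))  # difference in the 16-32 kHz bin
```

In this toy example the two response sets differ only in the highest octave, which mirrors the form of the gender result without reproducing its values.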
The temporal activity pattern also contains information that can be used to accurately categorize the sounds by voicing or gender. For the gender tasks, sounds spoken by a female evoked 45% more spikes than sounds spoken by a male in neurons responding to a tone faster than 10 ms (p < 0.0001, Figure 7a,b and Figure S5), but there was no significant difference in the firing rate in neurons responding slower than 13 ms. In contrast to gender firing differences, 'dad' sounds evoked 28% more spikes than 'tad' sounds in neurons responding to a tone slower than 13 ms (p = 0.001, Figure 7c,d and Figure S5), but there was no significant difference in the firing rate in neurons responding faster than 10 ms. Our results suggest that both the spatial and the temporal activity pattern can be used to accurately categorize these speech sounds.
There are many potential methods to compute the similarity between neural response patterns that accurately predict generalization performance. Neural similarity was highly correlated with generalization performance for all four tasks whether Euclidean, City Block, or Minkowski distance metrics were used (R² > 0.73, p < 0.03). The correlation remains high if the window used to quantify the neural response ends 30 to 120 ms after sound onset (R² > 0.51, p < 0.03, Figure S6a and Figure S7). Neural similarity is only correlated with generalization performance when the onset response is included in the analysis window (p < 0.05, Figure S6b). This finding is consistent with classic studies showing speech sounds can be accurately categorized using only the initial few tens of milliseconds [58,59]. Although our initial analyses considered the neural responses of each A1 recording site separately, to determine the amount of spectral precision that is necessary, we divided sites into bins that were tuned to specific characteristic frequency ranges. The correlation between generalization performance and neural similarity remains high even if the sites are binned by characteristic frequency into as few as two bins (R² > 0.61, p < 0.01) [30]. The consistency of our results across a wide range of parameters supports our hypothesis that the similarity to previously learned patterns is used to categorize novel stimuli. These results are consistent with recent imaging results showing that even neural metrics with poor spatial and temporal precision can be well correlated with categorization performance [60,61].
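The three distance metrics named above belong to one family: City Block and Euclidean distance are the Minkowski distance with exponent 1 and 2, respectively. A minimal sketch follows. The text does not state which Minkowski exponent was used, so the value 3 in the example is an assumption.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two response patterns."""
    if p < 1:
        raise ValueError("p must be at least 1")
    if len(a) != len(b):
        raise ValueError("patterns must have the same length")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def city_block(a, b):
    """City Block (Manhattan) distance: Minkowski with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Euclidean distance: Minkowski with p = 2."""
    return minkowski(a, b, 2)


# Hypothetical two-site patterns (illustration only).
pattern_a = [0.0, 0.0]
pattern_b = [3.0, 4.0]
print(city_block(pattern_a, pattern_b))
print(euclidean(pattern_a, pattern_b))
print(minkowski(pattern_a, pattern_b, 3))  # exponent 3 is an assumed choice
```

Larger exponents give more weight to the sites with the largest response differences, so agreement across metrics suggests the result does not depend on a few strongly driven sites.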
Neural similarity accurately predicts generalization performance using both awake and anesthetized neural responses. The correlation between neural similarity and generalization performance using neural responses from experimentally naïve awake rats was strong for the gender 'dad' task (R² = 0.89, p = 0.02), the voicing temporal compression task (R² = 0.61, p = 0.0001), the gender 'tad' task (R² = 0.78, p = 0.02), and the voicing multiple speaker task (R² = 0.44, p = 0.04). This result strengthens our finding in experimentally naïve anesthetized rats that auditory cortex plasticity is not required to predict generalization performance. Using both anesthetized and awake responses, we examined how large a neural population must be sampled to accurately estimate neural similarity and generalization performance. Given the great diversity of response properties in A1 [62], we expected that a large sample size might be necessary. We randomly selected groups of 1, 2, 5, 10, 20, 50, 100, 200, 300, or 441 anesthetized A1 sites and randomly selected groups of 1, 2, 5, 10, 20, 30, 35, 50, 60, or 65 awake A1 sites to determine the minimum population size required in order to predict generalization. We found that the correlation between neural similarity and generalization performance becomes significant when more than 20 randomly selected multi-unit clusters were used to estimate each neural activity pattern and asymptotes at approximately 100 (p < 0.05, Figure 8). Neural similarity was not better correlated with generalization behavior when A1 neurons were selected to maximize the difference in the evoked responses. Selecting subpopulations also did not reduce the number of A1 sites needed to generate a significant correlation.
For example, when A1 sites with low frequency tuning (< 8 kHz, Figure 6) and long latency (> 13 ms, Figure 7) were used to compute neural similarity and compared with performance on the voicing task, approximately the same number of sites was required to generate a similar correlation coefficient compared to neural similarity based on a randomly selected set of A1 sites. The consequence was the same when subpopulations were used that generated the maximum response difference for the gender task (i.e. high frequencies and short latencies). These results confirm earlier observations that population responses most accurately reflect behavioral ability [25,63]. There is now strong evidence that the degree of abstraction increases with distance from the receptor surface (e.g. cochlea) and that categorization is the result of neural processing distributed across many brain regions [64].
Figure 6. Peak firing rate differences in high and low frequency neurons for gender and voicing distinctions. Peak firing rate for target and non-target sounds differs in high frequency neurons for gender distinctions, and differs in low frequency neurons for voicing distinctions. (a) For the gender task using 'dad' stimuli, target female 'dad' sounds evoke a larger response in high frequency neurons compared to non-target male 'dad' sounds. Each of the 441 A1 recording sites from experimentally naïve rats was binned by characteristic frequency into one of five bins each spanning one octave. Error bars indicate s.e.m. across each of the sounds. (b) For the gender task using 'tad' stimuli, target female 'tad' sounds evoke a larger response in high frequency neurons compared to non-target male 'tad' sounds. (c) For the voicing temporal compression task, target 'dad' sounds evoke a larger response in low frequency neurons compared to non-target 'tad' sounds. (d) For the voicing multiple speaker task, target 'dad' sounds evoke a larger response in low frequency neurons compared to non-target 'tad' sounds. doi: 10.1371/journal.pone.0078607.g006
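The population-size analysis described above, in which groups of randomly selected A1 sites are used to estimate neural similarity, can be sketched as follows. The synthetic patterns, the number of random draws, the seeds, and all names are assumptions for illustration. The study's own procedure related similarity to behavior, which this sketch does not attempt.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length response patterns."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsampled_similarity(novel, target, nontarget, n_sites, rng):
    """Non-target distance minus target distance on a random subset of sites."""
    indices = rng.sample(range(len(novel)), n_sites)

    def pick(pattern):
        return [pattern[i] for i in indices]

    return (euclidean(pick(novel), pick(nontarget))
            - euclidean(pick(novel), pick(target)))


def mean_similarity_by_size(novel, target, nontarget, sizes,
                            n_draws=200, seed=0):
    """Average subsampled similarity for each population size."""
    rng = random.Random(seed)
    means = {}
    for n in sizes:
        draws = [subsampled_similarity(novel, target, nontarget, n, rng)
                 for _ in range(n_draws)]
        means[n] = sum(draws) / n_draws
    return means


# Synthetic patterns for 100 hypothetical sites (not data from the study).
data_rng = random.Random(1)
target = [data_rng.gauss(5.0, 1.0) for _ in range(100)]
nontarget = [data_rng.gauss(3.0, 1.0) for _ in range(100)]
novel = [t + data_rng.gauss(0.0, 0.5) for t in target]  # resembles the target

means = mean_similarity_by_size(novel, target, nontarget, [1, 5, 20, 100])
for size, value in sorted(means.items()):
    print(size, round(value, 2))
```

Restricting the sampled indices to a chosen subpopulation (for example, low frequency or long latency sites) instead of drawing from all sites would give the subpopulation comparison described in the text.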

Discussion
We tested the hypothesis that the similarity between neural activity patterns predicts speech sound generalization without the need to compute multiple acoustic features. Speech sounds are widely believed to be categorized based on the integration of dozens of acoustic features. At least sixteen features have been proposed to contribute to differences in voicing, including voice onset time, pitch contour, burst intensity, and F1 cutback [11]. Separate sets of acoustic features can be used to distinguish between speech sounds differing in gender, place of articulation, vowel, or frication [12,37,65-69]. While any of these features is sufficient to categorize a speech sound, no particular acoustic difference is required to accurately categorize a sound [13,14]. Our results from four voicing or gender speech categorization tasks suggest that template matching in the brain can account for the classic "lack of invariance" of speech perception without requiring storage and analysis of the relationship between a large number of discrete features. Our study failed to find evidence of neurons tuned exclusively to one acoustic feature of speech sounds (e.g. VOT or pitch). This result is consistent with a recent study demonstrating that responses in auditory cortex neurons can be influenced by multiple acoustic features of speech sounds [70].
The behavioral and physiological results from our study confirm and extend findings from earlier studies [40]. Our observation that rats trained to discriminate sounds based on voicing or gender can accurately categorize novel sounds even on the first presentation confirms previous studies showing that animals can categorize sounds based on voicing or gender differences [9,51,71]. The neural responses collected in this study are similar to earlier reports of speech sound responses in humans and animals [22,51,72,73].
(a) For the gender task using 'dad' stimuli, target female 'dad' sounds evoke a larger response in fast neurons that respond to tones in less than 10 ms compared to non-target male 'dad' sounds. Each of the 441 A1 recording sites from experimentally naïve rats was binned by onset latency into one of five bins each spanning one millisecond. Error bars indicate s.e.m. across each of the sounds. (b) For the gender task using 'tad' stimuli, target female 'tad' sounds evoke a larger response in fast neurons compared to non-target male 'tad' sounds. (c) For the voicing temporal compression task, target 'dad' sounds evoke a larger response in slow neurons that respond to tones slower than 13 ms compared to non-target 'tad' sounds. (d) For the voicing multiple speaker task, target 'dad' sounds evoke a larger response in slow neurons compared to non-target 'tad' sounds.
Our demonstration that speech perception can be explained without explicit extraction of specialized acoustic features closely parallels recent advances in face processing, which no longer rely on the computation of features such as pupil-to-pupil distance, nose length, or chin shape. Instead, it appears that biological systems and more effective artificial systems represent the visual input as activity among a large diverse set of broadly tuned filters and categorize novel inputs based on their similarity to stored templates. Importantly, there is no need to extract any particular features.
Recent software applications use a similar featureless template-based method to allow for identification of millions of songs based on poor quality versions sung, whistled, hummed, or played by amateurs [19,20,74].
Our results are consistent with studies of category formation in other modalities [57,75-80]. Previous studies have shown that there is a gradual transformation of sensory information to a category decision through the ascending somatosensory, visual, and auditory pathways [75,76,81]. The earlier stages of sensory processing are driven by physical properties. Responses in primary sensory cortex are more abstract and are often shaped by multiple feature combinations [82][83][84][85]. Higher cortical fields are shaped by behavioral requirements and neurons become more sensitive to the meaning of stimuli and less sensitive to changes in physical characteristics that are irrelevant to category membership. Neurons in prefrontal cortex exhibit strong category selectivity and likely contribute to the behavioral response (i.e. motor output) [8,76,86,87].
Speech responses in inferior colliculus are strongly influenced by physical features, while responses in A1 are more abstract [28,60,79]. Responses in higher auditory fields follow different processing streams that extract different features from speech [32,53,81,88-90]. For example, anterior auditory field is responsible for categorization based on temporal properties and posterior auditory field is responsible for categorization based on spatial location [88]. Macaques trained to discriminate between human speech sounds have neurons in the superior temporal gyrus and prefrontal cortex that respond categorically to the trained sounds [56,57]. Prefrontal neurons (but not the superior temporal gyrus neurons) are modulated by the monkeys' behavioral responses, which confirms that speech categories result from the gradual transformation of acoustic information across multiple brain regions.
Figure 8. Average percent of variance explained across the four generalization tasks using awake and anesthetized responses. Percent of variance explained (R²) increases as the population size increases. Neural similarity using the onset activity pattern from individual anesthetized (black line) or awake (gray line) multi-unit sites was best correlated with behavior when more than 20 sites were used. Error bars indicate s.e.m. across the four tasks. doi: 10.1371/journal.pone.0078607.g008
We do not believe that categorization takes place in A1. Our results are consistent with earlier theoretical studies showing that categorical responses can be created from the activity patterns observed in sensory cortex [54,91-93]. For example, a biologically plausible model of A1 neurons can categorize speech sounds and correctly generalize to novel stimuli [54]. These theoretical studies combined with our neurophysiology study suggest a potential biological mechanism for generalization, which has been described as "the most fundamental problem confronting learning theory" [94].
Based on our observation that neural similarity can accurately predict categorization on four auditory generalization tasks, we propose that speech sound generalization results from assigning novel stimuli to the category of stimuli that evokes the most similar activity pattern. Animal studies provide the opportunity to carefully control the sensory experience of the subjects and to precisely manipulate neural function. Artificial stimuli produced by the Klatt speech synthesizer could be used to explore the co-variation between the acoustic features which were not varied systematically in this study [12]. It would be interesting to determine how well neural similarity predicts generalization behavior 1) in the face of greater variability among stimuli from the same category, 2) for categories of stimuli involving multiple modalities, and 3) for more abstract cognitive categories. It would also be interesting to relate behavioral reaction time and neural similarity by using nose poke withdrawal to more accurately measure reaction time. Patterned optogenetic stimulation could be used to directly test whether the activity patterns observed in our study are sufficient for speech sound categorization [95]. Simultaneous multichannel recordings in awake behaving animals would make it possible to relate neural correlation patterns to behavior. Recording, lesion and microstimulation experiments in A1 and higher regions are needed to further evaluate our hypothesis that neural response similarity is responsible for the remarkable ability of humans and animals to rapidly and accurately generalize from small training sets.

Supporting Information
Figure S1. Neural similarity between two novel sounds and the trained target and trained non-target. Multi-unit data was collected from 441 recording sites (x axis) in eleven anesthetized rats and is ordered by the characteristic frequency (kHz) of each recording site. The number of spikes fired in response to each sound during the first 50 ms of the response is represented on the y axis. (a) The response to the known target sound (red, 'dad' spoken by female #1) and (b) known non-target sound (blue, 'tad' spoken by female #1). (c) The response to a novel 'dad' sound and (d) a novel 'tad' sound. Both sounds were spoken by female #1 and temporally compressed by 50%. (e-h) The response pattern difference between the novel 'dad' sound and the target (e) and non-target sounds (f), and the novel 'tad' sound and the target (g) and non-target sounds (h). The difference between the novel 'dad' and the target (e, 309) was smaller than the difference between the novel 'dad' and the non-target (f, 536), indicating that the novel 'dad' and the target are more similar. The difference between the novel 'tad' and the non-target (h, 267) was smaller than the difference between the novel 'tad' and the target (g, 535), indicating that the novel 'tad' and the non-target are more similar. (PDF)
Figure S2. Neurograms depicting the onset response of gender trained and voicing trained rat A1 neurons. (a) Multi-unit data was collected from 280 recording sites in five anesthetized gender trained rats. Average post-stimulus time histograms (PSTH) derived from twenty repeats were ordered by the characteristic frequency (kHz) of each recording site (y axis). Time is represented on the x axis (-5 to 50 ms). The firing rate of each site is represented in grayscale, where black indicates 450 spikes/s. For comparison, the mean population PSTH evoked by each sound is plotted above the corresponding neurogram. To facilitate comparison between the naïve and trained responses, the mean PSTH y axis is set to 450 Hz for all neurogram figures. For gender trained rats, 'tad' female #3 evokes the maximum peak firing rate (330 Hz) across the twelve sounds. As in Figure 1, rows differ in voicing (top row is 'dad', bottom row is 'tad'), while columns differ in gender (left three columns are female, right three columns are male). (b) Neurograms depicting the onset response of voicing trained rat A1 neurons to each of the twelve sounds shown in Figure 1. Multi-unit data was collected from 168 recording sites in four anesthetized voicing trained rats. For voicing trained rats, 'tad' female #2 evokes the maximum peak firing rate (414 Hz) across the twelve sounds. (PDF)
Figure S3. Neural correlates of generalization performance using neural responses from gender and voicing trained rats. (a) The normalized Euclidean distance (neural similarity) between the response pattern for each novel sound and the response pattern for each of the two template sounds is correlated with generalization performance on the gender 'dad' task. Positive values are more similar to the target template, while negative values are more similar to the non-target template. Red symbols represent target sounds and blue symbols represent non-target sounds. Circle symbols indicate 'dad' stimuli, while triangle symbols indicate 'tad' stimuli. The sound name abbreviation is printed next to each data point, see Methods. Solid lines indicate the best linear fit. (b) The neural similarity between each novel sound and the template sounds is correlated with generalization performance on the gender 'tad' task. (c) The neural similarity between each novel sound and the template sounds is correlated with generalization performance on the voicing temporal compression task. (d) The neural similarity between each novel sound and the template sounds is correlated with generalization performance on the voicing multiple speaker task. (PDF)
Figure S4. Peak firing rate differences in high and low frequency neurons for gender and voicing distinctions. (a) For the gender task using 'dad' stimuli, target female 'dad' sounds (red line) evoke a larger response in high frequency neurons compared to non-target male 'dad' sounds (blue line). Each of the 280 A1 recording sites from gender trained rats was binned by characteristic frequency into one of five bins each spanning one octave. Error bars indicate s.e.m. across each of the sounds. (b) For the gender task using 'tad' stimuli, target female 'tad' sounds evoke a larger response in high frequency neurons compared to non-target male 'tad' sounds. (c) For the voicing temporal compression task, target 'dad' sounds evoke a larger response in low frequency neurons compared to non-target 'tad' sounds. Each of the 168 A1 recording sites from voicing trained rats was binned by characteristic frequency into one of five bins each spanning one octave. (d) For the voicing multiple speaker task, target 'dad' sounds evoke a larger response in low frequency neurons compared to non-target 'tad' sounds. (PDF)
Figure S5. The percentage of sites responding at different onset latencies. Each of the 441 A1 recording sites from experimentally naïve rats was binned by onset latency in response to tones. Sites were binned into one of five bins: sites responding faster than 10 ms, between 10-11 ms, 11-12 ms, 12-13 ms, or slower than 13 ms. (PDF)
Figure S6. Average percent of variance explained (R²) in anesthetized animals across the four generalization tasks using varying response windows. (a) The average R² across the 4 generalization tasks using a 30-120 ms neural response analysis window is significantly correlated with generalization performance. Filled symbols indicate statistically significant correlations between neural similarity and behavior. (b) The average R² across the 4 generalization tasks using a 50 ms analysis window with a varying start time. The correlation is strongest using the onset response information. (PDF)
Figure S7. Average percent of variance explained (R²) in awake animals across the four generalization tasks using varying response windows. The average R² across the 4 generalization tasks using a 50-60 ms neural response analysis window in awake animals is significantly correlated with generalization performance. Filled symbols indicate statistically significant correlations between neural similarity and behavior. (PDF)
Text S1. (DOC)