Classifying sex and strain from mouse ultrasonic vocalizations using deep learning

doi:10.1371/journal.pcbi.1007918

Fig 1.

Recording and classifying mouse vocalizations.

A Mouse vocalizations were recorded from a pair of mice, of which one was awake while the other was anesthetized, allowing an unambiguous attribution of the recorded vocalizations. B Vocalizations from male and female mice (recorded in separate sessions) share many properties while differing in others. The samples shown were picked at random and indicate that differences exist, although other samples would look more similar. C Vocalizations were automatically segmented using a set of filtering and selection criteria (see Methods for details), leading to a total set of 10055 vocalizations. D For individual vocalizations, we aimed to estimate their properties and the sex of their emitter. First, the ground truth for the properties was established by a human classifier. We next estimated 3 relations, Spectrogram-Properties, Properties-Sex and Spectrogram-Sex directly, using a Deep Neural Network (DNN), support vector machines (SVM) and regularized linear regression (LR). E The properties attributed manually to individual vocalizations could take different values (rows, red number in each subpanel), illustrated here for a subset of the properties (columns). See Methods for a detailed list and description of the properties.

Fig 2.

Basic sex-dependent differences between vocalizations.

(A-I) We quantified a range of properties for single vocalizations (see Methods for details) and compared them across the sexes (blue: male, red: female). Most properties exhibited significant differences in median between the sexes (Wilcoxon rank sum test), except for the average frequency (B) and directionality (D). However, given the distribution of the data (box-plots, left in each panel), the variability within each property makes it hard to use individual properties for determining the sex of the emitter. The graphs on the right for each color in each panel show the mean and SEM. In G-I, only a few USVs have values different from 0, hence the box-plots sit at 0. (J-M) Dimensionality reduction can reveal more complex relationships between multiple properties. We computed principal components (PCA) and t-distributed stochastic neighbor embedding (t-SNE) for both the features (J/L) and the spectrograms (K/M). In particular, feature-based t-SNE (L) produced some distinct groupings, which, however, did not separate the sexes well (red vs. blue, see Fig 7 for more details). Each dot represents a single vocalization after dimensionality reduction. Axes are left unlabelled, since they represent a combination of properties.
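
The per-property comparison in panels A-I relies on the Wilcoxon rank sum test. A minimal sketch of that test is given here, assuming the large-sample normal approximation and no tie correction of the variance; the function name `rank_sum_test` and the example values are hypothetical and not taken from the paper.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.

    Ties receive midranks; the variance is NOT tie-corrected (assumption).
    Returns (z, p)."""
    # Pool both samples, remembering which group each value came from.
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n = len(pooled)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        midrank = (i + j) / 2 + 1  # average of the 1-based ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[k] = midrank
        i = j + 1
    n1, n2 = len(x), len(y)
    # Rank sum of the first sample and its null mean and standard deviation.
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

# Hypothetical values of one property for two groups of vocalizations.
group_a = list(range(1, 21))
group_b = list(range(21, 41))
z, p = rank_sum_test(group_a, group_b)
```

With real data, a library routine with exact small-sample handling and tie correction would normally be preferable to this approximation.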

Fig 3.

Deep neural network reliably determines the emitter's sex from individual vocalizations.

A We trained a deep neural network (DNN) with 6 convolutional and 3 fully connected layers to classify the sex of the emitter from the spectrogram of the vocalization. B The network's performance on the (Female #1) test set rapidly improved to an asymptote of ~80% (dark red), clearly exceeding chance. Correspondingly, the change in the network's weights (light red) progressively decreased, stabilizing after ~6k iterations. Data shown for a representative training run. C The shape of the input fields in the first convolutional layer became more reminiscent of the tuning curves in the auditory system [50,51]. Representative samples were chosen from the entire set of 256 units in this layer. D The average performance of the DNN (cross-validation across animals) was 76.7±6.6%, which did not differ significantly between male and female vocalizations (p>0.05, Wilcoxon rank sum test). E The DNN performance by far exceeded the performance of ridge regression (regularized linear regression, blue, 50.7±1.6%) and support vector machines (SVM, green, 56.2±1.9%). Bars in light colors show the corresponding estimation with randomized labels, which are all at chance level (gray line). F The performance of the DNN was not limited only by the properties of the spectrograms (e.g. background noise, sampling, etc.), since a DNN trained on the number of breaks (right bars) performed significantly better. This control shows that the identical set of stimuli can be better separated on a simpler (but also binary) task. Light bars again show performance on randomized labels.
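
Panels D-F combine cross-validation across animals with a randomized-label control. The sketch below illustrates that evaluation scheme under stated assumptions: it holds out one animal at a time (the caption does not specify the exact split), and it substitutes a trivial nearest-class-mean rule on a single scalar feature for the DNN. All names (`leave_one_animal_out`, `fit_class_means`, the animal identifiers) and the data are hypothetical.

```python
import random

def fit_class_means(train):
    """Stand-in classifier (NOT the DNN): store the mean feature per class."""
    means = {}
    for label in (0, 1):
        values = [f for f, y in train if y == label]
        means[label] = sum(values) / len(values)
    return means

def predict(means, feature):
    """Assign the class whose mean is closest to the feature."""
    return min(means, key=lambda label: abs(feature - means[label]))

def leave_one_animal_out(samples, shuffle_labels=False, seed=0):
    """samples: list of (animal_id, feature, label) with label in {0, 1}.

    Trains on all animals but one and tests on the held-out animal.
    With shuffle_labels=True the training labels are permuted, which
    should push accuracy towards chance. Returns {animal_id: accuracy}."""
    rng = random.Random(seed)
    accuracies = {}
    for held_out in sorted({a for a, _, _ in samples}):
        train = [(f, y) for a, f, y in samples if a != held_out]
        test = [(f, y) for a, f, y in samples if a == held_out]
        if shuffle_labels:
            labels = [y for _, y in train]
            rng.shuffle(labels)
            train = [(f, y) for (f, _), y in zip(train, labels)]
        model = fit_class_means(train)
        correct = sum(predict(model, f) == y for f, y in test)
        accuracies[held_out] = correct / len(test)
    return accuracies

# Hypothetical, well-separated toy data: label 0 = female, 1 = male.
samples = []
for animal, label, base in [("f1", 0, 0.0), ("f2", 0, 0.0),
                            ("m1", 1, 1.0), ("m2", 1, 1.0)]:
    for offset in (0.0, 0.1, 0.2):
        samples.append((animal, base + offset, label))

true_label_accuracy = leave_one_animal_out(samples)
control_accuracy = leave_one_animal_out(samples, shuffle_labels=True)
```

Splitting by animal rather than by vocalization keeps calls from the same individual out of both the training and the test set, so the reported accuracy reflects generalization to unseen animals.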

Fig 4.

Features alone are insufficient to explain the DNN classification performance.

A Features of individual vocalizations can also be measured using dedicated convolutional DNNs, one per feature, with the same architecture as for sex classification (see Fig 3A). B-E Classification performance for different properties was robust, ranging between 57.0 and 82.0% on average (maroon) and depending on the individual value of each property (red). We trained networks for direction ({-1,0,1}, B), the number of breaks ({0–3}, C), the number of peaks ({0–3}, D) and the degree of broadband activation ([0,1], E). For the other 2 properties (complex and tremolo), most values were close to 0 and thus the networks did not have sufficient training data. The light gray lines indicate chance performance, which depends on the number of choices for each property. The light blue bars indicate the distributions of values, also in %. F Using a non-convolutional DNN, we investigated how well sex can be predicted from the features alone, i.e. without any information about the precise spectral structure of each vocalization. G Prediction performance was above chance (maroon, 59.6±3.0%) but lower than the prediction of sex on the basis of the raw spectrograms (see Fig 3). The gray line indicates chance performance. H Feature-based prediction of sex with DNNs performed similarly to ridge regression (blue) and SVM (red, see main text for statistics). I Duration, volume and the level of broadband activation were the most significant linear predictors for sex when using ridge regression. J Using a semi-convolutional DNN, we investigated the combined predictability of the same features as above, plus 3 statistics of the stimulus (each a vector), i.e. the marginals of the spectrogram in time and frequency, as well as the spectral line, i.e. the sequence of frequencies of maximal amplitude per time-bin. K The average performance of the semi-convolutional DNN (64.5%) stays substantially lower than that of the 2D cDNN (see Fig 3D). USVs of both sexes were predicted with similar accuracy. L The average performance of the semi-convolutional DNN is not significantly larger than that of ridge regression (61.9%) or SVM (62.7%) on the same data, due to the large variability across the sexes (see Panel K).
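
The three vector statistics named in panel J can be illustrated as follows. This is a minimal sketch, assuming that the marginals are plain sums over the other axis (the caption does not say whether sums or means were used) and that the spectrogram is stored as rows of frequency bins; the function name, the toy spectrogram and the frequency values are hypothetical.

```python
def spectrogram_statistics(spec, freqs):
    """spec[f][t] is the amplitude at frequency bin f and time bin t;
    freqs[f] is the frequency associated with bin f.

    Returns (time_marginal, freq_marginal, spectral_line)."""
    n_freq, n_time = len(spec), len(spec[0])
    # Marginal in time: total amplitude per time bin (sum over frequency).
    time_marginal = [sum(spec[f][t] for f in range(n_freq))
                     for t in range(n_time)]
    # Marginal in frequency: total amplitude per frequency bin (sum over time).
    freq_marginal = [sum(row) for row in spec]
    # Spectral line: frequency of maximal amplitude in each time bin.
    spectral_line = [freqs[max(range(n_freq), key=lambda f: spec[f][t])]
                     for t in range(n_time)]
    return time_marginal, freq_marginal, spectral_line

# Hypothetical 3x3 spectrogram with bin frequencies in kHz.
toy_spec = [[0, 1, 0],
            [2, 0, 0],
            [0, 0, 3]]
toy_freqs = [40, 60, 80]
time_marginal, freq_marginal, spectral_line = spectrogram_statistics(
    toy_spec, toy_freqs)
# time_marginal -> [2, 1, 3], freq_marginal -> [1, 2, 3],
# spectral_line -> [60, 40, 80]
```

Each statistic collapses the 2D spectrogram to a 1D vector, which is why a network fed with them sees less of the joint time-frequency structure than the 2D convolutional network of Fig 3.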

Fig 5.

Deep neural network partly determines the emitter's strain from individual vocalizations.

We trained a deep neural network (DNN) with 6 convolutional and 3 fully connected layers to classify the strain (WT vs. cortexless) from the spectrogram of each vocalization (see Fig 3A for a schematic of the network structure). A The average performance of the DNN (cross-validation across recordings) was 63.4±5.3%. WT and cortexless vocalizations were classified with accuracies that did not differ significantly. B The DNN performance exceeded the performance of ridge regression (regularized linear regression, blue, 51.0±1.5%) and support vector machines (SVM, green, 55.0.2±4.1%). Bars in light colors show the corresponding estimation with randomized labels, which are all at chance level (gray line).

Fig 6.

Structural analysis of network activity and representation.

Across the layers of the network (top, columns in B/F), the activity progressively dissociated from the image level (left, A/E; compare A and B). The stimuli (A, samples, top: female; bottom: male) are initially encoded spatially in the early convolutional layers (B, left, CV), but progressively lead to more general activation. In the fully connected layers (B, right, FC), the representation becomes non-spatial. Concurrently, the sparsity (C) and the within-sex correlation between FC representations (D, red/blue) increase towards the output layer, while the across-sex correlation decreases (D, purple). The average correlation, however, stays limited to ~0.2, and thus the final performance is only achieved in the last step of pooling onto the output units. Using deconvolution [29], the network representation was also studied across layers. The representation of individual stimuli (E) became more faithful to the original across layers (F, from left to right, top: female, bottom: male sample). Correlating the deconvolved stimulus with the original stimulus exhibited a fast rise to an asymptotic value (G). In parallel, the similarity of the representation between the neurons of a layer improved through the network (H). In all plots in the right column, the error bars indicate 2 SEMs, which are vanishingly small due to the large number of vocalizations/neurons each point is based on.
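
The comparison of a deconvolved stimulus with its original (panel G) can be sketched as a correlation between two images. The caption does not name the correlation measure; the example assumes the Pearson coefficient over flattened pixel values, and the function names and toy images are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def image_correlation(original, reconstruction):
    """Correlate two 2D images (lists of rows) after flattening them."""
    flat_original = [v for row in original for v in row]
    flat_reconstruction = [v for row in reconstruction for v in row]
    return pearson(flat_original, flat_reconstruction)

# Hypothetical 2x2 "spectrogram" and a linearly rescaled reconstruction.
original = [[0, 1], [2, 3]]
rescaled = [[1, 3], [5, 7]]      # 2 * original + 1
inverted = [[3, 2], [1, 0]]      # 3 - original
r_rescaled = image_correlation(original, rescaled)
r_inverted = image_correlation(original, inverted)
```

Because the Pearson coefficient is invariant to offset and positive scaling, it measures how well the spatial pattern is preserved rather than whether absolute amplitudes match.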

Fig 7.

In-depth analysis of vocalization space indicates complex combination of properties distinguishing emitter sex.

A Low-dimensional representation of the entire set of vocalizations (t-SNE transform of the spectrograms from 10^4 to 3 dimensions) shows limited clustering and structuring, and some separation between male (blue) and female (red) emitters. See also S1 Video, which is a dynamic version of the present figure, rotating all plots for clarity. B Individual samples of vocalizations, where the bottom three originate from the separate, large male cluster in the lower left of A. They all have a similar, long low-frequency call, combined with a higher-frequency, delayed call. The male cluster contains vocalizations from all male mice, and is hence not just a property of one individual. This indicates that a subset of vocalizations is rather characteristic of the emitter's sex. C The difference (bottom) between male (top left) and female (top right) densities indicates interwoven subregions of space dominated by one sex, i.e. blue subregions indicate male-dominant vocalization types, and red subregions female-dominant ones. D Restricting to a subset of clearly identifiable vocalizations (based on DNN output certainty, <0.1 (female) and >0.9 (male)) provides only limited improvement in separation, indicating that the DNN decides based on a complex combination of subregions/spectrogram properties. E Mean frequency of the vocalization exhibits local neighborhoods on the t-SNE representation, in particular linking the dominantly male cluster with exceptionally low frequencies. F Similarly, the frequency range of vocalizations in the dominantly male cluster is comparatively high. G Lastly, the typical duration of the dominantly male cluster lies in a middle range, while not being exclusive in this respect.
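
The restriction in panel D to clearly identifiable vocalizations can be sketched as a simple threshold on the network output. The example assumes a single scalar output per vocalization in [0, 1], with values near 0 read as female and values near 1 as male, as the thresholds in the caption suggest; the function name, identifiers and output values are hypothetical.

```python
def select_confident(outputs, low=0.1, high=0.9):
    """outputs: list of (vocalization_id, dnn_output) pairs.

    Returns the ids classified with high certainty as female (< low)
    and as male (> high); everything in between is discarded."""
    female_ids = [vid for vid, out in outputs if out < low]
    male_ids = [vid for vid, out in outputs if out > high]
    return female_ids, male_ids

# Hypothetical network outputs for five vocalizations.
outputs = [("v1", 0.03), ("v2", 0.55), ("v3", 0.97), ("v4", 0.10), ("v5", 0.92)]
female_ids, male_ids = select_confident(outputs)
# female_ids -> ["v1"], male_ids -> ["v3", "v5"]
```

The strict inequalities mirror the caption, so a value exactly at a threshold (such as 0.10 above) is excluded.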
