Classifying sex and strain from mouse ultrasonic vocalizations using deep learning
Fig 3
Deep neural network reliably determines the emitter's sex from individual vocalizations.
A We trained a deep neural network (DNN) with 6 convolutional and 3 fully connected layers to classify the sex of emitter from the spectrogram of the vocalization. B The network's performance on the (Female #1) test set rapidly improved to an asymptote of ~80% (dark red), clearly exceeding chance. Correspondingly the change in the network's weights (light red) progressively decreased after stabilizing after ~6k iterations. Data shown for a representative training run. C The shape of the input fields in the first convolutional layer became more reminiscent of the tuning curves in the auditory system [50,51]. Samples are representatively chosen among the entire set of 256 units in this layer. D The average performance of the DNN (cross-validation across animals) was 76.7±6.6%, which did not differ significantly between male and female vocalizations (p>0.05, Wilcoxon rank sum test). E The DNN performance by far exceeded the performance of ridge regression (regularized linear regression, blue, 50.7±1.6%) and support vector machines (SVM, green, 56.2±1.9%). Bars in light colors show the corresponding estimation with randomized labels, which are all at chance level (gray line). F The performance by the DNN was not only limited by the properties of the spectrograms (e.g. background noise, sampling, etc.) since a DNN trained on the number of breaks (right bars) performed significantly better. This control shows that the identical set of stimuli can be better separated on a simpler (but also binary) task. Light bars again show performance on randomized labels.