Latent space visualization, characterization, and generation of diverse vocal communication signals

Animals produce vocalizations that range in complexity from a single repeated call to hundreds of unique vocal elements patterned in sequences unfolding over hours. Characterizing complex vocalizations can require considerable effort and a deep intuition about each species’ vocal behavior. Even with a great deal of experience, human characterizations of animal communication can be affected by human perceptual biases. We present here a set of computational methods that center around projecting animal vocalizations into low dimensional latent representational spaces that are directly learned from data. We apply these methods to diverse datasets from over 20 species, including humans, bats, songbirds, mice, cetaceans, and nonhuman primates, enabling high-powered comparative analyses of unbiased acoustic features in the communicative repertoires across species. Latent projections uncover complex features of data in visually intuitive and quantifiable ways. We introduce methods for analyzing vocalizations as both discrete sequences and as continuous latent variables. Each method can be used to disentangle complex spectro-temporal structure and observe long-timescale organization in communication. Finally, we show how systematic sampling from latent representational spaces of vocalizations enables comprehensive investigations of perceptual and neural representations of complex and ecologically relevant acoustic feature spaces.

echolocation clicks into UMAP latent space. Species identity again falls out nicely, with clicks assorting into distinct clusters that correspond to each species (Fig 3B).

Population geography Some vocal learning species produce different vocal repertoires (dialects) across populations. Differences in regional dialects across populations are borne out in the categorical perception of notes [53][54][55], much the same as cross-linguistic differences in the categorical perception of phonemes in human speech [56]. To compare vocalizations across geographical populations in the swamp sparrow, which produces regional dialects in its trill-like songs, […]

Phonological features The sound segments that make up spoken human language can be described by distinctive phonological features that are grouped according to place and manner of articulation, glottal state, and vowel space. A natural way to look more closely at variation in phoneme production is to examine variation between phonemes that share the same phonological features. As an example, we projected sets of consonants sharing individual phonological features into UMAP latent space (Figs 4, 18). In most cases, individual phonemes tended to project to distinct regions of latent space based upon phonetic category, consistent with their perceptual categorization. At the same time, we note that latent projections vary smoothly from one category to the next rather than falling into discrete clusters. This provides a framework that could be used in future work to characterize the distributional properties of […]

Unlike human speech, UMAP projections of birdsong fall more neatly into discriminable clusters (Fig 1). If clusters in latent space are highly similar to experimenter-labeled element categories, unsupervised latent clustering could provide an automated and less time-intensive alternative to hand-labeling elements of vocalizations.
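The core projection step behind these analyses can be sketched in a few lines. The sketch below is a toy illustration, not the paper's pipeline: it builds synthetic "spectrograms" of two hypothetical syllable types and uses scikit-learn's PCA as a dependency-light stand-in for UMAP (umap-learn's `umap.UMAP` exposes the same `fit_transform` interface).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def toy_syllable(f0, n_freq=32, n_time=20):
    """Synthetic 'spectrogram': a band of energy around frequency bin f0."""
    S = rng.normal(0, 0.05, (n_freq, n_time))
    S[f0 - 2:f0 + 2, :] += 1.0
    return S

# two toy syllable types centered at different frequencies
specs = [toy_syllable(8) for _ in range(50)] + [toy_syllable(24) for _ in range(50)]
X = np.stack([s.ravel() for s in specs])       # each syllable -> one flat vector

# 2-D latent projection (PCA stand-in; swap in umap.UMAP().fit_transform(X))
Z = PCA(n_components=2).fit_transform(X)

# the two syllable types should separate along the first latent axis
gap = abs(Z[:50, 0].mean() - Z[50:, 0].mean())
print(Z.shape, bool(gap > 1.0))
```

Swapping the PCA line for a UMAP instance changes only the projection; the surrounding bookkeeping (spectrogram to flat vector to low-dimensional point) is the same.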
To examine this, we compared how well clusters in latent space correspond to experimenter-labeled categories in three human-labeled datasets: two separate Bengalese finch datasets [57,58], and one Cassin's vireo dataset [7]. We compared three different labeling techniques: a hierarchical density-based clustering algorithm (HDBSCAN; [59]) applied over latent projections in UMAP, k-means [60] clustering applied over UMAP, and k-means clustering applied over spectrograms (Fig 5; Table 1). To make the k-means algorithm more competitive with HDBSCAN, we set the number of clusters in k-means equal to the number of clusters in the hand-labeled dataset, while HDBSCAN was not parameterized at all. We computed the similarity between hand and algorithmically labeled datasets using four different metrics (see Methods section). For all three datasets, HDBSCAN clustering over UMAP projections is most similar to hand labels and visually overlaps best with clusters in latent space (Fig 5; Table 1). These results show that latent projections facilitate unsupervised clustering of vocal elements into human-like syllable categories better than spectrographic representations alone. At the same time, latent clusters do not always exactly match experimenter labels, a phenomenon that we explore in greater depth in the next section.

Animal vocalizations are not always composed of single, discrete, temporally-isolated elements (e.g. notes, syllables, or phrases), but often occur as temporally patterned sequences of these elements. The latent projection methods described above can be used to abstract corpora of song elements that can then be used for syntactic analyses [3].
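A minimal version of this labeling comparison can be sketched as follows, under stated assumptions: synthetic blobs stand in for 2-D latent projections with "hand labels", DBSCAN stands in for HDBSCAN as a widely available density-based clusterer, and k-means is handed the true number of clusters, mirroring the parameterization above. Agreement is scored with adjusted mutual information, one of the four metrics used in the paper.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_mutual_info_score

# stand-in for 2-D latent projections of syllables, with known "hand labels"
Z, hand = make_blobs(n_samples=300, centers=[[0, 0], [5, 0], [0, 5], [5, 5]],
                     cluster_std=0.4, random_state=0)

# density-based clustering needs no preset number of clusters
db = DBSCAN(eps=0.5, min_samples=5).fit_predict(Z)

# k-means is given the "true" number of clusters, as in the comparison above
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(Z)

ami_db = adjusted_mutual_info_score(hand, db)
ami_km = adjusted_mutual_info_score(hand, km)
print(round(ami_db, 2), round(ami_km, 2))
```

On well-separated blobs both methods score near 1; the practical difference emerges on real latent projections, where density-based clustering handles unevenly shaped clusters and noise points without a preset cluster count.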

As an example of this, we derived a corpus of symbolically segmented vocalizations from a dataset of Bengalese finch song using latent projections and clustering (Fig 6). Bengalese finch song bouts comprise a small number (~5-15) of highly stereotyped syllables produced in well-defined temporal sequences a few dozen syllables long [4]. We first projected syllables from a single Bengalese finch into UMAP latent space, then visualized transitions between vocal elements in latent space as line segments between points (Fig 6B), revealing highly regular patterns. To abstract this organization to a grammatical model, we clustered latent projections into discrete categories using HDBSCAN.

[Fig 5 caption: (A) Points are colored by their hand-labeled categories, which generally fall into discrete clusters in UMAP space. Each other frame is the same scatterplot, where colors are cluster labels produced using (B) k-means over UMAP projections, (C) k-means directly on syllable spectrograms, or (D) HDBSCAN on UMAP projections.]

Sequential organization is tied to transcription method As we previously noted, hand labels and latent cluster labels of birdsong syllables generally overlap (e.g. Fig 5), but may disagree for a sizable minority of syllables (Table 1). To contrast the two labeling methods, we first took the two Bengalese finch song datasets, projected syllables into latent space, and visualized them using the hand transcriptions provided by the datasets (Fig 7A,H). We then took the syllable projections and clustered them using HDBSCAN. In both datasets, we find that many individual hand-transcribed syllable categories are comprised of multiple HDBSCAN-labelled clusters in latent space (Fig 7A,B,H,I).
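Once syllables carry discrete cluster labels, abstracting the corpus to a grammatical model amounts to counting label-to-label transitions. A minimal sketch with a hypothetical toy corpus (the bout sequences and labels below are invented for illustration):

```python
import numpy as np

# toy corpus: each bout is a sequence of latent-cluster labels
bouts = [["a", "b", "b", "c"], ["a", "b", "c"], ["a", "b", "b", "b", "c"]]

labels = sorted({s for bout in bouts for s in bout})
idx = {s: i for i, s in enumerate(labels)}

# first-order transition counts between successive elements
T = np.zeros((len(labels), len(labels)))
for bout in bouts:
    for prev, nxt in zip(bout, bout[1:]):
        T[idx[prev], idx[nxt]] += 1

# row-normalize to transition probabilities; terminal labels keep all-zero rows
rowsum = T.sum(axis=1, keepdims=True)
T = np.divide(T, rowsum, out=np.zeros_like(T), where=rowsum > 0)

print(T[idx["a"], idx["b"]], T[idx["b"], idx["b"]])  # -> 1.0 0.5
```

The resulting matrix is a first-order Markov abstraction; higher-order structure of the kind present in real birdsong calls for richer sequence models.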

To compare the different sequential abstractions of the algorithmically transcribed and the hand-transcribed labels, we visualized the transitions between syllables in latent space (Fig 7C,J). These visualizations reveal that […]

Not all vocal repertoires are made up of elements that fall into highly discrete clusters in latent space (Fig 1). For several of our datasets, categorically discrete elements are not readily apparent, making analyses such as those performed in Figure 6 more difficult. In addition, many vocalizations are difficult to segment temporally, and determining what features to use for segmentation requires careful consideration [1]. In many bird songs, for example, clear pauses exist between song elements that enable one to distinguish syllables. In other vocalizations, however, experimenters must rely on less well-defined physical features for segmentation [1,12], which may in turn invoke a range of biases and unwarranted assumptions. At the same time, much of the research on animal vocal production, perception, and sequential organization relies on identifying "units" of a vocal repertoire [1]. To better understand the effects of temporal discretization and categorical segmentation in our analyses, we considered vocalizations as continuous trajectories in latent space and compared the resulting representations to those that treat vocal segments as single points (as in the previous finch example in Fig. 6). We explored four datasets, ranging from highly discrete clusters of vocal elements […]

European starling song provides an interesting case study for exploring the sequential organization of song using continuous latent projections because starling song is more sequentially complex than Bengalese finch song, but is still highly stereotyped and has well-characterized temporal structure. European starling song is comprised of a large […] discretized, they are relatively clusterable (Fig 1); however, syllables tend to vary somewhat continuously (Fig 9D).

To analyze starling song independent of assumptions about segment (motif) boundaries and element categories, we projected bouts of song from a single male European starling into UMAP trajectories using the same methods as in Figure 8. We find that the broad structure of song bouts is highly repetitive across renditions, but that each bout contains elements that are variable across bout renditions. For example, in Figure 9A […] utility of continuous latent trajectories as a viable alternative to discrete methods for analyzing song structure even with highly complex, many-element song.

House mice produce ultrasonic vocalizations (USVs) comprising temporally discrete syllable-like elements that are hierarchically organized and produced over long timescales, generally lasting seconds to minutes [65]. When analyzed for temporal structure, mouse vocalizations are typically segmented into temporally discrete USVs and then categorized into discrete clusters [1,39,[65][66][67]] in a manner similar to syllables of birdsong. As Figure 1 shows, however, USVs do not cluster into discrete distributions in the same manner as birdsong. Choosing different arbitrary clustering heuristics will therefore have profound impacts on downstream analyses of sequential organization [39]. We sought to better understand the continuous variation present in mouse USVs, and to explore their sequential organization, by projecting USVs as trajectories (Fig 10E) in UMAP latent space using similar methods as with starlings (Fig. 8) and finches (Fig. 9). In Figure 10, we use a single recording of one individual producing 1,590 USVs (Fig. 10G) over 205 seconds as a case study to examine the categorical and sequential organization of USVs. We projected every USV produced in that sequence as a trajectory in UMAP latent space (Fig. 10A,C,D).
Similar to our observations in Figure 1I using discrete segments, we do not observe clear element categories within continuous trajectories, as are observed for Bengalese finch song (e.g. Fig 8I).
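The trajectory construction used here (and for starling song above) can be sketched with toy data. The example below builds a synthetic spectrogram containing a frequency sweep, slides a short window along time (the paper's window lengths differ by dataset), and projects each window to a 2-D point, using PCA as a stand-in for UMAP so the sketch stays dependency-light. Successive points then trace a continuous trajectory rather than a single point per segment.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# toy spectrogram of one vocalization: a rising frequency sweep plus noise
n_freq, n_time = 32, 100
spec = rng.normal(0, 0.05, (n_freq, n_time))
for t in range(n_time):
    f = int(5 + 20 * t / n_time)          # pitch rises over time
    spec[f:f + 3, t] += 1.0

# slide a short window along time; each window becomes one latent point
win = 8
windows = np.stack([spec[:, t:t + win].ravel() for t in range(n_time - win)])

# 2-D projection (PCA stand-in for UMAP); successive rows trace a trajectory
traj = PCA(n_components=2).fit_transform(windows)

# the trajectory is smooth: consecutive points are close relative to its extent
steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
spread = np.linalg.norm(traj.max(axis=0) - traj.min(axis=0))
print(traj.shape, bool(steps.mean() < spread))
```

Because adjacent windows overlap heavily, adjacent latent points are necessarily nearby, which is what lets an entire vocalization be read as a curve through latent space.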

To explore the categorical structure of USVs further, we reordered all of the USVs in Figure 10G by the similarity of their latent trajectories (Fig. 10F) and plotted them side-by-side (Fig. 10H). Both the similarity matrix of the latent trajectories (Fig. 10F) and the similarity-reordered spectrograms (Fig. 10H) show that while some USVs are similar to their neighbors, no highly stereotyped USV categories are observable.
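One simple way to build such a similarity-based reordering is sketched below, under stated assumptions: variable-length toy trajectories of two shapes are linearly resampled to a fixed length (a stand-in for whatever trajectory-alignment the analysis uses), pairwise distances form the similarity matrix, and hierarchical-clustering leaf order supplies the reordering that places similar trajectories side by side.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(2)

def resample(traj, n=20):
    """Linearly resample a (T, 2) latent trajectory to n points."""
    t_old = np.linspace(0, 1, len(traj))
    t_new = np.linspace(0, 1, n)
    return np.column_stack([np.interp(t_new, t_old, traj[:, d]) for d in range(2)])

# toy trajectories of two shapes (rising vs flat), with variable lengths
trajs = [np.column_stack([np.linspace(0, 1, L), np.linspace(0, 1, L)])
         + rng.normal(0, 0.02, (L, 2)) for L in rng.integers(10, 30, 10)]
trajs += [np.column_stack([np.linspace(0, 1, L), np.zeros(L)])
          + rng.normal(0, 0.02, (L, 2)) for L in rng.integers(10, 30, 10)]

X = np.stack([resample(t).ravel() for t in trajs])   # fixed-length feature per USV
D = squareform(pdist(X))                             # pairwise distance matrix

# reorder by hierarchical-clustering leaf order so similar items sit together
order = leaves_list(linkage(pdist(X), method="average"))
print(D.shape, len(order))
```

Plotting `D[order][:, order]` (and the spectrograms in the same order) yields the kind of similarity-sorted display shown in Fig. 10F,H.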

Although USVs do not aggregate into clearly discernible, discrete clusters, the temporal organization of USVs within the vocal sequence is not random: some latent trajectories are more frequent at different parts of the vocalization.

In Figure 10A, we color-coded USV trajectories according to each USV's position within the sequence. The local similarities in coloring (e.g., the purple and green hues) indicate that specific USV trajectories tend to occur in distinct parts of the sequence. Arranging all of the USVs in order (Fig. 10G) makes this organization more evident: one can see that shorter and lower-amplitude USVs tend to occur more frequently at the end of the sequence. To visualize the vocalizations as a sequence of discrete elements, we plotted the entire sequence of USVs (Fig. 10I), with colored labels representing each USV's position in the reordered similarity matrix (in a similar manner as the discrete category labels in Fig. 6E). In this visualization, one can see that different colors dominate different parts of the sequence, again reflecting that shorter and quieter USVs tend to occur at the end of the sequence.

[…] 'd', 's', or 'w' phonemes, respectively. This results in differences in the pronunciation of 'ey' across words (Fig 11F).

362
Co-articulation explains much of the acoustic variation observed within phonetic categories. Abstracting to phonetic categories therefore discounts much of this context-dependent acoustic variance. We explored co-articulation in speech by projecting sets of words differing by a single phoneme (i.e. minimal pairs) into continuous latent spaces, then extracting trajectories of words and phonemes that capture sub-phonetic context-dependency (Fig. 11). We obtained the words from the same Buckeye corpus of conversational English used in Figures 1, 4, and 18. We computed spectrograms over all examples of each target word, then projected sliding 4-ms windows from each spectrogram into UMAP latent space to yield a continuous vocal trajectory over each word (Fig. 11). We visualized trajectories by their corresponding word and phoneme labels (Fig. 11B,C) and computed the average latent trajectory for each word and phoneme (Fig. 11D,E). The average trajectories reveal context-dependent variation within phonemes caused by co-articulation. For example, the words 'way', 'day', and 'say' each end in the same phoneme ('ey'; Fig. 11A-F), which appears as an overlapping region in the latent space (the red region in Fig 11C). The endings of each average word trajectory vary, however, indicating that the production of 'ey' differs based on its specific context (Fig 11D). This difference can be observed in the average latent trajectory over each word, where the trajectories for 'day' and 'say' end in a sharp transition, while the trajectory for 'way' is smoother (Fig 11D). These differences are apparent in Figure 11F, which shows examples of each word's spectrogram accompanied by its corresponding phoneme labels and color-coded latent trajectory.
In the production of 'say' and 'day' a more abrupt transition occurs in latent space between 's'/'d' and 'ey', as indicated by the yellow to blue-green transition above the spectrogram for 'say' and the pink to blue-green transition above 'day'. For 'way', in contrast, a smoother transition occurs from the purple region of latent space corresponding to 'w' to the blue-green region corresponding to 'ey'.
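The notion of context-dependent trajectory endings can be made quantitative by measuring where two average trajectories for the same phoneme separate. The sketch below is a toy construction, not corpus data: two hand-built trajectories share an initial course and then diverge, standing in for the same phoneme produced in two word contexts, and the divergence point is located by thresholding the point-by-point distance.

```python
import numpy as np

# toy average latent trajectories for the same phoneme in two word contexts:
# identical early on, then diverging toward the upcoming phoneme
t = np.linspace(0, 1, 50)
traj_take = np.column_stack([t, np.where(t < 0.6, 0.0, (t - 0.6) * 2)])
traj_talk = np.column_stack([t, np.where(t < 0.6, 0.0, -(t - 0.6) * 2)])

# point-by-point distance between the two context-dependent trajectories
d = np.linalg.norm(traj_take - traj_talk, axis=1)

# first time bin where the contexts diverge beyond a small tolerance
diverge = int(np.argmax(d > 0.05))
print(diverge / len(t))  # fraction of the phoneme produced before divergence
```

Applied to real average trajectories (e.g. 't' in 'take' vs. 'talk'), the same measure would report how far into the phoneme anticipatory co-articulation begins.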

Latent space trajectories can reveal other co-articulations as well. In Figure 11G, we show the different trajectories characterizing the phoneme 't' in the context of the word 'take' versus 'talk'. In this case, the 't' phoneme follows a similar trajectory for both words until it nears the next phoneme ('ey' vs. 'ao'), at which point the production of 't' diverges between the two words. A similar example can be seen for co-articulation of the phoneme 'eh' in the words 'them' versus 'then' (Fig. 11H). These examples show the utility of latent trajectories in describing sub-phonemic variation in speech signals in a continuous manner rather than as discrete units.

[…] repertoires in complex natural feature spaces. To do this, latent models must be bidirectional: in addition to projecting vocalizations into latent space, they must also sample from latent space to generate novel vocalizations. That is, where dimensionality reduction only needs to project from vocalization space (X) to latent space (Z), X → Z, generativity requires bidirectionality: X ↔ Z. In the following section we discuss and explore the relative merits of a series of neural network architectures that are designed to both reduce dimensionality and generate novel data. […] translates from X → Z and a decoder which translates from Z → X (Fig. 12A). The network is trained on a single error function: to reconstruct in X as well as possible. Because this reconstruction passes through a reduced-dimensional latent layer (Z), the encoder learns an encoding in Z that compressively represents the data, and the decoder learns to […]

[…] network architectures on higher-dimensional spectrograms of European starling syllables with a 128-dimensional latent space (Fig. 14). We plotted reconstructions of syllables as J-diagrams [76], which show both reconstructions and morphs generated through latent interpolations between syllables [30,31].
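The autoencoder idea described above (an encoder X → Z and a decoder Z → X trained on a single reconstruction error) can be sketched minimally without any deep-learning framework. The example below is a toy linear autoencoder on synthetic data, trained by plain gradient descent; the paper's networks are deep and convolutional, but the bidirectional X ↔ Z logic, including decoding interpolated latent points into "morphs", is the same.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy "spectrogram" vectors lying near a 2-D subspace of a 64-D space
basis = rng.normal(size=(2, 64))
X = rng.normal(size=(200, 2)) @ basis + rng.normal(0, 0.01, (200, 64))

# linear autoencoder: encoder W_e (X -> Z), decoder W_d (Z -> X)
W_e = rng.normal(0, 0.1, (64, 2))
W_d = rng.normal(0, 0.1, (2, 64))

lr = 1e-3
err0 = np.mean((X @ W_e @ W_d - X) ** 2)   # reconstruction error before training
for _ in range(500):
    Z = X @ W_e                            # encode
    X_hat = Z @ W_d                        # decode
    G = 2 * (X_hat - X) / len(X)           # gradient of the reconstruction loss
    W_d -= lr * Z.T @ G
    W_e -= lr * X.T @ (G @ W_d.T)
err1 = np.mean((X @ W_e @ W_d - X) ** 2)   # reconstruction error after training

# bidirectionality: interpolate in Z and decode to generate a "morph"
z_a, z_b = X[0] @ W_e, X[1] @ W_e
morph = ((z_a + z_b) / 2) @ W_d

print(bool(err1 < err0), morph.shape)
```

Because the latent layer is the bottleneck, the encoder is forced to compress; and because the decoder maps any latent point back to data space, systematic sampling along latent interpolations yields the morph continua used later in the behavioral experiments.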
Across networks, we observe that syllables generated with AEs (Fig 14A,B) appear more smoothed over, while reconstructions using adversarial-based networks appear less smoothed but match the original syllables less closely (Fig 14C,D). […] interpolations between pairs of syllables in the same manner as was shown in Figure 14A. We sampled six acoustically distinct syllables of song; three were arbitrarily assigned to one response class and three to the other. Interpolations […] (Fig. 15B,E) that we used as playback stimuli in our behavioral experiment.

Using an established operant conditioning paradigm ([82]; Fig. 15A), we trained six European starlings to associate each morph with a peck to either the left or right response port to receive food. The midpoint of each motif continuum was set as the categorical boundary. Birds learned to peck the center response port to begin a trial and initiate presentation of a syllable from one of the morph continua, which the bird then classified by pecking either the left or right response port.

Correct classification led to intermittent reinforcement with access to food; incorrect responses triggered a brief (1-5 second) period during which the house light was extinguished and food was inaccessible. Each bird learned the task to a level of proficiency well above chance (~80%-95%; chance = 50%), and a psychometric function of pecking behavior (Fig. 15B) […]

[…] computed the PSTH of that unit's response to the stimulus over repeated presentations, then convolved the PSTH with a 5-ms Gaussian kernel to get an instantaneous spike rate vector over time for each of the stimuli (Fig. 15C). Figure 15D shows an example of the spike rate vector (as in 15C) for each of the stimuli in a single morph continuum for each of 35 putative neurons recorded simultaneously from one recording site. Figure 15F shows the similarity between […]

[…] including songbirds, primates, rodents, bats, and cetaceans (Fig. 1). In general, songbirds tend to produce signals that cluster discretely in latent space, whereas mammalian vocalizations are more uniformly distributed. This observation deserves much closer attention with even more species. We also showed that complex features of datasets, such as individual identity (Fig. 2) and species identity (Fig. 3A,B), […] our methods show that a priori feature-based compression is not a prerequisite to progress in understanding behaviorally relevant acoustic diversity. We used these latent projections to visualize sequential organization and abstract sequential models of song (Fig. 6), and demonstrated that in some cases latent approaches confer advantages over hand labeling or supervised learning (Fig. 7). We also projected vocalizations as continuous trajectories in latent space (Figs. 8, 9, 10, and 11). This provides a powerful method for studying sequential organization without discretizing vocal sequences [1].
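The instantaneous firing-rate estimate mentioned above (a trial-averaged PSTH convolved with a 5-ms Gaussian kernel) is a standard computation and can be sketched directly. The spike times below are invented for illustration; bin size and window length are assumptions of the sketch, not values from the paper.

```python
import numpy as np

# spike times (s) for one unit across repeated presentations of one stimulus
trials = [np.array([0.10, 0.12, 0.30]),
          np.array([0.11, 0.31]),
          np.array([0.09, 0.13, 0.29])]

bin_ms, sigma_ms, dur_ms = 1, 5, 500
edges = np.arange(0, dur_ms + bin_ms, bin_ms) / 1000.0   # 1 ms bins, in seconds

# PSTH: mean spike count per bin across trials, converted to a rate in Hz
counts = np.mean([np.histogram(t, bins=edges)[0] for t in trials], axis=0)
rate = counts / (bin_ms / 1000.0)

# convolve with a 5 ms Gaussian kernel (unit area) for an instantaneous rate
k_t = np.arange(-3 * sigma_ms, 3 * sigma_ms + 1, bin_ms)
kernel = np.exp(-k_t ** 2 / (2 * sigma_ms ** 2))
kernel /= kernel.sum()
smooth = np.convolve(rate, kernel, mode="same")

print(len(smooth), bool(abs(smooth.sum() - rate.sum()) < 1e-6))
```

Because the kernel is normalized to unit area, smoothing redistributes spikes in time without changing the total spike count, so the smoothed vector is a faithful rate estimate that can be compared across stimuli along a morph continuum.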

In addition, we surveyed several deep neural network architectures (Fig. 12) that learn latent representations of vocal repertoires and systematically generate novel syllables from features in latent space (Figs. 13, 14). Finally, we gave an example of how these methods can be combined in a behavioral experiment to study perception with psychometric precision, and in an acute electrophysiological experiment to understand representational encoding of parametrically varying natural vocal signals (Fig. 15).

[…] segmenting vocalizations into discrete temporal units. In many species, temporally segmenting vocalizations into discrete elements is a natural step in representing vocal data. In birdsong, for example, temporally distinct syllables are often well defined by clear pauses between highly stereotyped syllable categories. In many other species, however, vocal elements are either less clearly stereotyped or less temporally distinct, and methods for segmentation can vary based upon changes in a range of acoustic properties, similar sounds, or higher-order organization [1]. These constraints force experimenters to make decisions that can have profound effects on downstream analyses [29,39]. We projected continuous latent representations of vocalizations ranging from highly stereotyped Bengalese finch song to highly variable mouse USVs, and found that continuous latent projections effectively described useful aspects of spectro-temporal structure and sequential organization. In human speech, we found that continuous latent variable projections were able to capture sub-phoneme temporal dynamics that correspond to co-articulation.
Collectively, our results show that continuous latent representations of vocalizations provide an alternative to discrete segment-based representations that remains agnostic to segment boundaries and does not require discretizing vocalizations into elements or symbolic categories. Of course, where elements can be clustered into clear and discrete categories, it is important to do so. The link from temporally continuous vocalizations to symbolically discrete sequences will be an important target for future investigations.

Choosing a network architecture The generative neural networks and machine learning models presented here are only a tiny sample of a rapidly growing and changing field. We did not explore many of the potentially […] output. The other network architectures we surveyed, as well as many emerging network architectures and algorithms, may offer promising avenues for generating even more realistic, higher-fidelity vocal data, and for learning structure-rich latent feature spaces. Our brief survey is not meant to be exhaustive, but rather to serve as an introduction to many of the potentially rich uses of existing and future neural networks in generating and sampling from latent representational spaces of vocal data.

Future work The work presented here is a first step in exploring the potential power of latent and generative techniques in modeling animal communication. We touch only briefly on a number of questions that we find interesting and important within the field of animal communication. Other researchers may certainly want to target other questions, and we hope that some of these techniques (and the provided code) can be adapted in that service. Our analyses were taken from a diverse range of animals, sampled in diverse conditions both in the wild and in the laboratory, and are thus not well controlled for variability between species. Certainly, as bioacoustic data becomes more open and readily available, testing large cross-species hypotheses will become more plausible. We introduced several areas in which latent models can act as a powerful tool to visually and quantitatively explore complex variation in vocal data. These methods are not restricted to bioacoustic data, however; indeed, many were designed originally for image processing. We hope that the work presented here will encourage a larger incorporation of latent and unsupervised […]

[…] For all other datasets, we used a segmentation algorithm we call dynamic threshold segmentation (Fig. 16A). The goal of the algorithm is to segment vocalization waveforms into discrete elements (e.g. syllables) that are defined as regions of continuous vocalization surrounded by silent pauses. Because vocal data often sits atop background noise, the boundary between silence and vocal behavior is set as a threshold on the vocal envelope of the waveform. The purpose of the dynamic thresholding algorithm is to set that noise threshold dynamically, based upon assumptions about the underlying signal such as the expected length of a syllable or of a period of silence.
The algorithm first generates a spectrogram, thresholding power in the spectrogram below a set level to zero. It then generates a vocal envelope from the power of the spectrogram, defined for each time bin as the maximum power over the frequency components times the square root of the average power over the frequency components:

$E(t) = \max_f S(f, t) \cdot \sqrt{\frac{1}{F} \sum_{f=1}^{F} S(f, t)}$

where $S(f, t)$ is the thresholded spectrogram and $F$ is the number of frequency components.

[…] a set $Y$ of $m$ data points sampled from either a uniform or Gaussian distribution. We chose to sample $Y$ from a uniform distribution over the convex subspace of $X$. The Hopkins statistic is then computed as:

$H = \frac{\sum_{i=1}^{m} w_i^d}{\sum_{i=1}^{m} u_i^d + \sum_{i=1}^{m} w_i^d}$

where $u_i$ is the distance of $y_i \in Y$ from its nearest neighbor in $X$, $w_i$ is the distance of $x_i \in X$ from its nearest neighbor in $X$, and $d$ is the dimensionality of the data. Thus, if the real dataset is more clustered than the sampled dataset, the Hopkins statistic will approach 0, and if the dataset is no more clustered than the randomly sampled dataset, the Hopkins statistic will sit near 0.5. Note that the Hopkins statistic is also commonly computed with $\sum_{i=1}^{m} u_i^d$ in the numerator rather than $\sum_{i=1}^{m} w_i^d$; under that convention, values closer to 1 indicate higher clusterability and values closer to 0.5 indicate chance-level clusterability. We chose the former method because the range of Hopkins statistics across datasets was more easily visible when log-transformed.
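A compact implementation of this Hopkins measure is sketched below. Assumptions of the sketch: the reference set is drawn uniformly over the bounding box of the data (a simple stand-in for the convex subspace described above), nearest neighbors are found with a k-d tree, and the toy clustered/uniform datasets are invented to show the two regimes of the statistic.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)

def hopkins(X, m=50):
    """Hopkins statistic with the real-data distances (w) in the numerator,
    so clustered data -> near 0 and unstructured data -> near 0.5."""
    d = X.shape[1]
    tree = cKDTree(X)
    Y = rng.uniform(X.min(axis=0), X.max(axis=0), (m, d))  # uniform reference set
    u = tree.query(Y, k=1)[0]                # sampled point -> nearest real point
    idx = rng.choice(len(X), m, replace=False)
    w = tree.query(X[idx], k=2)[0][:, 1]     # real point -> nearest *other* real point
    return (w ** d).sum() / ((u ** d).sum() + (w ** d).sum())

clustered = np.concatenate([rng.normal(0, 0.05, (100, 2)),
                            rng.normal(3, 0.05, (100, 2))])
uniform = rng.uniform(0, 3, (200, 2))

h_clustered = hopkins(clustered)
h_uniform = hopkins(uniform)
print(round(h_clustered, 2), round(h_uniform, 2))
```

Two tight blobs score near 0 while uniform points score near 0.5, matching the interpretation given above.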

Comparing algorithmic and hand-transcriptions Several different metrics can be used to measure the overlap between two separate labeling schemes. We used four metrics that capture different aspects of similarity to compare hand labeling to algorithmic clustering methods ([60]; Table 1). Adjusted Mutual Information is an information-theoretic measure that quantifies the agreement between the two sets of labels, normalized against chance. Completeness measures the extent to which members of the same class (hand label) fall into the same cluster (algorithmic label). Homogeneity measures the extent to which each cluster contains members of only a single class. V-Measure is the harmonic mean of homogeneity and completeness. We found that HDBSCAN over UMAP showed higher similarity to human labeling than k-means on nearly all metrics across all three datasets.
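All four metrics are available in scikit-learn and behave as described. The toy labelings below are invented to make the asymmetry between homogeneity and completeness visible: the algorithmic labeling splits one hand-labeled class into two clusters, so every cluster stays pure (homogeneity = 1) while the split class drags completeness below 1.

```python
from sklearn.metrics import (adjusted_mutual_info_score, completeness_score,
                             homogeneity_score, v_measure_score)

# hand labels vs. an algorithmic labeling that splits class "a" into two clusters
hand = ["a", "a", "a", "a", "b", "b"]
algo = [0, 0, 1, 1, 2, 2]

hom = homogeneity_score(hand, algo)    # every cluster is pure -> 1.0
com = completeness_score(hand, algo)   # class "a" is split -> below 1.0
vm = v_measure_score(hand, algo)       # harmonic mean of the two
ami = adjusted_mutual_info_score(hand, algo)

print(round(hom, 2), round(com, 2), round(vm, 2), round(ami, 2))
```

This is exactly the pattern reported for latent clusters that subdivide hand-labeled syllable categories: high homogeneity, reduced completeness.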

Data Availability All of the vocalization datasets used in this study were acquired from external sources, most of 682 them hosted publicly online. The behavioral and neural data are part of a larger project and will be released alongside 683 that manuscript.

Code Availability The python code written specifically for this paper is available at Github.com/timsainb/AVGN_paper. A cleaner and more maintained code base is additionally available at Github.com/timsainb/AVGN.