The detection of algebraic auditory structures emerges with self-supervised learning

doi:10.1371/journal.pcbi.1013271

Fig 1.

A: schematic of the structure of the sounds tested over the experiments.

The right column indicates the experimental paper that tested how humans detect the corresponding structure. B: schematic presentation of the Wav2vec2 model used. d refers to the dimensionality of the features at each layer. The model takes the waveform as input, so d = 1 at the beginning. We measure the model’s surprise as the contrastive loss for each sound element in the sound. This loss is measured by masking the 20 ms (50 Hz) latent vectors whose receptive fields overlap with the sound element. This operation is repeated to obtain the model’s surprise for each sound element. C: Models are initially pretrained on one of four possible datasets: one for speech (Librispeech [42]), one for music (FMA [43]), one for environmental sounds (Audioset [44], deprived of speech and music), and one that combines a subset of those. To see if and when structure detection emerges, we test the models at many checkpoints throughout pretraining. Modeled studies: [8,14,21,23].
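
The frame selection described in panel B can be sketched as follows. This is a minimal illustration, not the authors' code: the 16 kHz sampling rate, the 320-sample stride (20 ms) and the 400-sample receptive field are assumptions taken from the standard wav2vec 2.0 configuration, and the function name `frames_overlapping` is hypothetical. The caption itself only states the 20 ms (50 Hz) latent rate.

```python
def frames_overlapping(onset_s, duration_s, n_frames,
                       sample_rate=16000, stride=320, receptive_field=400):
    """Return indices of latent frames whose receptive field overlaps a sound element.

    All defaults are assumptions (standard wav2vec 2.0 values), not values
    given in the caption. Times are converted to samples to avoid
    floating-point boundary effects.
    """
    onset = round(onset_s * sample_rate)
    offset = onset + round(duration_s * sample_rate)
    indices = []
    for i in range(n_frames):
        start = i * stride              # first sample seen by frame i
        end = start + receptive_field   # one past the last sample seen by frame i
        if start < offset and end > onset:
            indices.append(i)
    return indices


# A 50 ms element starting at 100 ms, in a 1 s sound (50 latent frames).
to_mask = frames_overlapping(0.100, 0.050, n_frames=50)
print(to_mask)
```

Under these assumptions, the returned indices are the latent vectors that would be masked before reading out the contrastive loss for that element; repeating the call for each element onset gives one surprise value per element.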

Fig 2.

A–C) Zero-shot emergence of word chunking.

A. Example of a syllable stream tested. B. The model's contrastive loss for the first, second and third syllable is measured for each triplet of syllables (word) in a syllable stream. The loss is averaged over 30 trials of the task. The standard deviation of the loss across these trials is indicated by the shaded area (see Fig D in S1 Material for zooms). C. Evolution of the model's ability to detect regular words as a function of its pretraining. This ability is measured by the difference between the mean contrastive loss of the second random sequence and that of the last repetition of the regular sequence. D-E-F. Same as A-B-C but with repeated tone sequences with a cycle size of 5 or 20 tones. G-H-I. Same as A-B-C but with a nested algebraic pattern from the Local Global paradigm. J-K-L. Same as A-B-C but with a set of 10 algebraic patterns of increasing complexity. In panel K, the “alternate” and “center-mirror” sequences are plotted, with complexity 6 and 21, respectively. Modeled studies: [8,14,21,23].
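
The summary statistics in panels B and C can be sketched as below. This is an illustrative reading of the caption, not the authors' implementation: the function names, the span arguments and the use of the population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def mean_and_std_across_trials(trial_losses):
    """Average per-element losses over trials (rows are trials, columns are elements).

    Population standard deviation is an assumption; the caption does not
    say which estimator was used.
    """
    n_elements = len(trial_losses[0])
    columns = [[trial[i] for trial in trial_losses] for i in range(n_elements)]
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def detection_index(mean_losses, random_span, regular_span):
    """Mean loss over the random span minus mean loss over the regular span.

    Spans are hypothetical (start, stop) element indices marking the second
    random sequence and the last repetition of the regular sequence.
    """
    random_loss = mean(mean_losses[random_span[0]:random_span[1]])
    regular_loss = mean(mean_losses[regular_span[0]:regular_span[1]])
    return random_loss - regular_loss


# Toy data: 2 trials, 4 elements (2 regular elements followed by 2 random ones).
losses = [[1.0, 1.0, 3.0, 3.0],
          [1.0, 1.0, 5.0, 5.0]]
means, stds = mean_and_std_across_trials(losses)
index = detection_index(means, random_span=(2, 4), regular_span=(0, 2))
print(means, stds, index)
```

A positive index means the random elements are more surprising than the regular ones, which is the sense in which panel C tracks structure detection across pretraining checkpoints.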

Fig 3.

Effect of the pretraining dataset on the emergence of structure detection.

We plot the structure-detection ability of models trained on speech (A), environmental sounds (B) or music (C) as a function of the amount of pretraining. Modeled studies: [8,14,21,23].

Fig 4.

Decoding regular vs non-regular for each model layer.

Left: decoding accuracy of deviant versus original tones. The classifier is trained on all types of sequences, and evaluation is performed separately for each type of sequence. The dataset is composed of 1000 sequences, with the 500 test sequences being the exact complement of the training sequences (inversion of the role of each tone). Deviant tones are detected in earlier layers for simpler sequences. Right: earliest layer at which the decoding accuracy is maximal, for each type of sequence.

Table 1.

Comparison of audio datasets in terms of duration and number of files.
