Brain-inspired model for early vocal learning and correspondence matching using free-energy optimization

doi:10.1371/journal.pcbi.1008566

Fig 1.

Framework of the INFERNO architecture for retrieving audio primitives through iterative optimization in the cortico-basal ganglia loop (CX-BG).

The Primary Auditory Cortex (PAC) first receives and categorizes the audio vectors. The Superior Temporal Gyrus (STG) integrates the PAC outputs over time, and these are eventually categorized by the Striatum (STR) in the basal ganglia. The Globus Pallidus (GP) searches for and retrieves the audio vectors that best match the STG dynamics recognized by the striatal units. The iterative optimization process is carried out by minimizing noise with a temporal-difference reinforcement signal.


Fig 2.

Stochastic gradient descent optimization used to control the neural dynamics.

Free-energy (noise) is injected as Input into the network. After a period of time, the Output vector is read to recognize the state, and its value is compared to a goal vector. If the variational error E is decreasing, the stochastic gradient descent keeps the current Input. After several cycles, the Input converges to the optimal values that minimize the error and maximize state recognition.
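The loop described in this caption can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple accept-if-better rule, in which Gaussian noise perturbs the Input and the perturbed Input is kept only when the error E against the goal vector decreases. The names `noise_driven_search`, `error`, and `toy_network`, and the parameters `sigma` and `n_cycles`, are hypothetical, and the linear `toy_network` merely stands in for the neural dynamics.

```python
import random


def error(output, goal):
    """Euclidean error E between the Output vector and the goal vector."""
    return sum((o - g) ** 2 for o, g in zip(output, goal)) ** 0.5


def toy_network(x):
    """Hypothetical stand-in for the network: a fixed linear map (assumption)."""
    return [2.0 * x[0] + x[1], x[1] - x[2], 0.5 * x[2]]


def noise_driven_search(goal, network, n_cycles=2000, sigma=0.1, seed=0):
    """Perturb the Input with noise; keep the perturbation only if E decreases."""
    rng = random.Random(seed)
    x = [0.0] * len(goal)                 # current Input vector
    best_err = error(network(x), goal)
    history = [best_err]                  # error kept after each cycle
    for _ in range(n_cycles):
        # Inject noise into the current Input.
        candidate = [xi + rng.gauss(0.0, sigma) for xi in x]
        err = error(network(candidate), goal)
        if err < best_err:                # error is decreasing: keep this Input
            x, best_err = candidate, err
        history.append(best_err)
    return x, history


goal = [1.0, -0.5, 0.25]
best_input, history = noise_driven_search(goal, toy_network)
print(f"initial error {history[0]:.3f}, final error {history[-1]:.3f}")
```

Because a perturbation is kept only when it lowers the error, the recorded error never increases from one cycle to the next, which is the convergence behaviour the caption describes.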


Fig 3.

Rank-Order Coding principle [68].

This type of neuron encodes the rank code of an input signal. The amplitudes of the signal are translated into an ordered sequence, and the neuron's synaptic weights are associated with this sequence. In our example, the neuron responds most strongly to this particular order, as shown by the line widths of the synaptic weights.
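A small sketch may help make the principle concrete. It assumes one common formulation of rank-order coding, in which each input component contributes the inverse of its amplitude rank and the neuron's weights store the inverse ranks of a preferred pattern; the exact formulation in [68] may differ. The function names and the example vectors are hypothetical.

```python
def ranks(signal):
    """Rank of each component: 1 for the largest amplitude, 2 for the next, etc."""
    order = sorted(range(len(signal)), key=lambda i: -signal[i])
    result = [0] * len(signal)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def rank_code(signal):
    """Translate amplitudes into an order-based code (inverse rank, an assumption)."""
    return [1.0 / r for r in ranks(signal)]


def neuron_response(signal, weights):
    """Weighted sum of the rank code; largest when the input order matches the weights."""
    return sum(c * w for c, w in zip(rank_code(signal), weights))


template = [0.9, 0.1, 0.5, 0.3]        # pattern the neuron is tuned to
weights = rank_code(template)          # weights associated with the template's order

same_order = [9.0, 1.0, 5.0, 3.0]      # different amplitudes, same order
reversed_order = [0.1, 0.9, 0.3, 0.5]  # opposite order

print(neuron_response(same_order, weights))
print(neuron_response(reversed_order, weights))
```

Only the order of the amplitudes matters in this sketch: rescaling the input leaves the response unchanged, while shuffling the order lowers it.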


Fig 4.

Dynamics of different structures during and after the learning stage.

In A and B, a waveform sample that the PAC layer categorizes, in the form of MFCC vectors, into a higher-level representation. In C, this information is passed to the STG layer, which integrates the incoming information over time (20 iterations). In D, the evolution of the neural activity of one STR unit at different learning stages. In E, the final layer, the STR, categorizes the filtered information a second time in a larger neural population.


Fig 5.

Free-energy optimization.

A-C, error minimization of three Striatal units (top chart) using noise to retrieve GP vectors (retrieved MFCC vectors) for which the Striatal units fire maximally (middle chart). The STG units display different spike trains for which a solution is found (bottom charts). The dashed lines correspond to a reset of the GP dynamics (reset of the optimal MFCC vector) in order to show that the minimization process is always present and that different solutions can be retrieved dynamically.


Fig 6.

Reconstruction analysis after free-energy optimization.

In a), the probability density distribution of the Striatal units with respect to their prediction error level. In b), the probability density distribution of the reconstruction error of MFCC vectors by the GP layer. For most of the neurons within the STR layer, the optimization process makes it possible to construct MFCC vectors close to the real ones from the audio database. The reconstruction error is distributed around a mean of 0.05 with a standard deviation of 0.05.


Fig 7.

Performance analysis after several exposures and reconstruction analysis of the audio signals.

In a), Euclidean distance between the MFCCs retrieved and those from the audio database. In b), identity mismatch between the predicted MFCCs index and the correct one for the whole audio sequence. In c), waveform reconstruction for the four learning periods.
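The two measures in panels a) and b) can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the predicted index is assumed to be the index of the nearest database vector, and the function names and the toy two-dimensional vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two MFCC vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_index(vector, database):
    """Index of the database vector closest to `vector` (assumed prediction rule)."""
    return min(range(len(database)), key=lambda i: euclidean(vector, database[i]))


def evaluate(retrieved, database, true_indices):
    """Per-frame distance to the correct vector and overall index mismatch rate."""
    distances = [euclidean(v, database[t]) for v, t in zip(retrieved, true_indices)]
    predicted = [nearest_index(v, database) for v in retrieved]
    mismatches = sum(p != t for p, t in zip(predicted, true_indices))
    return distances, mismatches / len(true_indices)


# Toy data standing in for MFCC vectors (hypothetical values).
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
retrieved = [[0.1, 0.0], [0.9, 0.1], [0.6, 0.1]]
true_indices = [0, 1, 2]

distances, mismatch_rate = evaluate(retrieved, database, true_indices)
print(distances, mismatch_rate)
```

In this toy case the third retrieved vector lies closer to database entry 1 than to its correct entry 2, so one of the three frames counts as an identity mismatch.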


Fig 8.

Self-supervised vs. forced learning.

We compare the two learning strategies, shown in A and B respectively, in terms of convergence and dynamics. The self-supervised strategy might correspond to a babbling stage in which each audio unit is selected and tested at random at each cycle. The forcing strategy, by contrast, makes it possible to control the learning of each unit separately until convergence. In the supervised case (forced STR activity in B), the error for one specific STR unit is high at the beginning and then diminishes iteratively over time. Each STR unit is selected in turn until its error diminishes to a certain threshold level during a limited amount of time; then the next neuron is selected to optimize the GP vector that optimally triggers the STG categories and the STR units. In the unsupervised case (unsupervised motor babbling in A), a different STR unit is selected at each iteration because of internal noise, so such a gradual decrease of the error is not clearly visible for each unit.


Fig 9.

Reconstructed waveform and MFCC comparison.

In A, the original waveform is in blue and the reconstructed one is in red. In B, the reconstructed MFCC raster plot. In C, the raster plot of the MFCC error between the original sequence and the retrieved one.


Fig 10.

Analysis of STR reconstruction and MFCC mapping during acoustic matching with different speakers.

In A, the correspondence matrix between STR units X and MFCC vectors A within the audio database of unheard voices. In B, the Euclidean distance between the MFCC vectors of the predicted STR units X and the ground truth MFCC vectors A within the audio database. In C, the correspondence matrix between the ground truth MFCC vectors A and the nearest ones B from the reconstructed vectors X selected in STR, based on the correspondence matrix in A; plotted for the first 10,000 MFCC vectors. In D, a zoom into the correspondence matrix for 5,000 units within the interval [120,000; 125,000]. The diagonal indicates a good match between what the INFERNO network perceives and what it can pronounce, even for MFCC samples that were not heard during the learning stage. In E, the ABX distance histogram proposed by [76, 77], computed from the Euclidean distance between the A and B vectors retrieved previously. In F, an example of a waveform retrieved after the learning stage from an unheard sound sequence.
