Fig 1.
The aim of spatial hearing is to estimate the position of a sound source using acoustic cues. The sound localiser needs to be learned and continuously re-calibrated over the individual’s lifetime. (A) Different learning paradigms and the overall model of the Agent for spatial hearing. The acoustic environment is simulated using pre-recorded head-related transfer functions (HRTFs). A sound stimulus is converted into a cochleagram (cochlear responses in each frequency band), which is fed to a small innate “Teacher” circuit and a bigger plastic “Student”. The Student is implemented as a deep neural network and trained with coarse-grained feedback from the noisy Teacher, a process we refer to as “bootstrapping”. (B) One of the Teacher circuits we use is an abstract model of the lateral superior olive (LSO), which receives excitatory input from one side and inhibitory input from the other. (C) Typical response tuning curve of an LSO unit to different sound locations, showing its basic spatial hearing function as a left/right classifier; it is not accurate enough to serve as the sound localiser, which should predict the exact angular value y. Abbreviations: LF–Low Frequency. HF–High Frequency. ILD–Interaural Level Difference. DNN–Deep Neural Network.
Fig 2.
Bootstrapping spatial hearing from an innate circuit.
(A) Interactive learning procedure with the left/right Teacher circuit. The agent makes an initial prediction of the sound location with its learned Student network, turns its head towards the predicted location, and uses the coarse-grained feedback (left/right) from the Teacher circuit to update the learned localiser based on whether it undershot or overshot the target. (B) A Teacher circuit implementation using a single lateral superior olive (LSO) neuron as the left/right discriminator. (C) Another Teacher circuit using a population ensemble of LSO neurons. (D) Mean normalized firing rates of the two Teachers, a single LSO neuron (yellow) and an LSO neural population (blue), as functions of sound source angle, with variance indicated by vertical bars. Inset shows responses near the midline (0°), where the LSO neural population Teacher exhibits a slight leftward bias in the 0.5 firing rate crossing point, while the single LSO neuron Teacher shows a minimal rightward bias. (E) Response variance across sound positions. Both Teachers approach the theoretical maximum Bernoulli variance (0.25) near their respective midline positions and reach minimal variance at lateral positions, indicating increased uncertainty for left/right discrimination at positions approaching the midline. The LSO neural population Teacher shows narrower variance peaks and lower variance magnitude than the single LSO neuron, with the peak variance position reflecting the same directional biases observed in the mean responses. (F) Learning trajectories for Student networks trained with each Teacher type (mean shown as solid line, individual repeated experiments as shaded lines). All achieve mean absolute errors (MAE) below 5° after training, with the LSO neural population Teacher enabling faster initial convergence but slightly higher final error. (G) Spatial distribution of localization errors after training (solid lines show mean across runs, shaded regions show ±1 SD). Both Teacher types enable precise mapping across all positions, with lowest error near the midline and increased errors at ±90°. Students that learn from the LSO neural population Teacher show marginally higher average error (dashed horizontal lines), consistent with the Teacher’s inherent directional bias.
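The interactive procedure in panel (A) and the variance profile in panel (E) can be illustrated with a toy simulation. The sketch below is not the paper's implementation. It makes several assumptions: the Student is a hypothetical lookup table (one estimate per source position) rather than a deep neural network, the Teacher reports "right" with a logistic probability of the residual angle, positive angles mean "right", and `width_deg`, `step_deg`, and the trial count are illustrative values.

```python
import math
import random

def teacher_p_right(residual_deg, width_deg=5.0):
    """Assumed Teacher: probability of reporting 'sound is to the right'
    after the head turn, as a logistic function of the residual angle."""
    return 1.0 / (1.0 + math.exp(-residual_deg / width_deg))

def bernoulli_variance(p):
    """Variance of a binary (left/right) response with P(right) = p."""
    return p * (1.0 - p)

def bootstrap(n_trials=20000, step_deg=1.0, seed=0):
    rng = random.Random(seed)
    targets = list(range(-90, 91, 10))      # true source angles in degrees
    student = {y: 0.0 for y in targets}     # stand-in Student: one estimate per source
    for _ in range(n_trials):
        y = rng.choice(targets)
        residual = y - student[y]           # source angle relative to the head after turning
        said_right = rng.random() < teacher_p_right(residual)
        # 'right' means the turn undershot, so raise the estimate;
        # 'left' means it overshot, so lower the estimate.
        student[y] += step_deg if said_right else -step_deg
    mae = sum(abs(y - student[y]) for y in targets) / len(targets)
    return student, mae

student, mae = bootstrap()
print(f"mean absolute error after training: {mae:.2f} degrees")
```

Because the assumed Teacher answers at chance when the residual is zero, its response variance peaks at 0.25 there and falls towards zero for large residuals, which mirrors the shape described for panel (E).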
Fig 3.
Student error follows Teacher bias.
Top plot: tuning curves of 32 lateral superior olive (LSO) neurons for different sound source locations, color-coded by their estimated bias (code shown in the vertical line segments below). While most neurons exhibit small biases near the true midline, several show larger deviations, demonstrating natural variability across Teachers with the same innate circuit connectivity. Bottom plot: the estimated Teacher bias matches the trained DNN Student test error (measured here as signed error, positive for a clockwise difference in direction, negative for counter-clockwise). Four out of 96 cases failed to converge within the allowed training period because extremely noisy Teacher signals slow down Student learning. Right-hand plots: learned maps for four representative cases (a-d). Each point on the inner circle represents the true angle on the horizontal plane, while the corresponding point on the outer circle shows the angle predicted by the Student model. Connecting lines are color-coded by true angle. Case b (LSO neuron no.24 in [31]) shows the ideal radius-aligned lines, indicating accurate learning. Cases a and c (neurons no.1 and no.32) demonstrate systematic clockwise and counter-clockwise bias respectively, illustrating shifted Student maps bootstrapped from biased Teachers, similar to shifted auditory maps learned with visual prism adaptation. Case d is one of the four unconverged cases, caused by high variance in the LSO Teacher signal.
Fig 4.
Effects of cue disruptions and re-calibration mechanisms.
Response of the Teacher circuit (top row) and DNN Student (middle row without bootstrapping; bottom row with bootstrapping). The left column shows the original ILD response curve of the Teacher (top) and the good performance of the Student (bottom) before any acoustic cues are disrupted. The three right columns show the effects of three different types of disruption to the acoustic cues. A symmetrical shift (first disruption column, symmetrical bilateral hearing loss) leaves the ILD sensitivity curves (top row) of the Teacher unchanged. A symmetrical scaling (second disruption column, symmetrical bilateral auditory compression disruption) stretches the response curve along the ILD axis but does not change the bias (preference for left/right). An asymmetrical scaling (third disruption column, asymmetrical unilateral hearing loss) changes the bias of the LSO curve, although this can be restored with two labeled data points (green curve). In contrast, the DNN Student is much more sensitive to any disruption of the acoustic cues. The Student prediction is initially disrupted, with high errors (middle row), but after recalibration (relearning using the Teacher) good performance is restored for the symmetrical disruptions, which do not change the bias. In the case of the asymmetrical disruption (third disruption column), performance is restored after recalibration of the Teacher (green curve, bottom row).
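The different effects of the three disruptions on the Teacher follow from simple arithmetic on the ILD. The sketch below is a minimal illustration under stated assumptions: sound levels are expressed in dB, the ILD is taken as right minus left, and the function names, the 20 dB loss, and the 0.5 scaling factor are hypothetical choices made for this example, not values from the paper.

```python
def ild_db(left_db, right_db):
    """Interaural level difference, assumed here to be right minus left (dB)."""
    return right_db - left_db

def symmetric_shift(left_db, right_db, loss_db=20.0):
    """Same level loss in both ears (symmetrical bilateral hearing loss)."""
    return left_db - loss_db, right_db - loss_db

def symmetric_scaling(left_db, right_db, factor=0.5):
    """Same level compression in both ears."""
    return left_db * factor, right_db * factor

def asymmetric_scaling(left_db, right_db, factor=0.5):
    """Compression in the right ear only (unilateral hearing loss)."""
    return left_db, right_db * factor

midline_source = (60.0, 60.0)   # equal level at both ears
right_source = (60.0, 66.0)     # 6 dB louder at the right ear

for disrupt in (symmetric_shift, symmetric_scaling, asymmetric_scaling):
    at_midline = ild_db(*disrupt(*midline_source))
    at_right = ild_db(*disrupt(*right_source))
    print(f"{disrupt.__name__}: midline ILD = {at_midline} dB, right-source ILD = {at_right} dB")
```

Under these assumptions the midline ILD stays at 0 dB for both symmetrical disruptions, so the left/right decision point is preserved, while the asymmetrical disruption moves the midline ILD away from 0 dB, which corresponds to the bias change described in the caption.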
Fig 5.
Innate circuit detecting midline alignment as an intrinsic reward.
(A) An interactive procedure that uses the innate Teacher circuit to detect midline alignment, which serves as the intrinsic reward signal for reinforcement learning without any external labeling. (B) The Teacher circuit implementation, where the left LSO output and the right LSO output are combined. Circuits with similar connectivity and tuning curves have been found in the inferior colliculus (IC). (C) Sampled tuning curve of the Teacher circuit, showing its basic function as a midline detector that fires when the agent faces the sound. In the reinforcement learning procedure, output spiking means positive reward and no spiking means no reward, offering an alternative model of the auditory orienting response (AOR). (D) Test errors of the Student after training, in the frontal semicircle.
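One way to picture how combining the two LSO outputs in panel (B) can yield the midline tuning in panel (C) is sketched below. This is an assumption-laden illustration, not the circuit used in the paper: each LSO unit is modelled as a logistic function of the ILD that stays active slightly past the midline, the two outputs are combined by multiplication, and `gain` and `threshold_db` are hypothetical parameters.

```python
import math
import random

def lso(ild_db, gain=1.0, threshold_db=3.0):
    """Assumed LSO unit: logistic in ILD, still active slightly past the midline."""
    return 1.0 / (1.0 + math.exp(-gain * (ild_db + threshold_db)))

def midline_response(ild_db):
    """Combine a right-preferring and a left-preferring unit (product is an assumption)."""
    right_lso = lso(ild_db)
    left_lso = lso(-ild_db)
    return left_lso * right_lso   # high only where both tuning curves overlap

def intrinsic_reward(ild_db, rng):
    """Spike (reward 1) with probability given by the midline response, else 0."""
    return 1 if rng.random() < midline_response(ild_db) else 0

rng = random.Random(0)
for ild in (-20.0, -5.0, 0.0, 5.0, 20.0):
    rewards = sum(intrinsic_reward(ild, rng) for _ in range(1000))
    print(f"ILD {ild:+.0f} dB: response {midline_response(ild):.3f}, rewards in 1000 trials: {rewards}")
```

The product peaks at an ILD of 0 dB and falls off symmetrically on both sides, so a reinforcement learner that maximises this reward is pushed towards head orientations that face the sound.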
Fig 6.
Bootstrapping procedures for different learning contexts.
Scenario 1: bootstrapping with a limited head rotation range. (A) The agent estimates sound locations within the rotation range using an innate Teacher, and estimates sound locations beyond the rotation range using a learned localiser as Teacher. Gradient estimation with the innate Teacher can be done by any method, e.g. surrogate gradient or reinforcement learning. (B) Expanding the already learned localisation range (green area) into the “inferrable area” (blue) by combining head rotation with bootstrapping on the existing localisation range. This constructs an array of Teachers and Students during the process, replacing the old Teacher with the learned localiser after each expansion. (C) The process by which the learned part of the localisation range expands to fill the whole space during training. Scenario 2: using a unilateral rather than bilateral Teacher circuit. (D) The bootstrapping process, showing how a gradient can be estimated with a unilateral circuit by pointing the right ear toward the sound source to find the loudest response. (E) The unilateral circuit with an abstract memory unit. (F) 2D response map of the right ear, showing the peak response when the right ear is pointed toward the sound. Scenario 3: extending from a circular localiser to a spherical localiser. (G) The process by which the circular procedures can be extended to spherical procedures by composing them for azimuth and then elevation. (H) Illustration of azimuth and elevation.
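The unilateral procedure in Scenario 2 can be read as a hill climb in which the memory unit stores the last level heard. The sketch below illustrates that reading under explicit assumptions: the right-ear response is a hypothetical cosine directivity (not the measured response of any real head), `ear_deg` is the direction in which the right ear points, and the step sizes are illustrative.

```python
import math

def right_ear_level(ear_deg, source_deg):
    """Hypothetical directivity: loudest when the right ear points at the source."""
    return math.cos(math.radians(ear_deg - source_deg))

def find_loudest(source_deg, start_deg=0.0, step_deg=20.0, min_step_deg=0.1):
    """Rotate until the right ear points at the source, using only level comparisons."""
    ear = start_deg
    memory = right_ear_level(ear, source_deg)    # memory unit: last accepted level
    direction = 1.0
    step = step_deg
    while step >= min_step_deg:
        candidate = ear + direction * step
        level = right_ear_level(candidate, source_deg)
        if level > memory:
            ear, memory = candidate, level       # louder: keep rotating this way
        else:
            direction = -direction               # quieter: turn back
            step /= 2.0                          # and search with a finer step
    return ear

for source in (50.0, -120.0):
    print(f"source at {source} degrees, right ear settles at {find_loudest(source):.2f} degrees")
```

Only the sign of the change in loudness is used at each step, which is the kind of coarse signal a single ear with a memory unit could provide.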
Fig 7.
Spherical localiser calibration with a limited head rotation range and a unilateral Teacher circuit.
(A) Evolution of the error in the learned localiser during training. Initially, only a small region around the peak response can be accurately estimated (blue region), but over training time the area of accuracy expands to cover the entire space. (B) The approximated 3D sound response of the right ear of the KEMAR manikin, showing the peak response at elevation=, azimuth= on the spherical cap.
Fig 8.
Upper rows: models from this paper. Bottom row: model from [14]. Left column (abstract circuit): schematic representation of the circuit implementing the model. Second column (response function): tuning of the Teacher circuit response, showing the different basic functions. Third column (example procedure): interactive learning procedures using the Teacher circuits. Right column (requirements): environmental conditions that must be met to enable the Teacher and the procedure; here the main requirement is how many times the model needs to listen to the sound source.
Fig 9.
A sound S at location y is first filtered via a pair of head-related transfer functions (HRTFs) to give the sounds Lt and Rt received by the two ears. Within each ear, a bandpass filter bank following the equivalent rectangular bandwidth (ERB) formula is applied to generate the cochleagrams Lfb and Rfb. The sound level for each frequency band is computed with the function fi. In the general scenario, a sound level could be converted into a model of the auditory nerve fiber (ANF) response magnitude via a nonlinear function, although in this paper we do not need a detailed model of the ANF response and therefore simply use the identity function. Finally, the ANF responses are fed to the various LSO and DNN circuits used.
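The filter bank and level stages of this pipeline can be sketched as follows. The caption does not say which ERB formula is used, so the code assumes the widely used Glasberg and Moore form, ERB(f) = 24.7 (4.37 f / 1000 + 1) Hz, with centre frequencies spaced evenly on the matching ERB-number scale. The frequency range, the number of bands, and the RMS-in-dB level function standing in for fi are likewise assumptions made for illustration.

```python
import math

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore form, assumed)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def hz_to_erb_number(f_hz):
    """Position of a frequency on the ERB-number scale."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_number_to_hz(erb_number):
    """Inverse of hz_to_erb_number."""
    return (10.0 ** (erb_number / 21.4) - 1.0) * 1000.0 / 4.37

def centre_frequencies(f_lo_hz, f_hi_hz, n_bands):
    """Centre frequencies spaced evenly on the ERB-number scale."""
    e_lo, e_hi = hz_to_erb_number(f_lo_hz), hz_to_erb_number(f_hi_hz)
    step = (e_hi - e_lo) / (n_bands - 1)
    return [erb_number_to_hz(e_lo + i * step) for i in range(n_bands)]

def band_level_db(samples, reference=1.0):
    """Stand-in for the level function: RMS of one band's samples, in dB."""
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12) / reference)

def anf_response(level_db):
    """Identity mapping from level to ANF response, as stated in the caption."""
    return level_db

cfs = centre_frequencies(100.0, 8000.0, 32)   # illustrative range and band count
print([round(f) for f in cfs[:4]], "...", round(cfs[-1]))
print(round(erb_bandwidth(1000.0), 1), "Hz bandwidth at 1 kHz")
```

Spacing the bands evenly on the ERB-number scale gives narrow, closely packed bands at low frequencies and wider bands at high frequencies, which is the usual motivation for this kind of filter bank.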