Estimating speaker direction on a humanoid robot with binaural acoustic signals

doi:10.1371/journal.pone.0296452

Fig 1.

Example of two simulated microphone signals.

Visualization of the signals y₁ (blue) and y₂ (orange).

Fig 2.

Plot of the cross correlation of y₁ and y₂ for positive lags.
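
As a rough illustration of how a cross-correlation peak reveals the inter-microphone delay, the sketch below estimates the lag between two signals in the time domain. It assumes NumPy; the function name `estimate_lag` and the synthetic white-noise signals are hypothetical and not taken from the paper.

```python
import numpy as np

def estimate_lag(y1, y2):
    """Return the lag (in samples) by which y2 trails y1 (hypothetical helper)."""
    # Full cross-correlation: c[k] = sum_n y2[n + k] * y1[n].
    xcorr = np.correlate(y2, y1, mode="full")
    # Lags covered by mode="full" run from -(len(y1) - 1) to len(y2) - 1.
    lags = np.arange(-(len(y1) - 1), len(y2))
    return int(lags[np.argmax(xcorr)])
```

If `y2` is a copy of `y1` delayed by some number of samples, the returned lag should equal that delay; a negative value would mean `y2` leads `y1`.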

Fig 3.

Visualization of the signals y₁ (blue) and y₂ (orange) after applying a delay of 20 samples to y₁.

Fig 4.

Cross correlation with different estimators.

The outputs of three cross-correlation estimators, GCC-PHAT (blue), frequency domain (orange), and time domain (green), are plotted for a 350 ms frame of speech.
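
For readers unfamiliar with GCC-PHAT, the following is a minimal sketch of the general technique, assuming NumPy. The function name `gcc_phat`, the `max_lag` parameter, and the small regularization constant are illustrative assumptions; the paper's actual implementation may differ.

```python
import numpy as np

def gcc_phat(y1, y2, max_lag):
    """Generalized cross-correlation with phase transform (illustrative sketch).

    Returns (lags, cc) for lags in [-max_lag, max_lag]; max_lag must be >= 1.
    A peak at a positive lag means y2 trails y1.
    """
    n = len(y1) + len(y2)              # zero-pad to avoid circular wrap-around
    Y1 = np.fft.rfft(y1, n=n)
    Y2 = np.fft.rfft(y2, n=n)
    cross = Y2 * np.conj(Y1)           # cross-power spectrum
    cross /= np.abs(cross) + 1e-12     # PHAT weighting: discard magnitude, keep phase
    cc = np.fft.irfft(cross, n=n)
    # Negative lags sit at the end of the inverse FFT output; move them to the front.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lags = np.arange(-max_lag, max_lag + 1)
    return lags, cc
```

The delay estimate is then `lags[np.argmax(cc)]`. Omitting the PHAT weighting line would give an ordinary frequency-domain cross-correlation instead.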

Fig 5.

Geometry used to estimate DOA from two microphones (M1 and M2) that are separated by a distance D.
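
A common way to turn a time delay into a direction of arrival for a two-microphone geometry like this is the far-field relation θ = arcsin(c·τ / D). The sketch below assumes that relation, an angle measured from the broadside direction, and a speed of sound of 343 m/s; these conventions and the function name `doa_from_lag` are assumptions, not details confirmed by the caption.

```python
import numpy as np

def doa_from_lag(lag_samples, fs, mic_distance, speed_of_sound=343.0):
    """Convert a delay in samples to a DOA in degrees (far-field assumption).

    fs is the sampling rate in Hz and mic_distance is D in metres.
    """
    tau = lag_samples / fs                        # delay in seconds
    ratio = speed_of_sound * tau / mic_distance   # sine of the arrival angle
    ratio = np.clip(ratio, -1.0, 1.0)             # guard against noisy, out-of-range delays
    return float(np.degrees(np.arcsin(ratio)))
```

Under these assumptions, a zero delay maps to 0° (source straight ahead), and the largest physically possible delay, D / c, maps to ±90°.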

Table 1.

Overview of the conditions recorded.

Fig 6.

Spectrogram (in dB) of an audio sample containing human speech and REEM-C motions.

Three regions with energy that correspond to the sound from the REEM-C arm motors are highlighted.

Table 2.

Parameter spaces defined for both methods.

(min, max) = uniform distribution; (mean, std) = normal distribution.

Fig 7.

Brute force DOA performance against frame size (s) and step size (%).

The small white region corresponds to parameter combinations that resulted in no windows of speech being detected.

Fig 8.

Brute force classification performance against low and high thresholds.

The white region corresponds to parameter combinations that resulted in no windows of speech being detected.

Fig 9.

TPE DOA performance against frame size (s) and step size (%).

The white region corresponds to parameter combinations that were not tested by the TPE method.

Fig 10.

TPE classification performance visualized against frame size (s) and step size (%).

The white region corresponds to parameter combinations that were not tested by the TPE method.

Fig 11.

Joint objective loss vs. step size and frame size.

The white region corresponds to parameter combinations that were not tested by the TPE method.

Fig 12.

Joint regularized objective loss vs. step size and frame size.

The white region corresponds to parameter combinations that were not tested by the TPE method.

Fig 13.

Frame size vs. average joint objective loss.

Fig 14.

Frame size vs. average joint regularized objective loss.

Table 3.

Best parameters and results for each task across optimization methods.

Table 4.

Test set performance.

Fig 15.

Test 6 DOA results.

The green regions indicate intervals where speech was present. The blue dots indicate the DOA estimated using the best study parameters. The actual DOA of the speaker is indicated by the dotted line.

Fig 16.

Test 28 DOA results.

The green regions indicate intervals where speech was present. The blue dots indicate the DOA estimated using the best study parameters. The actual DOA of the speaker is indicated by the dotted line.

Fig 17.

Test 36 DOA results.

The green regions indicate intervals where speech was present. The blue dots indicate the DOA estimated using the best study parameters. The actual DOA of the speaker is indicated by the dotted line.

Table 5.

Comparative results for voice and timing methods.

Fig 18.

Test 29 DOA results.

Estimates generated with GCC-PHAT (orange) and GCC-SCOT (blue) are compared. The green regions indicate intervals where speech was present. The actual DOA of the speaker is indicated by the dotted line.
