Fig 1.
Example of two simulated microphone signals.
Visualization of the signals y1 (blue) and y2 (orange).
Fig 2.
Plot of the cross correlation of y1 and y2 for positive lags.
Fig 3.
Visualization of the signals y1 (blue) and y2 (orange) after applying a delay of 20 samples to y1.
Fig 4.
Cross correlation with different estimators.
The outputs of three cross correlation estimators, GCC-PHAT (blue), frequency domain (orange), and time domain (green), are plotted for a 350 ms frame of speech.
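To make the GCC-PHAT estimator named in this caption concrete, the following is a minimal sketch and not the implementation used in the study. The function name `gcc_phat`, the `max_lag` parameter, the zero-padding length, and the sign convention (a positive lag means `sig` is delayed relative to `ref`) are all assumptions chosen for illustration.

```python
import numpy as np

def gcc_phat(sig, ref, max_lag=None):
    """Estimate the delay (in samples) of `sig` relative to `ref` with GCC-PHAT.

    Hypothetical helper: names and conventions are illustrative assumptions.
    Returns (lag, cc), where cc holds the correlation for lags -max_lag..max_lag.
    """
    sig = np.asarray(sig, dtype=float)
    ref = np.asarray(ref, dtype=float)

    # Zero-pad so the circular correlation behaves like a linear one.
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)

    # Cross-power spectrum, whitened by its magnitude (the PHAT weighting).
    cross = SIG * np.conj(REF)
    cross /= np.maximum(np.abs(cross), 1e-12)  # guard against division by zero

    cc = np.fft.irfft(cross, n=n)

    if max_lag is None:
        max_lag = n // 2
    max_lag = min(max_lag, n // 2)

    # Reorder so index 0 corresponds to lag -max_lag.
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    lag = int(np.argmax(np.abs(cc))) - max_lag
    return lag, cc
```

Dropping the whitening step gives the plain frequency-domain cross correlation, which is one way to see how the PHAT weighting sharpens the peak.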
Fig 5.
Geometry used to estimate DOA from two microphones (M1 and M2) separated by a distance D.
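As an illustration of how a two-microphone geometry like this can turn a time delay into an angle, the sketch below uses the common far-field relation sin(theta) = c * tau / D. This relation, the angle convention (0 degrees at broadside), the speed of sound `c`, and the function name `doa_from_lag` are assumptions and are not taken from the figure itself.

```python
import numpy as np

def doa_from_lag(lag_samples, fs, mic_distance, c=343.0):
    """Convert an inter-microphone lag into a DOA angle in degrees.

    Hypothetical helper assuming a far-field source:
        sin(theta) = c * tau / D
    where tau = lag_samples / fs and D = mic_distance (metres).
    """
    tau = lag_samples / fs                # delay in seconds
    ratio = c * tau / mic_distance        # should lie in [-1, 1]
    ratio = np.clip(ratio, -1.0, 1.0)     # guard against noisy lags
    return float(np.degrees(np.arcsin(ratio)))
```

For example, `doa_from_lag(4, 16000, 0.17, c=340.0)` corresponds to a ratio of 0.5 under these assumptions. Lags larger than D * fs / c are physically impossible for the assumed geometry and are clipped to plus or minus 90 degrees.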
Table 1.
Overview of the conditions recorded.
Fig 6.
Spectrogram (in dB) of an audio sample containing human speech and REEM-C motions.
Three regions with energy that correspond to the sound from the REEM-C arm motors are highlighted.
Table 2.
Parameter spaces defined for both methods.
(min, max) = uniform distribution.
(mean, std) = normal distribution.
Fig 7.
Brute force DOA performance against frame size (s) and step size (%).
The small white region corresponds to parameter combinations that resulted in no windows of speech being detected.
Fig 8.
Brute force classification performance against low and high thresholds.
The white region corresponds to parameter combinations that resulted in no windows of speech being detected.
Fig 9.
TPE DOA performance against frame size (s) and step size (%).
The white region corresponds to parameter combinations that were not tested by the TPE method.
Fig 10.
TPE classification performance visualized against frame size (s) and step size (%).
The white region corresponds to parameter combinations that were not tested by the TPE method.
Fig 11.
Joint objective loss vs. step size and frame size.
The white region corresponds to parameter combinations that were not tested by the TPE method.
Fig 12.
Joint regularized objective loss vs. step size and frame size.
The white region corresponds to parameter combinations that were not tested by the TPE method.
Fig 13.
Frame size vs. average joint objective loss.
Fig 14.
Frame size vs. average joint regularized objective loss.
Table 3.
Best parameters and results for each task across optimization methods.
Table 4.
Test set performance.
Fig 15.
The green regions indicate intervals where speech was present. The blue dots indicate the estimated DOA using best study parameters. The actual DOA of the speaker is indicated by the dotted line.
Fig 16.
The green regions indicate intervals where speech was present. The blue dots indicate the estimated DOA using best study parameters. The actual DOA of the speaker is indicated by the dotted line.
Fig 17.
The green regions indicate intervals where speech was present. The blue dots indicate the estimated DOA using best study parameters. The actual DOA of the speaker is indicated by the dotted line.
Table 5.
Comparative results for voice and timing methods.
Fig 18.
Comparison of DOA estimates generated with GCC-PHAT (orange) and GCC-SCOT (blue). The green regions indicate intervals where speech was present. The actual DOA of the speaker is indicated by the dotted line.
Fig 19.
Visualization of 5 dataset folds used for cross validation.
Indices shaded white are used for training, and indices shaded black are used for testing.
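A fold layout of this kind can be sketched as a boolean mask per fold, where `True` marks a test index (black in the figure) and `False` marks a training index (white). The contiguous, unshuffled split and the name `kfold_masks` are assumptions; the study may have partitioned the data differently.

```python
import numpy as np

def kfold_masks(n_samples, n_folds=5):
    """Return an (n_folds, n_samples) boolean array; True marks test indices.

    Hypothetical helper: assumes contiguous, unshuffled folds.
    """
    folds = np.array_split(np.arange(n_samples), n_folds)
    masks = np.zeros((n_folds, n_samples), dtype=bool)
    for k, test_idx in enumerate(folds):
        masks[k, test_idx] = True  # every other index in row k is training
    return masks
```

Each sample index appears in the test set of exactly one fold, so every column of the returned array contains a single `True`.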
Table 6.
Cross validation results.
Table 7.
Best overall parameters for both methods.
Fig 20.
Microphone setup on REEM-C.
Fig 21.
Average latencies of the DOA estimate for three different frame lengths.
The error bars indicate one standard error.