Estimating speaker direction on a humanoid robot with binaural acoustic signals

To achieve human-like behaviour during speech interactions, a humanoid robot must estimate the location of a human talker. Here, we present a method to optimize the parameters used for direction of arrival (DOA) estimation, while also considering real-time applications for human-robot interaction scenarios. This method is applied to a binaural sound source localization framework on a humanoid robotic head. Real data were collected and annotated for this work. Optimizations are performed via a brute force method and a Bayesian model-based method; the results are validated and discussed, and the effects on latency for real-time use are also explored.


Introduction
Speech is one of the most important forms of human communication and a key element of social interaction. Thus, to better integrate humanoid robots into society and augment human-robot interaction, it is important for them to achieve speech interactions that are similar to human-human interactions. Speech interaction is a complex phenomenon that includes both verbal and non-verbal behaviour. One aspect of this non-verbal behaviour is how talkers and listeners orient their head and body relative to their conversational partner.
In the present study, we focus on the sub-task of identifying the direction of arrival (DOA) of human speech. This information is necessary for humanoid robots to interact with humans in realistic and natural ways, such as orienting to and tracking human conversational partners (who may move during the conversation), or handling interactions that involve multiple conversational partners.
Much work has been done on sound source localization (SSL) by robots (for a review see [1]), and many of the methods are based on the cues that humans use to localize sound sources. Given an array of two or more spatially separated microphones, the sound from a source will arrive at each microphone at a different time. Thus, by measuring the time difference of arrival between microphones, and knowing the geometry of the microphone array, it is possible to estimate the DOA of the source. This method is analogous to the inter-aural time difference (ITD) cues used by humans. A related approach involves beamforming: the output level of a beamformer should be higher when it is steered in the direction of the source, so DOAs can be estimated by finding the look directions that correspond to maxima of the beamformer output level.

If an object is present between the microphones in an array, that object will alter the acoustic field and can change the level of the signals received at the different microphones. For example, if the object is large compared to the wavelength of the source, it can cast an acoustic "shadow": microphones for which the object blocks the direct path to the source will record lower levels than those with an unobstructed path. This is analogous to the inter-aural intensity differences (IID) used by humans (where the head can produce substantial level differences between the ears at high frequencies). Finally, for different DOAs, the geometry of the irregularly-shaped human pinnae (the outer part of the ear) produces patterns of constructive and destructive interference that vary with DOA. These spectral notches "colour" the sound received by the ear, so by estimating the pattern of notches it is possible to infer the DOA. Given the complexity of these patterns and their relationship with DOA, this spectral approach relies on learning methods.
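As a concrete illustration of the beamforming approach described above, the sketch below scans candidate look directions with a two-microphone delay-and-sum beamformer and returns the direction of maximum output power. The microphone spacing, sampling rate, and far-field geometry are illustrative assumptions, not values from this work.

```python
import numpy as np

def beamscan_doa(left, right, fs, mic_distance=0.18, c=343.0, n_angles=181):
    """Two-microphone delay-and-sum beam scan: steer over candidate
    angles and return the look direction with maximum output power.
    mic_distance (m) and far-field geometry are illustrative."""
    angles = np.linspace(-90.0, 90.0, n_angles)
    freqs = np.fft.rfftfreq(len(left), d=1.0 / fs)
    L, R = np.fft.rfft(left), np.fft.rfft(right)
    powers = []
    for theta in angles:
        # Far-field inter-microphone delay for this look direction.
        tau = mic_distance * np.sin(np.radians(theta)) / c
        # Steer by advancing the right channel by tau, then sum channels.
        steered = L + R * np.exp(2j * np.pi * freqs * tau)
        powers.append(np.sum(np.abs(steered) ** 2))
    return angles[int(np.argmax(powers))]
```

The output power peaks when the steering delay matches the true inter-microphone delay, which is the maximum-search principle described in the text.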
A further factor to consider in SSL is the effect of the environment. In general, sound sources radiate sounds in multiple directions. Surfaces that are present in the environment (e.g., walls, floor, ceiling, furniture, etc.) will reflect a portion of the incident sound. Thus, the sound signal recorded at a microphone will be a sum of the acoustic signal from the direct path between the source and the microphone and all the other paths that involve one or more reflections. In the context of DOA estimation, the paths that involve reflection will have a different DOA than that of the direct path.
In the context of human speech interactions, another key factor is the timing of turns. Previous work investigating human conversation has found that talkers start their turn approximately 200-300 ms after their partner has finished their turn [2][3][4]. To achieve human-like interactions, it is necessary for a humanoid robot to respond within a similar time frame. The latency of generating DOA estimates will limit how quickly a humanoid robot can respond to movement of the current talker or orient towards a new talker. Works such as [5,6] consider accurate DOA estimation on robotic systems, but latency and turn-taking must also be considered in the context of human-robot conversational scenarios.
Our work evaluates and optimizes a pipeline consisting of two main stages. The first stage continuously generates DOA estimates based on the acoustic signals received from two microphones placed on the head of a humanoid robot. The second stage categorizes these DOA estimates as "good" or not, where a good estimate likely corresponds with the direct signal from a human talker rather than background noise or a reverberant echo. Using a manually collected and labeled dataset, we investigate the pipeline's ability to detect direct human speech among background noise and self-generated robot sounds, to accurately estimate the direction of arrival, and to do so with acceptable detection latency. The parameters of the pipeline are optimized via either a brute force approach or a more efficient Bayesian optimization approach, which also sheds light on how the pipeline's performance depends on each chosen parameter.

DOA Estimation
The first main stage in the pipeline is to generate DOA estimates based on the acoustic signals received at multiple microphones. In the present study we consider the case of two spatially separated microphones. For this case, the simplest approach is to examine the cross-correlation of the two microphone signals to estimate the difference in arrival time between them. The received signals can be streamed in real-time or processed offline, and each resulting DOA estimate is then classified as "good" or not (i.e., whether the estimate likely corresponds with the direct path of speech from a talker).
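A minimal sketch of this cross-correlation approach follows, with an assumed microphone spacing and far-field geometry (the actual spacing on the robot head is not assumed here):

```python
import numpy as np

def estimate_doa(left, right, fs, mic_distance=0.18, c=343.0):
    """Estimate DOA (degrees) from two microphone signals via
    cross-correlation. mic_distance (m), the speed of sound c (m/s),
    and far-field geometry are illustrative assumptions."""
    # Find the lag (in samples) at which the two channels align best.
    corr = np.correlate(left, right, mode="full")
    lag = np.argmax(corr) - (len(right) - 1)
    tdoa = lag / fs  # positive: right channel leads (source toward right mic)
    # Far-field geometry: tdoa = (d / c) * sin(theta).
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))
```

In practice, a generalized cross-correlation with phase transform (GCC-PHAT) weighting is often used in place of the plain cross-correlation for robustness to reverberation.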

Power Onsets
If the microphone signal is split into frames with some window length, the energy in each frame will vary over time based on the fluctuations of the sound source. In a reverberant environment, when a talker stops speaking, it takes some time for the sound energy to decay. When a talker begins speaking, the sound from the direct path arrives at the microphone before the later reflections. Thus, a frame that has more energy than the previous frame (i.e., an onset) is likely to contain relatively more energy from the direct path than a frame that has less energy than the previous one.
Here we choose to use successive frame power ratios rather than differences, where an onset is detected if the power ratio lies within certain thresholds. For a frame F_i with power P_i, an onset is detected when δ_low < P_i / P_{i-1} < δ_high. The parameters δ_low and δ_high can be tuned and will depend on the environment of the recording. δ_low indicates the minimum required change in frame power, and δ_high establishes an upper limit to discriminate against very loud sounds, such as a crashing chair or a slammed door. Hence, direct human speech is considered to lie within a limited range of power onset values.
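This onset rule can be sketched as below; the threshold values are illustrative placeholders, since the tuned δ values for our environment come from the optimization described later:

```python
import numpy as np

def detect_onsets(signal, frame_len, delta_low=1.5, delta_high=20.0):
    """Flag frames whose power ratio to the previous frame lies in
    (delta_low, delta_high). Threshold values are illustrative."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    power = np.mean(frames ** 2, axis=1) + 1e-12  # avoid divide-by-zero
    ratio = power[1:] / power[:-1]
    # Onset: power rose enough to suggest direct-path energy, but not so
    # sharply that it looks like an impulsive sound (e.g., a slammed door).
    return (ratio > delta_low) & (ratio < delta_high)
```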

Speech-Reverberant Modulation Ratio
The speech-reverberant modulation ratio (SRMR) [9] is a metric that was developed to predict the intelligibility of speech in a given audio frame. Conceptually, anechoic speech (i.e., the direct signal) should have significant amplitude modulations between 4-16 Hz, which correspond with the rates of syllables and phonemes. In the presence of reverberation, delayed and attenuated versions of the acoustic signal are summed together, which increases the level of envelope fluctuations at higher modulation frequencies. Thus, a ratio of the modulation energy at low frequencies vs. that at higher frequencies provides a measure related to the energy of the direct signal vs. that of the reverberant components.
We apply our own lightweight implementation of the SRMR, first using the Hilbert transform to extract the envelope of the speech frame. The frequency content of the envelope is then analyzed by computing the ratio of the energy present in modulation bands associated with speech to that in modulation bands associated with reverberant audio content. The frequencies and bandwidths for the speech and reverberant bands are specified in [9]. Overall, the classification for a frame F_i is based on the ratio SRMR_i = Σ_{j ∈ speech} e_j / Σ_{j ∈ reverb} e_j, where e_j is the energy present in the j-th frequency band of the extracted envelope. This ratio is used as a potential measure of voice activity, and is given thresholds δ_low and δ_high for similar reasons as the power onsets.
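A simplified sketch of this computation is shown below. The band edges used here are illustrative stand-ins for the bands specified in [9], and the analytic-signal helper reproduces the Hilbert-transform envelope extraction with plain FFTs.

```python
import numpy as np

def analytic_signal(x):
    """Analytic signal via the FFT (equivalent to scipy.signal.hilbert)."""
    n = len(x)
    spec = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spec * h)

def srmr_lite(frame, fs, speech_band=(4.0, 16.0), reverb_band=(20.0, 128.0)):
    """SRMR-style ratio: envelope-modulation energy in a speech band vs a
    reverberant band. Band edges are illustrative, not those of [9]."""
    envelope = np.abs(analytic_signal(frame))      # Hilbert envelope
    envelope = envelope - envelope.mean()          # remove DC before analysis
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2  # modulation power spectrum
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    speech = spectrum[(freqs >= speech_band[0]) & (freqs < speech_band[1])].sum()
    reverb = spectrum[(freqs >= reverb_band[0]) & (freqs < reverb_band[1])].sum()
    return speech / (reverb + 1e-12)
```

A frame dominated by syllable-rate (4-16 Hz) envelope modulation yields a high ratio, while a frame whose envelope fluctuates mostly at higher rates (as reverberant tails tend to) yields a low one.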

Problem Statement
A number of methods have been introduced to perform the signal processing necessary for DOA estimation. These methods also involve numerical parameters, which must be selected for the human-robot interaction (HRI) task at hand. There is a need to identify the best parameters specifically for a binaural DOA setup on the REEM-C humanoid robot, which may be used in reverberant environments for the purposes of HRI. There is also a need to evaluate the implications of using these parameters in real-time, in terms of their accuracy and latency in HRI scenarios.
This work aims to tackle this problem by presenting a method to identify the best parameters for DOA estimation, including the classification methods, and numerical parameters such as frame sizes and thresholds. Parameters are optimized using a brute force and a Bayesian optimization approach, and used in a real-time implementation on the REEM-C, with a consideration for latency and potential applications for HRI.

Data Preparation
The binaural DOA setup is deployed on the robot with a taut headband that places the microphones on the head of the REEM-C at positions corresponding with human ears, providing a realistic and human-like appearance and configuration. A Scarlett 2i2 audio interface was used with 2 lavalier microphones. This setup was chosen as it is inexpensive and can be adapted and deployed to a wide range of robotics platforms.
Audio recordings were made in a lab environment with the robot operational. This simulates the noise that would be encountered while human-robot interaction scenarios are underway. The annotated periods of speech, as well as the ground truth locations of the speakers, were used to estimate the parameters of the DOA pipeline in later sections. The dataset involves 8 recordings to fit parameters and 3 recordings to test the results. Recordings were made in a variety of conditions: stationary vs. moving human talker, and with the robot stationary or performing certain pre-defined motions such as gestures with the arm, head or torso. Other non-speech sounds may also be present, such as footsteps, shifting of chairs and tapping of lab tools against table surfaces. As explained further in the optimization approaches, a weighting is applied to the training set to favour better performance on recordings with more difficult acoustic conditions. The specifics of each recording are shown in Table 1. A good set of parameters will result in classification that ignores the non-speech sounds but still accurately estimates the DOA of the human talker when they are speaking, even during the relatively noisy operations of the robot.

Bayesian Optimization
The Tree-structured Parzen Estimator (TPE) models the evaluated trials with two distributions: l(parameters), built from the trials with more favourable objective values, and g(parameters), built from the remaining trials. The method then selects parameters with a greater probability of being under l(parameters) than g(parameters). This informed reasoning is used to select the next set of hyperparameters while updating the two distributions, allowing the method to find an optimal set of parameters without exhaustively searching the entire parameter space. The TPE is implemented via the hyperopt package [10].
Given the parameter space in Table 2, the key difference from the brute force method is that the numerical parameters (frame size, step size, low and high thresholds) are now placed on continuous distributions. The uniform distributions for the frame size and step size ensure each value has an even chance of being selected. The normal distributions for the low and high thresholds are chosen based on a few tests that indicate where good thresholds may lie, given that the high threshold must be larger than the low threshold. Table 2 outlines the parameter spaces searched for both methods.
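To illustrate the l/g mechanism behind the TPE (our actual implementation uses the hyperopt package [10]), here is a toy one-dimensional sketch: past trials are split into favourable and unfavourable sets, each set is modelled with a simple Gaussian kernel density (l and g), and the next candidate maximizes the density ratio. All constants (split fraction, bandwidth, candidate count) are illustrative, not hyperopt's defaults.

```python
import numpy as np

def tpe_step(trials, losses, bounds, gamma=0.25, n_candidates=100, rng=None):
    """One toy TPE-style step for a single scalar parameter: split past
    trials into 'good' and 'bad' sets by loss, model each with a Gaussian
    KDE (l and g), and propose the candidate maximizing l/g."""
    rng = rng or np.random.default_rng()
    trials, losses = np.asarray(trials, float), np.asarray(losses, float)
    n_good = max(1, int(gamma * len(trials)))
    order = np.argsort(losses)
    good, bad = trials[order[:n_good]], trials[order[n_good:]]
    bw = 0.1 * (bounds[1] - bounds[0])  # fixed kernel bandwidth

    def kde(points, x):
        if len(points) == 0:
            return np.full_like(x, 1e-12)
        d = (x[:, None] - points[None, :]) / bw
        return np.exp(-0.5 * d ** 2).mean(axis=1) + 1e-12

    cand = rng.uniform(bounds[0], bounds[1], n_candidates)
    score = kde(good, cand) / kde(bad, cand)  # prefer high l(x) / g(x)
    return cand[np.argmax(score)]
```

Each proposal concentrates new evaluations where past trials were favourable, which is why the method avoids exhaustively searching the parameter space.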

Objective Functions
In order to properly define this optimization task, key variables are first defined.
The vector ω represents all input parameters for a single trial, which come from Table 2. The f1-score and mean squared error are then computed for a single trial with a set of parameters ω as per Eq 12. The vector σ is a set of weights used to compute the weighted average of the metrics generated for the data. We aim to weigh the more complex scenarios higher than the simpler scenarios, so recordings with extra non-speech sounds have a weight of 2, and recordings with robot motions occurring throughout have a weight of 3. We believe this weighted average will generate parameters that are better tuned to more complex auditory scenes, as opposed to a simple unweighted average across the recordings, where good performance in simpler scenarios may dominate the reported metrics.
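The weighting scheme above can be written directly; the helper below is a sketch (the function and argument names are ours, not from this work) that computes the weighted-average f1-score and MSE across recordings:

```python
import numpy as np

def weighted_objectives(f1_scores, mse_scores, weights):
    """Weighted averages of per-recording metrics. Weights follow the
    scheme in the text: 1 for simple recordings, 2 with extra non-speech
    sounds, 3 with robot motion. The f1 average is negated so both
    objectives are minimized."""
    w = np.asarray(weights, float)
    f1_obj = -np.average(np.asarray(f1_scores, float), weights=w)
    mse_obj = np.average(np.asarray(mse_scores, float), weights=w)
    return f1_obj, mse_obj
```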
The objective function varies for each optimization problem. The performance of the DOA estimate classification must be good, and given potential imbalances in the dataset, the f1-score is the metric to optimize for classification. Hence the classification objective computes the f1-score for every j-th recording and minimizes the negative of its weighted average, -Σ_j σ_j F1_j(ω) / Σ_j σ_j. For DOA estimation, the mean squared error is the objective to minimize, as the generated estimates and ground truth are continuous variables; this objective computes the weighted average mean squared error across every j-th recording, Σ_j σ_j MSE_j(ω) / Σ_j σ_j.
We also explore performing both optimizations at once in a joint manner. The joint optimization aims to minimize the MSE for DOA estimation while maximizing the F1-score for classification. The objective is formulated accordingly as a fraction, minimizing the weighted-average MSE divided by the weighted-average f1-score. Eq (15) shows this formulation.
A modification is added to regularize the frame size γ during the optimization, adding a penalty term λγ to the joint objective. This should favour smaller frame sizes that still give good results on both the DOA and classification tasks, meaning potentially lower latencies when used on the robot for real-time operation. The value of λ is set to 0.5 for this work. This objective function is helpful for investigating the effect of frame size on the final results. Eq (16) shows this regularized formulation.
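The joint and regularized objectives of Eqs (15) and (16) can be sketched as below; expressing the frame size γ in seconds is our assumption, since the exact scaling of the penalty term is not reproduced here.

```python
def joint_objective(weighted_mse, weighted_f1):
    """Eq (15)-style joint objective: weighted MSE over weighted F1
    (guarded against a zero F1)."""
    return weighted_mse / max(weighted_f1, 1e-12)

def regularized_objective(weighted_mse, weighted_f1, frame_size, lam=0.5):
    """Eq (16)-style objective: joint objective plus a penalty lam * gamma
    on the frame size (assumed here to be in seconds)."""
    return joint_objective(weighted_mse, weighted_f1) + lam * frame_size
```

Minimizing the fraction rewards both a low MSE and a high F1, while the λγ term trades a small amount of accuracy for shorter frames and hence lower real-time latency.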

Results
We present results for classification, for DOA estimation, and for the joint performance with both parameter search methods. A qualitative evaluation was also conducted on the chosen parameters, and other considerations not included in this optimization are discussed. Initial quantitative results are presented as contour plots for visualization; since the problem has a total of 6 dimensions, not all trends can be visualized. These results are further explored below in table form as well.

DOA Accuracy
The brute force method results are shown in Fig 7, comparing the frame size and step size to the weighted average MSE as a contour plot. The minima, shown as dark regions, occur primarily at larger frame sizes and larger step sizes. The DOA performance tends to worsen as the frame size or step size is reduced, indicating that the best choice for this task may require larger audio chunks when used in real-time. It is important to note that the brute force method is limited in its search, as it can only evaluate discrete numerical parameters, whereas the TPE method can choose values from continuous distributions. This will have an effect on how the TPE method learns. For the sake of interpretation and evaluation, the best frame size of 339 ms was adjusted to 340 ms, and the step size was adjusted to 0.90 rather than 0.92.
Finally, the best parameters from each method are applied to the test set, generating the results in Table 4. Smaller frame sizes will be required for more natural HRI behaviour. The joint regularized task produces the best results with lower frame sizes, as depicted by where the minima lie on the contour plots in Fig 11 and Fig 12. The best results for the study are therefore taken from the joint regularized task using the TPE method.

It is important to note that regardless of how the optimization is performed, good results are rarely found for both tasks with frame sizes less than 300 ms, as shown by where the minima lie in Fig 12. We suspect that this is due to the calculations involved in computing the SRMR. The SRMR involves studying the modulation of the speech signal via its envelope and extracting the energies present in certain bands of this envelope. The lowest frequency band for this metric was centred at 4 Hz, so one period of this modulation corresponds with 250 ms. Frame sizes shorter than this may result in inaccurate estimation of the 4 Hz component of the modulation energy. Thus, the use of the SRMR as it was defined here may impose a minimum latency that is too long to achieve human-like behaviour. While increasing the minimum modulation frequency used in the SRMR would reduce the minimum latency, further work is needed to determine the effect this would have on classification performance. If other classification methods are explored, minimum latencies should be less than 200 ms.

Table 3 gives further insight into the numerical performance of these methods. The values indicate differences in performance when searching for the minima of the different objectives. Numbers in bold indicate the notable differences in performance when optimizing for either loss value.
For instance, using the brute force method, searching for the minimum of the regularized objective provides the same MSE of 0.09 as found with the unregularized objective, but gives an F1-score improvement to 0.68 from 0.61. Similarly, the TPE method sees both a decrease in MSE from 0.07 to 0.06 and an increase in F1-score from 0.66 to 0.72 when minimizing the regularized objective as opposed to the unregularized objective. This is supporting evidence that the joint objective function was appropriate for this problem, as both the classification and DOA tasks are performed to an acceptable level with the best parameters ω_best.
Furthermore, the regularized objective also provides smaller frame sizes along with the performance gain. This is evidence that the regularized objective was helpful for the TPE in its learning process. However, since the method only evaluates a subset of the parameter space, this may come down to randomness in its choice of parameters, explaining why the unregularized version did not find similar parameters.
The results on the test set show that these parameters are reasonable and have not overfit to the training set, as the performance on both tasks is good for all 3 test set recordings. We also see that the TPE method generates a much better MSE than the brute force method, but a slightly worse F1-score, as per Table 4. We suspect that the lower step size from the brute force method provides greater resolution for identifying speech in the microphone signals, leading to slightly better classification performance on test data. Test 9 sees a higher MSE than the other two tests; we hypothesize that this is most likely because the subject moved farther from and closer to the robot, as opposed to maintaining a similar distance as in tests 10 and 11. This may simulate more realistic human-robot interaction scenarios, and may require further improvements to handle well.
With a functioning sound source localization pipeline in real-time, the potential for HRI can be expanded. For instance, if moving conversational partners can be detected by the robot, HRI can be augmented by implementing a human-like, realistic tracking behaviour. Motion capture analysis and modeling of the head, shoulders and feet such as in [12] can be applied for this purpose.