
3 directional Inception-ResUNet: Deep spatial feature learning for multichannel singing voice separation with distortion

  • DaDong Wang ,

    Roles Conceptualization, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Writing – original draft, Writing – review & editing

    jsjxywdd@jlnu.edu.cn

    Affiliation School of Mathematics and Computer Science, Jilin Normal University, Siping, Jilin, China

  • Jie Wang,

    Roles Data curation, Resources, Software, Validation

    Affiliation School of Mathematics and Computer Science, Jilin Normal University, Siping, Jilin, China

  • MingChen Sun

    Roles Writing – review & editing

    Affiliation School of Computer Science and Technology, Jilin University, Changchun, Jilin, China

Abstract

Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that a humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, with distortion, attenuation, and reverberation. In this paper, we used the 3D Inception-ResUNet structure in a U-shaped encoding and decoding network to improve the utilization of the spatial and spectral information of the spectrogram. Multiple objectives were used to train the model: magnitude consistency loss, phase consistency loss, and magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized a 10-channel dataset for training the model. The experimental results show that the proposed model trained with multiple objectives reaches an average NSDR of 11.55 dB on the test dataset, outperforming the comparison models.

1 Introduction

Music training delivered by robots is more engaging than training with other devices. The robot must handle accompaniment, intelligent music synthesis, interactive scoring, singing voice separation, and lyric synchronization. Singing voice separation is the basis of the other functions and is also important for improving robot speech recognition accuracy. Robots usually have 2–4 microphones. Recently, an increasing number of studies have focused on exploring multimicrophone source separation in real-world applications [1]. Due to their physiological structure, humans can easily distinguish between the singing voice and the instrumental accompaniment when listening to a song. However, this is challenging for machine or deep learning models since the singing voice and the accompaniment are strongly correlated in time and frequency. Moreover, multichannel singing voice separation is even more challenging due to model complexity, background noise, microphone distortion, and other factors.

In the early stages, there were two major approaches to multichannel source separation: microphone array processing and blind source separation (BSS) [2]. BSS approaches usually exploit the statistical characteristics of the mixture of singing voice, accompaniment, and noise, whereas microphone array processing approaches usually rely on explicit signal models. The two approaches often borrow ideas from each other. Recently, supervised singing voice separation using deep neural networks (DNNs) has received widespread attention from researchers and achieved great success [3]. Typically, these methods [4, 5] learn a mapping function from singing voice features to separation targets through supervised learning algorithms. Compared to signal models, deep learning can automatically extract the most powerful singing voice features from the mixture. Deep learning models can process the original high-dimensional data without requiring hand-crafted features, mine the structured features in the singing voice, and output structured predictions. Using DNNs in training has emerged as a promising trend in both microphone array processing and BSS [6, 7].

Researchers have proposed many deep learning models for BSS, including recurrent neural networks (RNNs) [8], convolutional neural networks (CNNs) [9], U-Nets [4, 10, 11], long short-term memory (LSTM) networks [12], generative adversarial networks (GANs) [13, 14], etc. The results show that a DNN model trained with the singing voices of dozens of singers can separate the singing voices of other singers. Most DNN models [12, 14] operate on the time-frequency (T-F) domain spectrogram generated by the short-time Fourier transform (STFT) and extract the spectral characteristics of the singing voice and the accompaniment.

Microphone array processing approaches traditionally utilize spatial cues such as geometry-based information to construct signal models [15]. In recent years, DNNs have emerged in multimicrophone array processing approaches for parameter estimation [16] and for spatial and spectral feature extraction [7, 17]. Jointly modeling spatial and spectral information can potentially improve separation performance [2]. However, most previous DNN-based approaches do not fully utilize the spatial and spectral information of the spectrogram, or they lose part of this information during training, which leaves residual noise in the separation results. Additionally, selecting a proper training target for singing voice separation is difficult. A single-objective loss, such as the MSE or L1 loss, converges faster, but the results are not necessarily the best, while the problem with multiobjective training is balancing the multiple objectives. The code and data are available on GitHub (https://github.com/sheiaaaa/geshengfenli).

Our contributions include three aspects:

  • We propose a 3-directional Inception-ResUNet framework for improving the utilization of spatial and spectral information of the spectrogram.
  • We design a joint training objective strategy for obtaining better separation performance, which includes magnitude consistency loss, phase consistency loss and magnitude correlation consistency loss.
  • We construct a 10-channel dataset that can be used to test multichannel singing voice separation algorithms.

2 Related work

2.1 General training target

In the context of sound source separation and localization, the combined information from interchannel level differences (ILDs), interchannel phase differences (IPDs) [18], and spectrograms can be effectively leveraged using spectral magnitude masking (SMM) [19], phase-sensitive masking (PSM) [20], or complex ideal ratio masking (cIRM) [21] as the training objective. By employing these techniques, it is possible to improve the performance of sound source separation algorithms and achieve more accurate localization results. In general, the STFT spectrogram expressed in complex numbers consists of two kinds of information: magnitude and phase. As the training data must be scalar, some studies use magnitudes, namely, the modulus of the complex numbers. Jansson et al. [4] decompose an audio signal by converting it to an image, processing it with a U-Net neural network, and applying the resulting spectral mask. GNU-Net [10] leverages a supervised symmetric encoder-decoder architecture for generating full-resolution feature maps. SVSGAN [14] leverages a generative adversarial network with a time-frequency masking function for singing voice separation. SMM [19] involves masking the spectrogram based on the energy distribution across frequency bins, allowing for better separation of the target sources. The spectral magnitude mask training target can be defined as the magnitude of the clean singing voice divided by that of the mixture. PSM [20] extends the SMM by multiplying it by cos θ, where θ denotes the phase difference between the clean singing voice and the mixture. It focuses on preserving the phase information in the separated signals, ensuring that the localization accuracy is maintained. Complex ideal ratio masking (cIRM) [21] is a more advanced technique that combines both spectral and phase information to generate masks that improve the quality of the separated signals. In addition, compared to monaural singing voice separation, multichannel singing voice separation can use spatial information in addition to spectral information. The ILD and IPD can be used in training. Yilmaz et al. [18] proposed W-disjoint orthogonality for effectively separating mixture signals. Chen et al. [7] proposed a multichannel learning-based method for sound source separation in a reverberant field. Leglaive et al. [22] designed a probabilistic reverberation method for separating multichannel audio sources.
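For concreteness, the SMM and PSM training targets described above can be computed from the clean-source and mixture spectrograms roughly as follows. This is a minimal sketch; the function name and the choice to clip the masks to a fixed range are our assumptions, not taken from the cited papers.

```python
import numpy as np

def smm_psm_targets(S, X, eps=1e-8, clip=1.0):
    """Compute SMM and PSM targets from the clean-source STFT S and the
    mixture STFT X (both complex arrays of shape (F, T)).
    Clipping the masks to [0, clip] is a common practical choice, assumed here."""
    smm = np.abs(S) / (np.abs(X) + eps)        # SMM: |clean| / |mixture|
    theta = np.angle(S) - np.angle(X)          # phase difference clean vs. mixture
    psm = smm * np.cos(theta)                  # PSM: SMM scaled by cos(theta)
    return np.clip(smm, 0.0, clip), np.clip(psm, 0.0, clip)
```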

In summary, by incorporating the ILD, IPD, and spectrogram information and utilizing SMM, PSM, or cIRM as the training targets, it is possible to develop more robust and accurate sound source separation and localization algorithms. This can be particularly useful in applications such as audio postprocessing, music production, and audio enhancement, where the ability to separate and localize sound sources accurately is crucial for achieving optimal results.

2.2 Deep learning singing voice separation

Multichannel singing voice separation is a regression problem. Recently, most methods have adopted an encoder-decoder structure to solve this problem. The encoder typically uses convolution and pooling to extract the spectral features of the clean singing voice from the mixture of the ILD, IPD, and spectrogram, while the decoder uses deconvolution to recover the spectrogram of the clean singing voice. As downsampling causes a loss of detail, upsampling is usually compensated with skip connections that combine the spectrogram with the result of upsampling in the same layer. Since Wang et al. used a 4-layer DNN to separate sources [23], dozens of methods for singing voice separation using DNNs, such as CNNs, RNNs, and various variants, have been proposed [3]. Stoter et al. used three bidirectional LSTMs to compose a benchmark system on the MUSDB18 dataset [24]. After Jansson et al. used U-Net for singing voice separation and surpassed previous methods [4, 11], several improved versions based on the U-Net architecture achieved better performance. Qian et al. used stripe-transformer blocks to learn deep stripe features in encoder and decoder blocks composed of residual CNN blocks [5]. Geng et al. developed a gated nested U-Net (GNU-Net) architecture to generate full-resolution feature maps [10]. Yuan et al. used genetic algorithms to search for effective MRP-CNN structures, which are composed of various-sized pooling operators, to extract multiresolution features [25]. The above methods are spectrogram-based methods with better performance than U-Net. Simon et al. used a hybrid model in the newest Demucs system; the hybrid model has a parallel time branch in addition to the spectrogram branch [26]. Kong et al. constructed a residual U-Net architecture with a time branch and a spectrogram branch and estimated the phase with cIRMs [27]. The latter two methods combine spectrogram and time-domain encoding and decoding structures, which significantly improves the separation performance on the MUSDB18 dataset. However, when separating singing voices accompanied by noise and distortion, the separation performance of all the above methods degrades significantly.

In summary, the U-shaped encoding and decoding network has been widely adopted for singing voice separation. Adding components that improve network performance to the downsampling and upsampling paths can improve the separation performance, and combining spectrogram and time-domain processing can achieve better results. However, all of the models mentioned above were trained on datasets without distortion.

2.3 Robot music accompaniment studies

With the rapid development of robot technology, increasing attention has been given to the combination of robots and music composition. As an interdisciplinary field, robot music accompaniment has attracted the attention of computer scientists, musicians, and artists, and it has brought new possibilities for robot applications and music education. In this field, many researchers have achieved significant results, including the development of algorithms that can automatically create music [28], the combination of robots and musical instruments to achieve human-machine collaboration [29], and the design of intelligent systems that can understand music and dance [30]. PepperOSC [31] connects the Pepper and NAO robots with sound production tools, which improves the effectiveness and attractiveness of human-robot interaction. Pluta et al. [32] used a robot to explore the re-excitation of an acoustic guitar string and improved a simple synthesis model of a vibrating string based on the finite difference method. Wang et al. [33] effectively combined music and robots to make the robot accurately express music in real time. Engstrom et al. [34] designed a robot application that plays drums in rhythm to an external audio source. Qin et al. [35] developed a humanoid robot dance system driven by musical structures and emotions. Okamoto et al. [36] proposed a dancing robot system that can make the robot listen to and dance along with musical performances. Bando et al. [37] explored sound source localization and separation on robots. Chu et al. [38] proposed a deep learning-based method to identify musical beats and styles to construct a human dancing robot. Byambatsogt et al. [39] presented a multitask learning-based model for guitar chord recognition. Jung et al. [40] proposed a music therapy robot to alleviate depressive emotions.

In summary, robot music accompaniment studies have achieved remarkable results in algorithm development and human-machine collaborative performance. They not only enrich the possibilities of robot applications and music education but also provide new perspectives for understanding and exploring music composition and performance.

3 Problem statement and formulation

Most DNN-based singing voice separation methods consist of three stages [10, 12, 14]:

  • Time-frequency transformation. The time domain signals of the singing voice and mixture are decomposed into two-dimensional time-frequency-domain spectrograms by STFT.
  • Separation model construction. The model outputs a soft mask that separates the mixture spectrogram into a singing voice spectrogram and a nonvoice spectrogram.
  • Frequency-time transformation. The target singing voices in the time domain are reconstructed from the mixture spectrogram multiplied elementwise with the mask by inverse short-time Fourier transform (ISTFT).

The time-domain mixture signal gathered by the ith microphone can be defined as follows:
(1) x_i(t) = Σ_{k=1}^{N} Σ_{l} h_{ik}(l) s_k(t − l), i ∈ {1, 2, …, M},
where N denotes the number of sources, M denotes the number of microphones, s_k(t) denotes the signal of the kth source, h_{ik}(l) denotes the impulse response from the kth source to the ith microphone, and l denotes the impulse index. The spectrogram at time-frequency point (t, f) of x_i(t) can be approximated as [41]
(2) x_i(t, f) ≈ Σ_{k=1}^{N} h_{ik}(f) s_k(t, f),
where h_{ik}(f) denotes the frequency response from the kth source and s_k(t, f) is the STFT of s_k(t). Since multiple noise sounds can be modeled as a single source [2], we denote by Y_i^M(t, f), Y_i^V(t, f), and Y_i^N(t, f) the spectrograms of the music, singing voice, and noise recorded by the ith microphone, respectively. x_i(t, f) can then be described by
(3) x_i(t, f) = Y_i^M(t, f) + Y_i^V(t, f) + Y_i^N(t, f),
where x_i(t, f) denotes the spectrogram of x_i(t) and each term represents h_{ik}(f) s_k(t, f).

Taking the ith microphone as a reference, we used two relative transfer functions between the ith microphone and the kth microphone [15]. The ILD between the ith microphone and the kth microphone is defined as
(4) ILD_k(t, f) = 20 log10( |x_i(t, f)| / |x_k(t, f)| ).
The phase difference (IPD) between the ith microphone and the kth microphone is calculated as
(5) IPD_k(t, f) = e^{j(∠x_i(t, f) − ∠x_k(t, f))},
where ∠ denotes the phase in radians of a complex number.

We concatenate the spatial cue ILD with the real component of the IPD and the imaginary component of the IPD. Subsequently, we append the spectral features of each microphone to form the input features, which can be defined as follows:
(6) I(t, f) = concat( {ILD_k(t, f), IPD_k(t, f).real, IPD_k(t, f).imag}_{k ≠ i}, {|x_k(t, f)|}_{k = 1, …, M} ),
where IPD(t, f).real denotes the real component of the IPD and IPD(t, f).imag denotes the imaginary component of the IPD.
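The sketch below illustrates how such spatial and spectral features could be assembled from a multichannel STFT. The function name, the feature ordering, and the dB scaling of the ILD are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

def spatial_spectral_features(X, ref=1, eps=1e-8):
    """Build input features from a multichannel STFT.

    X   : complex array of shape (M, F, T), the STFTs of the M microphone signals
    ref : index of the reference microphone
    Returns an array of shape (C, F, T) stacking ILD, cos(IPD), and sin(IPD)
    for every non-reference channel, followed by all magnitude spectrograms.
    """
    M, F, T = X.shape
    feats = []
    for k in range(M):
        if k == ref:
            continue
        # Interchannel level difference (assumed here in dB), cf. Eq (4)
        ild = 20.0 * np.log10((np.abs(X[ref]) + eps) / (np.abs(X[k]) + eps))
        # Interchannel phase difference as a complex exponential, cf. Eq (5)
        ipd = np.exp(1j * (np.angle(X[ref]) - np.angle(X[k])))
        feats += [ild, ipd.real, ipd.imag]
    # Spectral features: magnitude spectrogram of every microphone
    feats += [np.abs(X[k]) for k in range(M)]
    return np.stack(feats, axis=0)
```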

The prediction target is the magnitude spectrogram of the singing voice. After training, the DNN model outputs a time-frequency mask that predicts the magnitude spectrogram of the target singing voice from the multichannel spectrogram. The mask [20] can be defined as
(7) m1(t, f) = |Y_i^V(t, f)| / |X_i(t, f)|,
(8) m2(t, f) = (cos θ(t, f), sin θ(t, f)),
where X_i(t, f) denotes the spectrogram of the reference microphone, f = 1, 2, 3, …, F denotes different frequencies, and θ denotes the difference between the predicted singing voice phase and the mixture phase of the reference microphone. We apply the soft mask to X_i to estimate the predicted separation spectrogram Ŷ_i^V(t, f), which can be defined as follows:
(9) Ŷ_i^V(t, f).real = m1(t, f) ⊗ |x_i(t, f)| ⊗ cos(∠x_i(t, f) + θ(t, f)),
(10) Ŷ_i^V(t, f).imag = m1(t, f) ⊗ |x_i(t, f)| ⊗ sin(∠x_i(t, f) + θ(t, f)),
where ∠ denotes the phase in radians of a complex number, x_i(t, f) denotes the spectrogram of x_i(t), ⊗ stands for the elementwise operation, Ŷ_i^V(t, f).real denotes the real component of Ŷ_i^V(t, f), and Ŷ_i^V(t, f).imag denotes the imaginary component of Ŷ_i^V(t, f).
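A minimal sketch of how predicted submasks of this kind could be applied to the reference-channel spectrogram and resynthesized with an inverse STFT is shown below. The function and variable names are ours, and the STFT parameters mirror those given in Section 4; this is not the paper's implementation.

```python
import numpy as np
import librosa

def reconstruct_voice(X_ref, m1, cos_theta, sin_theta,
                      hop_length=512, win_length=1024):
    """Apply a phase-sensitive mask to the reference-channel STFT and resynthesize.

    X_ref                    : complex STFT of the reference microphone, shape (F, T)
    m1, cos_theta, sin_theta : predicted submasks, each of shape (F, T)
    """
    mag = np.abs(X_ref)
    phase = np.angle(X_ref)
    # Real part of the estimated singing-voice spectrogram, cf. Eq (9):
    # cos(phase + theta) expanded via the addition theorem
    real = m1 * mag * (np.cos(phase) * cos_theta - np.sin(phase) * sin_theta)
    # Imaginary part, cf. Eq (10): sin(phase + theta)
    imag = m1 * mag * (np.sin(phase) * cos_theta + np.cos(phase) * sin_theta)
    Y_hat = real + 1j * imag
    # Back to the time domain with the inverse STFT
    return librosa.istft(Y_hat, hop_length=hop_length, win_length=win_length)
```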

3.1 Overall architecture

The proposed model is shown in Fig 1. It consists of 6 encoder/decoder layers. The first encoder layer consists of 3 directional Inception-ResNet blocks. Both the second and third encoder layers consist of an Inception-ResNet block and a reduction block. The fourth and fifth encoder layers each consist of a reduction block. In each decoder layer, we first used a fractionally strided convolution with stride 2 and kernel size 2×2, batch normalization, and LeakyReLU, then two convolutions with stride 1 and kernel size 3×3, batch normalization, and LeakyReLU, followed by 4 Inception-ResNet blocks. In the final layer, we used 1 × 1 convolutions and a sigmoid activation function to output a 1-channel mask. The mask consists of three submasks: m1(t, f), cos(θ), and sin(θ).
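As an illustration of the decoder layers described above (fractionally strided 2×2 convolution with stride 2, batch normalization, and LeakyReLU, followed by two 3×3 convolutions), a minimal PyTorch sketch is given below. The channel counts are placeholders, and the Inception-ResNet blocks and skip connections that follow each decoder layer are omitted.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: upsample with a fractionally strided convolution,
    then refine with two 3x3 convolutions. Channel sizes are illustrative;
    the paper's Inception-ResNet blocks and skip connections are omitted."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Sequential(
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )
        self.refine = nn.Sequential(
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.2),
        )

    def forward(self, x):
        return self.refine(self.up(x))
```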

4 Inception-ResUNet framework

In our approach, the songs are recorded at a sample rate of 16,000 Hz. We leverage a 1024-point window size and a 512-point hop size in STFT. Namely, the time window is 64 ms, and the overlap between two consecutive windows is 32 ms. Thus there are 32 time windows in 1 s. Each window is transformed by STFT, generating complex coefficients of 512 valid positive frequency channels between 0 and 8,000 Hz. Therefore, the signal lasting 2 s will be transformed to a 512 × 64 spectrogram.
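For reference, the STFT configuration described above can be reproduced with librosa roughly as follows. Dropping one frequency bin to obtain exactly 512 channels and trimming to 64 frames are our assumptions about how the 512 × 64 shape is reached; the file path is a placeholder.

```python
import numpy as np
import librosa

# Load a 2-second, 16 kHz excerpt (the path is a placeholder)
y, sr = librosa.load("clip.wav", sr=16000, duration=2.0)

# 1024-point window (64 ms) with a 512-point hop, i.e. 32 ms overlap
X = librosa.stft(y, n_fft=1024, hop_length=512, win_length=1024)
print(X.shape)  # (513, ~63-64 frames, depending on padding)

# Dropping one bin to keep 512 positive-frequency channels and trimming or
# padding to 64 frames gives the 512 x 64 spectrogram described in the text
# (the exact cropping convention is an assumption on our part).
spec = np.abs(X)[1:, :64]
```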

4.1 Multichannel singing voice alignment

The recording scenarios are shown in Fig 2. The recording equipment includes a computer, a robot, and an external speaker. The computer connects to the NAO robot via a wireless network. The external speaker, which plays the singing voice sent from the computer, is placed in front of the robot's head. The robot plays the accompaniment and records the mixture.

There are three types of delay during recording: network transmission delay, sound propagation delay, and processing delay. As the singing voice and the accompaniment are played on the computer and the robot, respectively, the sound propagation delay of the two sources is different. When recordings, singing voices, and accompaniments are combined into a training dataset, they must be aligned.

Let CR_i denote the ith channel recording and V denote the singing voice, which can be defined as follows:
(11) CR_i = {cr_0^i, cr_1^i, …}, V = {v_0, v_1, …},
where |CR_i| = N − 1, |V| = M − 1, cr_j^i denotes the jth data sample in the ith channel recording, and v_k denotes the kth original value in the singing voice. Subsequently, we slice CR_i and V for singing voice alignment, which is defined as follows:
(12) CR_i^p = {cr_p^i, cr_{p+1}^i, …, cr_{p+L−1}^i}, V^q = {v_q, v_{q+1}, …, v_{q+L−1}},
where CR_i^p denotes the recording fragment starting at p, V^q denotes the singing voice fragment starting at q, L is the sliding window size, and α ∈ [0, 1] denotes the adjustable coefficient. When CR_i^p and V^q are aligned, p and q are calculated as follows:
(13) (p, q) = argmax_{p, q} E[(CR_i^p − E[CR_i^p])(V^q − E[V^q])] / (σ_{CR_i^p} σ_{V^q}),
where E is the mathematical mean, and σ_{CR_i^p} and σ_{V^q} are the standard deviation of CR_i^p and the standard deviation of V^q, respectively.
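The alignment step amounts to finding the offsets that maximize the normalized cross-correlation (Pearson correlation) between a recorded fragment and a clean-signal fragment, as in Eq (13). The sketch below uses a brute-force search over q with p fixed at 0; the paper's exact search strategy and window settings may differ.

```python
import numpy as np

def align_offsets(recording, clean, L=64000, max_lag=16000):
    """Find offsets (p, q) maximizing the Pearson correlation between
    length-L fragments of the recording and the clean signal.
    A brute-force search over q with p fixed at 0 is used here for clarity."""
    p = 0
    rec = recording[p:p + L]
    rec = (rec - rec.mean()) / (rec.std() + 1e-8)
    best_q, best_corr = 0, -np.inf
    for q in range(max_lag):
        ref = clean[q:q + L]
        if len(ref) < L:
            break
        ref = (ref - ref.mean()) / (ref.std() + 1e-8)
        corr = np.mean(rec * ref)   # normalized cross-correlation, cf. Eq (13)
        if corr > best_corr:
            best_q, best_corr = q, corr
    return p, best_q, best_corr
```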

4.2 3 directional Inception-ResNet blocks

The detailed structure of the 3 directional Inception-ResNet blocks is shown in Fig 3. The implementations of the two horizontally oriented Inception-ResNet blocks are similar to those of [42, 43], except that each 5 × 5 2D convolution is replaced by two 3 × 3 convolutions. The flip block in Fig 3 represents flipping the spectrogram in the horizontal direction. We used a scaling factor of 0.17 to scale the residuals. Ten iterations of the Inception-A block were used in the horizontal direction to cover the spectrogram. The vertically oriented Inception-ResNet block, which is implemented with 3D convolutions to extract the spatial features of the multichannel spectrogram, also includes 3 branches: 1×1 convolution, 3×3 convolution, and 5×5 convolution.
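The residual scaling and the horizontal-flip pass used by these blocks can be sketched in PyTorch as follows. The branch widths are placeholders, the 5×5 branch is realized as two stacked 3×3 convolutions as described above, and the vertical 3D-convolution branch is omitted for brevity; this is not the paper's exact block.

```python
import torch
import torch.nn as nn

class InceptionResBlockA(nn.Module):
    """Simplified Inception-ResNet-style block with residual scaling (0.17).
    Branch widths are illustrative, not the paper's exact configuration."""
    def __init__(self, ch, scale=0.17):
        super().__init__()
        self.scale = scale
        self.b1 = nn.Conv2d(ch, ch // 4, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(ch, ch // 4, kernel_size=1),
            nn.Conv2d(ch // 4, ch // 4, kernel_size=3, padding=1),
        )
        # 5x5 receptive field realized by two stacked 3x3 convolutions
        self.b3 = nn.Sequential(
            nn.Conv2d(ch, ch // 4, kernel_size=1),
            nn.Conv2d(ch // 4, ch // 4, kernel_size=3, padding=1),
            nn.Conv2d(ch // 4, ch // 4, kernel_size=3, padding=1),
        )
        self.merge = nn.Conv2d(3 * (ch // 4), ch, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        branches = torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)
        # Residual connection with scaled branch output
        return self.act(x + self.scale * self.merge(branches))

def flipped_pass(block, x):
    """Second horizontal direction: process a spectrogram flipped along the
    last (time) axis and flip the result back, realizing the flip block of Fig 3."""
    return torch.flip(block(torch.flip(x, dims=[-1])), dims=[-1])
```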

4.3 Inception and reduction layers

The Inception-ResNet blocks used in the second and third encoder layers are shown in Fig 4. Each block is followed by a reduction filter. To reduce the computational cost, we replace each n × n convolution in the Inception-B and Inception-C blocks with a 1 × n convolution followed by an n × 1 convolution: a 1 × 7 convolution followed by a 7 × 1 convolution replaces the 7 × 7 convolution, and a 1 × 3 convolution followed by a 3 × 1 convolution replaces the 3 × 3 convolution. Each block is iterated 5 times to cover the entire spectrogram. The scaling factor is set to 0.2 in the Inception-B and Inception-C blocks.
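The factorization of an n × n convolution into a 1 × n convolution followed by an n × 1 convolution can be written, for example, as the small helper below (the channel arguments are placeholders).

```python
import torch.nn as nn

def factorized_conv(in_ch, out_ch, n=7):
    """Replace an n x n convolution by a 1 x n convolution followed by an
    n x 1 convolution, keeping the same receptive field while reducing the
    number of parameters per channel pair from n*n to 2*n."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=(1, n), padding=(0, n // 2)),
        nn.Conv2d(out_ch, out_ch, kernel_size=(n, 1), padding=(n // 2, 0)),
    )
```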

Fig 4. The structure of Inception-B, Inception-C, reduction, and decoder blocks.

https://doi.org/10.1371/journal.pone.0289453.g004

Convolutional networks usually use maximum or average pooling operations to reduce the size of the activation map. Maximum and average pooling are fast and memory-efficient but lose some information in the activation map. To avoid a representational bottleneck, a series of pooling methods, such as power average pooling, stochastic pooling, local importance pooling, and soft pooling, have been proposed [44]. Our implementation of the reduction block in each encoder layer is similar to that of [42]. Two parallel 3 × 3 convolutions with stride 2 are concatenated, as shown in Fig 4. One of the reduction blocks expands the filter banks to avoid the representational bottleneck [43].
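A reduction block of the kind described here, with two parallel stride-2 3×3 paths whose outputs are concatenated, could look roughly like the sketch below; the channel counts and the 1×1 bottleneck in the second path are placeholders.

```python
import torch
import torch.nn as nn

class ReductionBlock(nn.Module):
    """Downsample by concatenating two parallel strided 3x3 convolution paths,
    expanding the filter bank instead of pooling (channel sizes illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.path1 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, stride=2, padding=1)
        self.path2 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, kernel_size=1),
            nn.Conv2d(out_ch // 2, out_ch // 2, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):
        return torch.cat([self.path1(x), self.path2(x)], dim=1)
```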

4.4 Overall optimizing objective

To prevent the predicted spectrogram from being independent of phase, we use a hybrid phase-dependent loss function to train m1(t, f) and m2(t, f).

(1) The magnitude consistency loss LossM can be defined as follows: (14) where Ri(t, f) denotes the normalized spectrogram of the target clean singing voice and E denotes the mathematical expectation over the data domains T and F.

(2) The phase consistency loss LossP = LossP1 + LossP2 can be defined as follows: (15) (16) where ∠ denotes the phase in radians of a complex number, θ denotes the difference between the predicted singing voice phase and the mixture phase of the reference microphone, xi(t, f) denotes the spectrogram of xi(t), Ri(t, f).real denotes the real component of Ri(t, f), and Ri(t, f).imag denotes the imaginary component of Ri(t, f).

(3) The magnitude correlation consistency loss LossC can be defined as follows: (17) The overall training objective can be defined as follows: (18) where LossM denotes the magnitude consistency loss, LossP denotes the phase consistency loss, and LossC denotes the magnitude correlation consistency loss.
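Since Eqs (14)-(18) are not reproduced here, the sketch below only illustrates one plausible realization of the three objectives: an L1 magnitude term, L1 terms on the cos/sin phase components, and a Pearson-correlation term for LossC. The exact functional forms and any weighting used in the paper are assumptions, not its definitions.

```python
import torch

def combined_loss(pred_mag, target_mag, pred_cos, pred_sin, target_cos, target_sin):
    """Illustrative multiobjective loss: magnitude (LossM), phase (LossP),
    and magnitude-correlation (LossC) terms; an assumed realization only."""
    # LossM: magnitude consistency (assumed L1)
    loss_m = torch.mean(torch.abs(pred_mag - target_mag))

    # LossP: phase consistency on the cos/sin components (assumed L1)
    loss_p = torch.mean(torch.abs(pred_cos - target_cos)) + \
             torch.mean(torch.abs(pred_sin - target_sin))

    # LossC: magnitude correlation consistency, here 1 - Pearson correlation
    pm = pred_mag - pred_mag.mean()
    tm = target_mag - target_mag.mean()
    corr = (pm * tm).sum() / (pm.norm() * tm.norm() + 1e-8)
    loss_c = 1.0 - corr

    return loss_m + loss_p + loss_c
```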

5 Experiment

5.1 Data description and preprocessing

In our experiment, the NAO robot recorded a 4-channel mixture in which the accompaniment and singing voice were derived from the public MIR-1K dataset. The MIR-1K dataset is composed of 1000 clips segmented from 110 songs. All clips are sampled at 16,000 Hz. The left and right channels of these clips contain the accompaniment and the singing voice, respectively. Table 1 shows parameters such as the sample rate, resolution, clip duration, number of clips, number of singers, number of channels, and total duration of the MIR-1K dataset.

The dataset production process consists of five steps: separation, recording, downloading, alignment, and combination, as shown in Fig 5. The stereo clips in the MIR-1K dataset are separated into two monaural clips: the singing voice clip and the accompaniment clip. These are played by the computer and the NAO robot during the recording process. Acoustic signals are gathered by the 4 microphones and stored on the NAO robot. The 4-channel recording is aligned with the clean singing voice and accompaniment, yielding four aligned monaural singing voice clips and one aligned monaural accompaniment clip. Finally, the 4-channel recording clip, the 4 aligned monaural singing voice clips, the 1 aligned monaural accompaniment clip, and a monaural noise clip are combined into a 10-channel clip.

In our experiment, the song clips were recorded in an unshielded lab. The background noise also included the noise generated by the fan on the robot's head. The song clips were sampled at 16,000 Hz, and L was set to 64,000. As the distances between the microphones on the robot's head are less than 12 cm, the delay deviations between the microphones are less than 0.3 ms, or 5 sampling periods. The experiments showed that in scenario 1, most delay deviations were 3 sampling periods. In this paper, we calculated p and q for each channel. Clips with a q deviation between 2 channels exceeding 6 sampling periods were discarded. The training dataset included 2,211 aligned 10-channel clips.

The SNR of the singing voice in different scenarios is shown in Fig 6. The mean SNR of the singing voice recorded by the second microphone was the largest. The second microphone was chosen as the reference microphone in our experiment.

5.2 Implementations and metrics

In the training stage, the magnitude spectrogram of the mixture was used as the network input, and the magnitude spectrogram of the singing voice was used as the target in the loss function to measure the gap between the predicted result and the singing voice. To evaluate the performance of the proposed model, we trained it on our dataset with 10 Inception-A blocks, 5 Inception-B blocks, and 5 Inception-C blocks. We randomly selected 448 of the 2,211 clips for training and 643 clips for testing, and these clips contained each singer's singing voice in different scenarios. We trained each network for 100 epochs with the Adam optimizer, a learning rate of 0.00001, and a batch size of 8. To compare the performance with other models, we also trained the model on the MUSDB18 and MIR-1K datasets for monaural singing voice separation.
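The reported optimizer settings translate directly to PyTorch, as in the sketch below; the stand-in model, loss, and data loader are placeholders and not the proposed network.

```python
import torch
import torch.nn as nn

# Hyperparameters reported above
EPOCHS, BATCH_SIZE, LR = 100, 8, 1e-5

# 'model' stands in for the proposed network; a single conv layer is used
# here only so the snippet runs. The DataLoader over the 10-channel clips
# and the multiobjective loss are likewise placeholders.
model = nn.Conv2d(10, 3, kernel_size=3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
criterion = nn.L1Loss()

def train_epoch(train_loader):
    for features, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(features), target)
        loss.backward()
        optimizer.step()
```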

To evaluate the quality of separation, the source-to-distortion ratio (SDR), source-to-interferences ratio (SIR), and sources-to-artifacts ratio (SAR) were taken as objective evaluation criteria [46]. A higher value for each ratio indicates better separation. We used the BSS EVAL toolbox to calculate SDR, SIR, and SAR. We also calculated the normalized SDR (NSDR) provided in the BSS EVAL toolbox.
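Using the reference implementation in mir_eval, these metrics can be computed as follows. Treating NSDR as the SDR improvement of the estimate over the unprocessed mixture is the usual convention and is assumed here; it is not spelled out in the text.

```python
import mir_eval

def evaluate(reference, estimate, mixture):
    """Compute SDR/SIR/SAR with BSS Eval and NSDR as the SDR gain over the mixture.

    reference, estimate, mixture : arrays of shape (n_sources, n_samples)
    """
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(reference, estimate)
    sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(reference, mixture)
    nsdr = sdr - sdr_mix  # assumed NSDR convention: SDR improvement over the mixture
    return sdr, sir, sar, nsdr
```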

5.3 Performance analysis

The performance of the proposed model was evaluated on our dataset, the MUSDB18 dataset, and the MIR-1K dataset. Table 2 shows the ablation study on our dataset. The model trained with the mixture, IPD, and ILD for the PSM target achieved better performance than with the SMM target. The model with the 3 directional Inception-ResNet (3DIR) blocks performed better than the others. The performance of the model with 2 directional Inception-ResNet (2DIR) blocks or 1 directional Inception-ResNet (1DIR) block was improved by adding the IPD and ILD to the mixture, while the improvement was not obvious for the model with 3 directional Inception-ResNet blocks. In other words, the model with 3 directional Inception-ResNet blocks is able to extract the spatial features of the spectrogram on its own.

The separation results for the four scenarios are shown in Fig 7. The means of NSDR and SAR in scenario 1 are the lowest. The main reason is that the SNR in scenario 1 is the lowest. The mean NSDR, SIR, and SAR for the other three scenarios were relatively close. The results show that the model can effectively separate the singing voice in different directions.

Table 3 shows the separation performance on our dataset with different objectives. The model trained with the LossM, LossP, and LossC objectives achieved higher NSDR, SIR, and SAR than with only two objectives. The LossC objective significantly improved the mean NSDR and SAR on the dataset. Table 4 compares the separation performance of our proposed model with the U-Net and Demucs models. Both comparison models were trained on SMM, and the U-Net model was trained with a LossM objective as in [4]. The results show that the proposed model achieved higher NSDR, SIR, and SAR than U-Net and Demucs. Due to the distortion of the singing voice recorded by the robot, the Demucs model, which is good at separating undistorted singing voices, did not achieve a higher separation performance.

Table 3. Singing voice mean score on the dataset with different objectives.

https://doi.org/10.1371/journal.pone.0289453.t003

Table 4. Singing voice mean score on the dataset for different models.

https://doi.org/10.1371/journal.pone.0289453.t004

To compare the performance of Inception-ResUNet with other models, we also trained the Inception-ResUNet model on the MUSDB18 dataset and MIR-1K dataset with 2 directional Inception-ResNet blocks, PSM, and LossM + LossP. Table 5 shows the comparison of Inception-ResUNet with other models for SDR, SIR, and SAR on the MUSDB18 dataset.

As shown in Table 5, the proposed model achieved 7.85 dB on the vocal SDR category and 13.66 dB on the accompaniment SDR category on the MUSDB18 dataset, which outperforms Open-Unmix, E-MRP-CNN, and D3Net. As shown in Table 6, the proposed model achieved 12.73 dB on the vocal NSDR category and 12.53 dB on the accompaniment NSDR category on the MIR-1K dataset, which outperforms E-MRP-CNN, U-Net, and RPCA-DRNN.

Table 5. Comparison of SDR, SIR and SAR of other methods and Inception-ResUNet on the MUSDB18 dataset.

https://doi.org/10.1371/journal.pone.0289453.t005

Table 6. Comparison of NSDR, SIR and SAR of other methods and Inception-ResUNet on the MIR-1K dataset.

https://doi.org/10.1371/journal.pone.0289453.t006

The results of the real-time performance evaluation of the model are shown in Fig 8. A clip with a duration of 6.11 seconds was separated 30 times on two different GPUs. The processing time was less than 0.68 seconds on a GeForce RTX 2080Ti (Linux) and less than 3.0 seconds on a Quadro RTX 3000 (Windows), both of which are much shorter than the duration of the clip.

Fig 8. Average processing time (s) on different GPUs of clip “amy_4_02”.

https://doi.org/10.1371/journal.pone.0289453.g008

6 Discussion

The separation performance of most monaural singing voice separation methods degrades when separating distorted singing voices. The main reason is that the distortion of the spectrogram is also proportionally preserved. Unfortunately, the multichannel mixture recorded by ordinary robots is distorted, as shown at position 1 in Fig 9. A model trained with LossM and LossP [4, 25, 49] preserves the distortion, as shown at position 2 in Fig 9. When a model is trained with multiple objectives, improvements in one objective can degrade the others. The proposed model trained with LossM + LossP + LossC did not achieve the best separation performance on the MUSDB18 dataset; our experiments showed that it had a lower SDR there. However, when used to separate distorted singing voices, LossC significantly improved the separation performance: it reduced the distortion of the singing voice, and this benefit outweighed the SDR reduction.

Fig 9. LossC corrects the distortion of singing voice separation.

https://doi.org/10.1371/journal.pone.0289453.g009

7 Conclusion

In this paper, we proposed a novel model, 3D Inception-ResUNet, for separating the multichannel singing voice with distortion. We trained the proposed model with multiple objectives: magnitude correlation consistency loss, magnitude consistency loss, and phase consistency loss. We recorded multichannel singing voices on robots and produced a 10-channel dataset to test multichannel singing voice separation algorithms. The output of the proposed model was a set of singing voice masks that could be used to transform the magnitude and phase spectrogram of the mixture into the singing voice. The experimental results show that the proposed model achieved higher performance on multichannel singing voice separation with distortion.


References

  1. Cobos M, Ahrens J, Kowalczyk K, Politis A. An overview of machine learning and other data-based methods for spatial audio capture, processing, and reproduction. EURASIP Journal on Audio, Speech, and Music Processing. 2022;1: 1–21.
  2. Gannot S, Vincent E, Markovich-Golan S, Ozerov A. A Consolidated Perspective on Multi-Microphone Speech Enhancement and Source Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2017;4: 692–730.
  3. Wang DL, Chen JT. Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2018;10: 1702–1726. pmid:31223631
  4. Jansson A, Humphrey E, Montecchio N, Bittner R, Kumar A, Weyde T. Singing voice separation with deep U-Net convolutional networks. In Proceedings of the 18th International Society for Music Information Retrieval Conference. 2017;10: 323–332.
  5. Qian J, Liu X, Yu Y, Li W. Stripe-Transformer: deep stripe feature learning for music source separation. EURASIP Journal on Audio, Speech, and Music Processing. 2023;1: 2.
  6. Heymann J, Drude L, Haeb-Umbach R. Neural network based spectral mask estimation for acoustic beamforming. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing. 2016;3: 196–200.
  7. Chen YS, Lin ZJ, Bai M. A multichannel learning-based approach for sound source separation in reverberant environments. EURASIP Journal on Audio, Speech, and Music Processing. 2021 Nov 20.
  8. Huang PS, Kim M, Hasegawa-Johnson M, Smaragdis P. Singing-voice separation from monaural recordings using deep recurrent neural networks. In 15th International Society for Music Information Retrieval Conference. 2014;10: 477–482.
  9. Chandna P, Miron M, Janer J, Gómez E. Monoaural Audio Source Separation Using Deep Convolutional Neural Networks. Lecture Notes in Computer Science. 2017;2: 258–266.
  10. Geng HB, Hu Y, Huang H. Monaural Singing Voice and Accompaniment Separation Based on Gated Nested U-Net Architecture. Symmetry. 2020;6: 1051.
  11. Hennequin R, Khlif A, Voituret F, Moussallam M. SPLEETER: a fast and state-of-the-art music source separation tool with pre-trained models. The Journal of Open Source Software. 2020;50: 1–4.
  12. Vincent E, Gribonval R, Fevotte C. Hybrid speech recognition with deep bidirectional LSTM. In 2013 IEEE Workshop on Automatic Speech Recognition and Understanding. 2013;12: 273–278.
  13. Pascual S, Bonafonte A, Serra J. SEGAN: Speech enhancement generative adversarial network. In Proceedings of the 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden. 2017 Mar 28.
  14. Fan ZC, Lai YL, Jang J. SVSGAN: Singing voice separation via generative adversarial networks. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, Calgary, Alberta, Canada. 2018;4: 726–730.
  15. Wang ZQ, Wang DL. Integrating Spectral and Spatial Features for Multi-Channel Speaker Separation. Interspeech. 2018;9: 2718–2722.
  16. Heymann J, Drude L, Chinaev A, Haeb-Umbach R. BLSTM supported GEV beamformer front-end for the 3rd CHiME challenge. 2015 IEEE Workshop on Automatic Speech Recognition and Understanding. 2016 Feb 11.
  17. Souden M, Araki S, Kinoshita K, Nakatani T, Sawada H. A multichannel MMSE-based framework for speech source separation and noise reduction. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2013;9: 1913–1928.
  18. Yilmaz O, Rickard ST. Blind separation of speech mixtures via time-frequency masking. IEEE Transactions on Signal Processing. 2004 Jun 21.
  19. Wang Y, Narayanan A, Wang D. On Training Targets for Supervised Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2014;12: 1849–1858. pmid:25599083
  20. Erdogan H, Hershey JR, Watanabe S, Roux JL. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. 2015;4: 708–712.
  21. Williamson DS, Wang Y, Wang DL. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2016;3: 483–492. pmid:27069955
  22. Leglaive S, Badeau R, Richard G. Multichannel audio source separation with probabilistic reverberation modeling. In 2015 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. 2015;10: 1–5.
  23. Wang Y, Wang D. Towards Scaling Up Classification-Based Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2013;7: 1381–1390.
  24. Stoter FR, Uhlich S, Liutkus A, Mitsufuji Y. Open-unmix: a reference implementation for music source separation. The Journal of Open Source Software. 2019;4: 1667.
  25. Yuan W, Dong B, Wang S, Unoki M. Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021;1: 807–822.
  26. Simon R, Francisco M, Alexandre D. Hybrid Transformers for Music Source Separation. arXiv:2211.08553v1. Available from: https://doi.org/10.48550/arXiv.2211.08553.
  27. Kong Q, Cao Y, Liu H, et al. Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation. arXiv:2109.05418. 2021. Available from: https://doi.org/10.48550/arXiv.2109.05418.
  28. Chen MY, Wei W, Chao HC, et al. Robotic musicianship based on least squares and sequence generative adversarial networks. IEEE Sensors Journal. 2021;18: 17646–17654.
  29. Scimeca L, Ng C, Iida F. Gaussian process inference modelling of dynamic robot control for expressive piano playing. PLoS ONE. 2020 Aug 14. pmid:32797107
  30. Lee M, Lee K, Lee M, et al. Dance motion generation by recombination of body parts from motion source. Intelligent Service Robotics. 2018;11: 139–148.
  31. Latupeirissa AB, Bresin R. PepperOSC: enabling interactive sonification of a robot's expressive movement. Journal on Multimodal User Interfaces. 2023 Sep 9.
  32. Pluta MJ, Tokarczyk D, Wiciak J. Application of a Musical Robot for Adjusting Guitar String Re-Excitation Parameters in Sound Synthesis. Applied Sciences. 2022;3: 1659.
  33. Wang CQ. Interactive Display of New Media's Intelligent Robots for the Music Culture Industry. Mobile Information Systems. 2022;2022: 5386819.
  34. Engstrom M. Audio Beat Detection with Application to Robot Drumming. Portland State University. 2019 Jan 10.
  35. Qin R, Zhou C, Zhu H, et al. A music-driven dance system of humanoid robots. International Journal of Humanoid Robotics. 2018;5: 1850023.
  36. Okamoto T, Shiratori T, Kudoh S, et al. Toward a dancing robot with listening capability: keypose-based integration of lower-, middle-, and upper-body motions for varying music tempos. IEEE Transactions on Robotics. 2014;3: 771–778.
  37. Bando Y, Masuyama Y, Sasaki Y, et al. Robust auditory functions based on probabilistic integration of music and CGMM. IEEE Access. 2021;9: 38718–38730.
  38. Chu Y. Recognition of musical beat and style and applications in interactive humanoid robot. Frontiers in Neurorobotics. 2022;16: 875058. pmid:35990882
  39. Byambatsogt G, Choimaa L, Koutaki G. Guitar chord sensing and recognition using multi-task learning and physical data augmentation with robotics. Sensors. 2020;21: 6077. pmid:33114599
  40. Jung YH, Jeong SW. Development of content for the Robot that Relieves depression in the Elderly using music therapy. The Journal of the Korea Contents Association. 2015;2: 74–85.
  41. Araki S, Sawada H, Mukai R, Makino S. Underdetermined blind sparse source separation for arbitrarily arranged multiple sensors. Signal Processing. 2007;8: 1833–1847.
  42. Szegedy C, Ioffe S, Vanhoucke V, Alemi AA. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the AAAI Conference on Artificial Intelligence. 2017;5: 1.
  43. Szegedy C, Vanhoucke V, Ioffe S, Shlens J. Rethinking the Inception Architecture for Computer Vision. In IEEE Conference on Computer Vision and Pattern Recognition. 2016;6: 2818–2826.
  44. Stergiou A, Poppe R, Kalliatakis G. Refining activation downsampling with SoftPool. In 18th IEEE/CVF International Conference on Computer Vision. 2021;10: 10337–10346.
  45. Hsu CL, Jang JSR. On the Improvement of Singing Voice Separation for Monaural Recordings Using the MIR-1K Dataset. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2009;1: 310–339.
  46. Vincent E, Gribonval R, Fevotte C. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing. 2006;4: 1462–1469.
  47. Takahashi N, Mitsufuji Y. D3Net: Densely connected multidilated DenseNet for music source separation. 2020 Oct 5.
  48. Lai WH, Wang SL. RPCA-DRNN technique for monaural singing voice separation. EURASIP Journal on Audio, Speech, and Music Processing. 2022;1: 4.
  49. Kong QQ, Cao Y, Liu HH, Choi K, Wang YX. Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation. In Proceedings of the 22nd International Society for Music Information Retrieval Conference. 2021 Sep 12.