Deep causal speech enhancement and recognition using efficient long-short term memory Recurrent Neural Network

Long short-term memory (LSTM) has been effectively used to represent sequential data in recent years. However, LSTM still struggles with capturing the long-term temporal dependencies. In this paper, we propose an hourglass-shaped LSTM that is able to capture long-term temporal correlations by reducing the feature resolutions without data loss. We have used skip connections in non-adjacent layers to avoid gradient decay. In addition, an attention process is incorporated into skip connections to emphasize the essential spectral features and spectral regions. The proposed LSTM model is applied to speech enhancement and recognition applications. The proposed LSTM model uses no future information, resulting in a causal system suitable for real-time processing. The combined spectral feature sets are used to train the LSTM model for improved performance. Using the proposed model, the ideal ratio mask (IRM) is estimated as a training objective. The experimental evaluations using short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ) have demonstrated that the proposed model with robust feature representation obtained higher speech intelligibility and perceptual quality. With the TIMIT, LibriSpeech, and VoiceBank datasets, the proposed model improved STOI by 16.21%, 16.41%, and 18.33% over noisy speech, whereas PESQ is improved by 31.1%, 32.9%, and 32%. In seen and unseen noisy situations, the proposed model outperformed existing deep neural networks (DNNs), including baseline LSTM, feedforward neural network (FDNN), convolutional neural network (CNN), and generative adversarial network (GAN). With the Kaldi toolkit for automated speech recognition (ASR), the proposed model significantly reduced the word error rates (WERs) and reached an average WER of 15.13% in noisy backgrounds.


Introduction
Speech enhancement is a signal processing technique that aims to improve the quality and intelligibility of speech signals that are degraded by various types of noise, such as background noise, reverberation, and channel distortions. In practice, speech enhancement techniques typically operate in the time-frequency domain, where the speech signal is represented as a sequence of short-time Fourier transforms (STFTs). By analyzing the speech signal in the frequency domain, it is possible to identify and isolate the components that are corrupted by noise while preserving the underlying speech components. Speech enhancement is a crucial component in many applications, such as hearing aids, telecommunication systems, and speech recognition systems. By improving the quality and intelligibility of speech signals, speech enhancement techniques can significantly enhance the performance and usability of these systems. Many different signal processing techniques can be used for speech enhancement, such as spectral subtraction [1], Wiener filtering [2], and non-negative matrix factorization (NMF) [3]. These techniques aim to reduce or remove the noise component from the speech signal while preserving the speech content. One of the key challenges in speech enhancement is to distinguish between the desired speech signal and the noise component. This is particularly challenging in the presence of non-stationary noise, which can vary in both the time and frequency domains. To overcome this challenge, speech enhancement systems often use adaptive algorithms that track the changes in the noise statistics and adjust the filtering parameters accordingly. The performance of speech enhancement systems is typically evaluated using objective measures, such as signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ), as well as subjective listening tests. In general, speech enhancement can significantly improve the perceived quality and intelligibility of speech signals, particularly in noisy environments.
Deep learning techniques have shown considerable promise in improving speech enhancement performance in non-stationary noisy environments, where the characteristics of the noise may change over time [4][5][6], and have also proven effective in other applications [7][8][9][10]. Deep neural networks (DNNs) are effective models for speech enhancement because they can learn the nonlinear relationship between input and output features. In particular, deep learning-based speech enhancement models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Deep Neural Networks (DNNs), can learn to extract features that are robust to various types of noise and can adapt to changing noise conditions over time. Deep learning-based speech enhancement models have shown significant improvements in speech quality and intelligibility, particularly in challenging environments, such as noisy speech in cars, on cell phones, or in crowded public places. There are two main types of DNN-based speech enhancement algorithms: masking-based [11,12] and mapping-based [13][14][15]. Masking-based algorithms have been found to be more effective because they estimate time-frequency (T-F) masks as training targets, which can better track the target speaker and produce better de-noising results. Fully connected feedforward DNNs (FDNNs) have been commonly used in speech enhancement, but they are limited by short context windows and cannot capture long-term context information. Multi-layer networks are used in DNN-based speech enhancement methods to overcome this limitation and provide better performance in non-stationary noisy environments. Overall, DNN-based speech enhancement techniques can provide superior de-noising results without requiring statistical features or distribution assumptions. However, they require large amounts of training data and computational resources, which can be a limitation in some applications.
Recurrent Neural Networks (RNNs) are a type of neural network that can process sequential data and capture long-range temporal dependencies. They are particularly well-suited for tasks that involve variable-length sequences of data, such as speech waveforms, text, and time series. RNNs have also been successfully used for natural language processing (NLP) tasks, such as speech recognition and dialogue modeling. For example, in speech recognition, RNNs can be used to model the relationship between an input speech waveform and its corresponding text transcription. According to research [16,17], it is preferable to structure speech enhancement as a sequence-to-sequence process in order to regulate long-term context windows. RNNs [18], CNNs [19], and GANs [20] have been presented in which the networks are trained and evaluated with various noise types and speakers of both genders. The authors of [16] propose a four-hidden-layer LSTM model for speaker generalization. Regarding speech intelligibility, their findings demonstrated that the LSTM model generalized better to untrained speakers and significantly outperformed a DNN-based model. Numerous studies demonstrate that, with sequence-to-sequence processing, LSTM can successfully manage long-term context windows and be effective in speech enhancement (SE) [21,22]. The difficulty of capturing long-term dependencies is a crucial obstacle that RNN models face when attempting to model extended sequences of input data. In addition, training RNNs via Back Propagation Through Time (BPTT) exposes gradients to vanishing and explosion. LSTM [23,24] and the gated recurrent unit (GRU) [25,26] are examples of RNN variations that use unique transition functional units and optimization strategies to address these difficulties. Layered RNNs [27] and skip RNNs are two of the existing focused architectures [16]. A causal dynamic model using an attention LSTM encoder-decoder is proposed for SE with excellent noise reduction and speech recognition results [28]. A time-domain brain-assisted speech enhancement model incorporates electroencephalography signals to extract the target speaker from monaural speech mixtures; this SE model is based on the fully convolutional time-domain network [29]. Another study [30] proposes a cooperative attention-based speech enhancement model that combines local and non-local attention operations in a learnable and self-adaptive manner. The study [31] proposes a multiscale attention metric generative adversarial network that addresses the mismatch involving the objective function used to train speech enhancement models and introduces an attention mechanism in the metric discriminator. Another study uses a convolutional attention transformer bottleneck in the encoder-decoder framework for speech enhancement and obtains better SE and automatic speech recognition results [32].
In this paper, we describe LSTM models that are capable of capturing long-term temporal correlations and avoiding gradient decay across layers. The main contributions of this study are as follows. (i) An hourglass-shaped LSTM model is proposed that captures long-term temporal sequence-to-sequence data and decreases feature resolutions in the layers without data loss. (ii) Skip connections are introduced between non-adjacent layers to avoid gradient decay. (iii) An attention gate is utilized in the skip connections to suppress irrelevant input and emphasize the critical spectral regions of the features. (iv) Combined feature sets are extracted from the noisy speech to train the LSTM models reliably. (v) The IRM is estimated as the training target for suppressing the additive noise in the target speech in order to obtain higher-quality and more intelligible speech.
The remainder of this paper is organized as follows. The proposed speech enhancement system is explained in Section 2. Experiments and setups are presented in Section 3. Results and discussions are presented in Section 4. Finally, conclusions are drawn in Section 5.

Problem formulation
Consider a clean speech signal s(t) that is degraded by additive background noise d(t), so that the resultant noisy speech is y(t) = s(t) + d(t). Using the short-time Fourier transform (STFT), the noisy speech y(t) is transformed into the frequency domain, yielding the magnitude representation |Y(f, t)|, where t represents the frame index and f represents the frequency index. A combined set of acoustic features is extracted to train the LSTM model reliably.
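As an illustration of this formulation, the following minimal NumPy sketch forms y(t) = s(t) + d(t) and computes the magnitude spectrogram |Y(f, t)|. The 20 ms frame length and 10 ms shift follow the feature-extraction settings reported in the Features combination section; the 16 kHz sampling rate, the Hann window, the 512-point FFT (which yields 257 frequency bins, the same number as the output units of the network), the synthetic signals, and the function name `stft_magnitude` are assumptions made for illustration only.

```python
import numpy as np

def stft_magnitude(y, fs=16000, frame_ms=20, shift_ms=10, n_fft=512):
    """Return |Y(f, t)| with shape (n_fft // 2 + 1, n_frames).

    Assumed settings: Hann window and zero-padded n_fft-point FFT.
    """
    frame_len = int(fs * frame_ms / 1000)   # 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # 160 samples at 16 kHz
    n_frames = 1 + (len(y) - frame_len) // shift
    window = np.hanning(frame_len)
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([
        y[i * shift: i * shift + frame_len] * window
        for i in range(n_frames)
    ])
    # One-sided FFT of each frame; rows are frames, columns are bins.
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    return np.abs(spectrum).T                # (frequency, frame)

# Synthetic stand-ins for clean speech s(t) and noise d(t).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
s = np.sin(2 * np.pi * 440.0 * t)
d = 0.1 * rng.standard_normal(t.shape)
y = s + d                                    # additive noise model
Y_mag = stft_magnitude(y)
print(Y_mag.shape)
```

The phase of the STFT is discarded here because only the magnitude is used for feature extraction and masking; the noisy phase is needed again only at waveform reconstruction.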

Proposed LSTM architecture
LSTMs are capable of capturing information from speech waveforms, which are essentially long-term temporal sequences. The proposed network circumvents the constraints of conventional RNNs using the following approach, which is influenced by the research of Abdulbaqi [33]. First, the LSTM layers are organized using an hourglass-shaped design. In the top pyramid (from the first two layers to the third layer), the number of time steps decreases as the number of neurons increases. Similarly, in the bottom pyramid (from the third layer to the final two layers), the number of time steps increases while the number of neurons decreases. Instead of the fixed neurons and time steps of a typical LSTM, this alternative technique produces a compact and effective model for speech enhancement. The layer outputs are reshaped to favor fewer time steps. Reshaping the layer output to decrease and then increase the time steps avoids data loss and lets the model keep a suitable number of neurons. With these architectural modifications, the model can manage high-resolution features without exceeding memory capacity and with fewer network parameters. Second, skip connections are used between layers of similar shape in the top and bottom pyramids. Thus, gradient decay across layers is mitigated. Third, an attention gate is used in the skip connections to emphasize significant spectral regions. The speech spectrum contains formants that predominate in low-frequency regions and are sparsely distributed in high-frequency regions. Consequently, it is essential to weight the various spectral regions differently using an attention gate. Fig 2 depicts the network's five LSTM layers and two attention skip connections. Table 1 lists the time steps and units. The network learns the nonlinear relationship and converts the noisy speech signal y(t) into a clean speech signal. The LSTM cells are governed by the following gate equations, in which the forget gate is important.
In these equations, W_i, W_f, and W_o are the weight matrices of the input, forget, and output gates associated with the hidden states, x_t is the input at the current timestamp, h_{t-1} is the hidden state of the previous timestamp, C_{t-1} and C_t are the cell states at the previous and current timestamps, respectively, and b_i, b_f, and b_o are the bias terms of the input, forget, and output gates, respectively. In the architecture, LSTMs are favored over plain RNNs because of their gated structure, superior training behavior, and superior SE performance. This compact LSTM design enhances network capacity by sharing the hidden states across the similar and bottom layers. Decreasing the time steps while increasing the units (from the upper layers to the middle layer), and then increasing the time steps while decreasing the units (from the middle layer to the bottom layers), allows for a more accurate representation.
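For reference, the gate computations can be written in the standard LSTM form using the symbols defined above. This is a sketch under the assumption that the model follows the standard formulation; the candidate-state parameters W_c and b_c, the candidate cell state \tilde{C}_t, the logistic sigmoid \sigma, and the elementwise product \odot are notation introduced here for completeness.

```latex
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{C}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

In this form, the forget gate f_t scales the previous cell state C_{t-1} directly, so it controls how much long-term context is carried from one timestamp to the next.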
The LSTM layers share their hidden states, hence the hidden states of an LSTM unit in layer l at time t are obtained by concatenating its hidden states, which are dependent on the lower layer l-1 at time t and this layer at time t-1.Before the skips, the hidden states of the top and lower layers are merged to form a final output with the same size as the input vectors.
The output is created by combining the hidden states of all layers, where X indicates the output from the last layer and h^5_T indicates the hidden states of the last (fifth) layer in the architecture. To avoid gradient decay over the layers, two skip connections are added. The skips support deep training and effective generalization by combining low-level features with high-level features. Speech spectra include different frequency components; the formants are usually dominant in the low-frequency regions and sparsely distributed in the high-frequency regions. Hence, it is important to weight different spectral regions differently by using an attention process, so that the important regions and features are emphasized to improve the quality of the output.
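To make the hourglass reshaping and the attention-gated skip connection concrete, the following NumPy sketch folds pairs of adjacent time steps into the unit axis (halving the time steps while doubling the units, so that no values are discarded), reverses the operation, and applies a sigmoid attention gate to a skip connection. The folding factor of 2, the array sizes, the additive combination of skip and deep features, and the single-matrix gate `W_att` are illustrative assumptions; the actual time steps and units are those given in Table 1, and the exact gate formulation of the model is not reproduced here.

```python
import numpy as np

def fold_time(h, factor=2):
    """(T, U) -> (T // factor, U * factor): fewer time steps, more units."""
    T, U = h.shape
    assert T % factor == 0, "time steps must be divisible by the factor"
    return h.reshape(T // factor, U * factor)

def unfold_time(h, factor=2):
    """(T, U) -> (T * factor, U // factor): inverse of fold_time."""
    T, U = h.shape
    assert U % factor == 0, "units must be divisible by the factor"
    return h.reshape(T * factor, U // factor)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_skip(skip, deep, W_att, b_att):
    """Weight the skip features with a gate in (0, 1), then add them."""
    gate = sigmoid(skip @ W_att + b_att)
    return deep + gate * skip

rng = np.random.default_rng(0)
h_top = rng.standard_normal((8, 16))     # 8 time steps, 16 units
h_mid = fold_time(h_top)                 # 4 time steps, 32 units
h_bottom = unfold_time(h_mid)            # stand-in for a bottom-pyramid output

# Hypothetical gate parameters; in a real model these would be learned.
W_att = 0.1 * rng.standard_normal((16, 16))
b_att = np.zeros(16)
out = attention_skip(h_top, h_bottom, W_att, b_att)
print(h_mid.shape, out.shape)
```

Because the reshape only regroups values, `unfold_time(fold_time(h))` returns the original array exactly, which is the sense in which the resolution is reduced without data loss.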

Features combination
At the frame level, the feature sets are obtained from the speech signals. The frame shift and frame length are set to 10 ms and 20 ms, respectively. The feature sets comprise 31-dimensional Mel-Frequency Cepstral Coefficients (MFCC), 64-dimensional Gammatone Filter-bank Energies (GFE), 15-dimensional Amplitude Modulation Spectrogram (AMS) features, and 13-dimensional Relative Spectral Transform Perceptual Linear Prediction (RASTA-PLP) coefficients. Here, d denotes the dimensions of the features, and f_S and f_Y are the combined feature vectors of clean and noisy speech, respectively. The gammatone filter-bank energy features are derived from cochleagrams, a T-F representation often employed in computational auditory scene analysis (CASA) that models the operation of the human auditory system. A filter bank of 64 channels is used to generate the cochleagrams. The delta features are also calculated and appended to the features. Table 2 briefly compares the models in terms of features, training objective, DNN type, and loss function.
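The per-frame concatenation can be sketched as follows. The dimensions are those listed above (31 + 64 + 15 + 13 = 123 per frame); the random arrays stand in for real feature extractors, and the first-order difference used for the delta features, the frame count, and the function names are assumptions made for illustration.

```python
import numpy as np

# Feature dimensions as listed in the text.
FEATURE_DIMS = {"MFCC": 31, "GFE": 64, "AMS": 15, "RASTA-PLP": 13}

def combine_features(feature_sets):
    """Concatenate the per-frame feature sets along the feature axis."""
    return np.concatenate(
        [feature_sets[name] for name in FEATURE_DIMS], axis=1
    )

def add_deltas(features):
    """Append causal first-order differences (assumed delta definition).

    The delta of the first frame is zero because no earlier frame exists.
    """
    deltas = np.diff(features, axis=0, prepend=features[:1])
    return np.concatenate([features, deltas], axis=1)

rng = np.random.default_rng(0)
n_frames = 100
# Placeholder features; real ones would come from the noisy speech.
feature_sets = {
    name: rng.standard_normal((n_frames, dim))
    for name, dim in FEATURE_DIMS.items()
}
f_Y = combine_features(feature_sets)     # (100, 123)
f_Y_delta = add_deltas(f_Y)              # (100, 246)
print(f_Y.shape, f_Y_delta.shape)
```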

Datasets
Various tests are undertaken using speech sentences selected from the TIMIT [34], LibriSpeech [35], and VoiceBank [36] datasets to evaluate SE performance. LibriSpeech comprises 1000 hours of speech data at a 16 kHz sampling rate. TIMIT also contains phonetically balanced speech data at a sampling rate of 16 kHz. VoiceBank is composed of male and female speakers of the English language. In our research, only clean speech samples from these databases are utilized. The Aurora-4 database [37], NOISEX-92 database [38], and DEMAND database [39] are selected to obtain background noises for evaluating the proposed speech enhancement methods. Four input SNRs (-8 dB, -4 dB, 0 dB, and 4 dB) are utilized to create noisy sentences. To train the proposed LSTM network to estimate the T-F mask, sentences from VoiceBank, TIMIT, and LibriSpeech are used. For better speaker generalization, the training sentences include male and female speakers combined with all noise sources. Consequently, a large quantity of speech sentences is selected for model training. In addition, a separate set of speech sentences is drawn at random from the three databases (TIMIT, LibriSpeech, and VoiceBank) for model testing. Two noise sources (factory2 and café) are excluded from training and are termed unseen noises.
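Creating a noisy sentence at a prescribed input SNR amounts to scaling the noise before adding it to the clean speech. The sketch below shows one common way to do this with NumPy; the function name `mix_at_snr`, the use of average power over the whole utterance, and the synthetic signals are assumptions, not details taken from the paper.

```python
import numpy as np

SNRS_DB = (-8, -4, 0, 4)   # input SNRs used in the experiments

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB.

    Assumes `noise` is at least as long as `clean`.
    Returns the noisy signal and the scaled noise.
    """
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Solve 10*log10(clean_power / (scale**2 * noise_power)) = snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    scaled_noise = scale * noise
    return clean + scaled_noise, scaled_noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)       # stand-in for a clean utterance
noise = rng.standard_normal(20000)       # stand-in for a noise recording
mixtures = {snr: mix_at_snr(clean, noise, snr) for snr in SNRS_DB}
```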

Network setting
In this article, a five-layered LSTM network is used, where the input layer has a size of 1230 dimensions using a context window of 11 frames. Every LSTM layer comprises N units and M time steps, while the output layer consists of 257 units. Backpropagation through time (BPTT) is employed during training. Optimization is performed using adaptive gradient descent (AGD) with momentum. There are 512 samples in each batch. During processing, the AGD scaling factor is fixed at 0.0010, whereas the learning rate is reduced linearly from 0.06 to 0.002. There are 80 epochs in all. The momentum is set at 0.4 for the first epochs and is then raised to 0.8 for subsequent epochs. Dropout regularization is implemented with a dropout rate of 0.2. During mask estimation, the MSE loss function is applied. The LSTM models do not employ future information, which makes them causal systems.
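The learning-rate and momentum schedules described above can be sketched in a few lines of Python. The endpoints (0.06 to 0.002 over 80 epochs, momentum 0.4 then 0.8) are from the text; applying the decay once per epoch and the number of initial low-momentum epochs (`warmup_epochs=5`) are hypothetical choices, since the text does not specify them.

```python
def learning_rate(epoch, n_epochs=80, lr_start=0.06, lr_end=0.002):
    """Linear decay from lr_start at epoch 0 to lr_end at the last epoch."""
    frac = epoch / (n_epochs - 1)
    return lr_start + frac * (lr_end - lr_start)

def momentum(epoch, warmup_epochs=5, low=0.4, high=0.8):
    """Low momentum for the first epochs, higher momentum afterwards.

    warmup_epochs is a hypothetical value chosen for illustration.
    """
    return low if epoch < warmup_epochs else high

# (epoch, learning rate, momentum) for all 80 epochs.
schedule = [(e, learning_rate(e), momentum(e)) for e in range(80)]
for epoch, lr, mom in schedule[:3]:
    print(f"epoch {epoch}: lr={lr:.5f}, momentum={mom}")
```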

Evaluation metrics
Experiments use two objective metrics to assess the proposed speech enhancement method. STOI (short-time objective intelligibility) and PESQ (perceptual evaluation of speech quality) measure intelligibility and quality, respectively. PESQ, defined in ITU-T Recommendation P.862 [40], assesses the perceptual quality of speech with scores between -0.5 and 4.5. STOI [41] assesses the intelligibility of speech with values from 0.00 to 1.00. A monotonic nonlinear mapping is applied to the STOI values to obtain the predicted intelligibility scores (percentage of correct words) in this study. In the two metrics, c = -17.49 and d = 9.692 are the parameters of the STOI mapping; α0 = 4.5, α1 = -0.1, and α3 = -0.039 are the PESQ coefficients; and D_1 and D_2 denote the symmetric and asymmetric disturbances, respectively.
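The two expressions referred to above can be sketched as small Python functions. The logistic form of the STOI-to-intelligibility mapping and the linear combination of disturbances for PESQ are the forms commonly used for these metrics and are assumed here; the constants are the values stated in the text, and the function and argument names are hypothetical.

```python
import math

def stoi_to_percent_correct(stoi, c=-17.49, d=9.692):
    """Assumed logistic mapping from a STOI value to % correct words."""
    return 100.0 / (1.0 + math.exp(c * stoi + d))

def pesq_from_disturbances(d_sym, d_asym, a0=4.5, a1=-0.1, a_asym=-0.039):
    """Assumed linear PESQ form: a0 + a1 * D_1 + a_asym * D_2.

    d_sym is the symmetric disturbance D_1 and d_asym the asymmetric
    disturbance D_2; a_asym is the coefficient the text calls alpha_3.
    """
    return a0 + a1 * d_sym + a_asym * d_asym

for value in (0.5, 0.6, 0.7, 0.8):
    print(value, round(stoi_to_percent_correct(value), 2))
```

Because c is negative, the mapping increases monotonically with STOI, and with zero disturbance the PESQ expression returns its maximum of 4.5.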

Representation of algorithm
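The SE systems are named so that each label indicates the neural network type, whether skip connections are absent, present, or attention-gated, and the mask type (for example, LSTM-NoSkips-IRM, LSTM-WithSkips-IRM, and LSTM-AttenSkips-IRM).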

Results and discussions
Table 4 presents an evaluation of the proposed SE using STOI in three seen noises. The proposed LSTM model using the combined features and attention skips outscored the networks that use no skips or skips without attention. With the proposed model, we observed better STOI (intelligibility) and PESQ (quality) than the counterparts and the unprocessed noisy speech. For example, LSTM-AttenSkips-IRM improved the STOI by 7.7% over unprocessed speech (UNP stands for noisy speech) at -8 dB babble noise. Similarly, LSTM-AttenSkips-IRM increased STOI by 23.9% over unprocessed speech at -4 dB car noise. Also, at 0 dB factory noise, LSTM-AttenSkips-IRM increased STOI by 20.2% over unprocessed noisy speech. In comparison to LSTM-WithSkips-IRM, the proposed model with attention skips improved the STOI by 2.1% at -8 dB babble noise. Also, the proposed model with attention skips improved the STOI by 9.1% over LSTM-NoSkips-IRM at -8 dB babble noise. As a whole, LSTM-AttenSkips-IRM outperformed the other models and increased the average STOI over unprocessed noisy speech as well as SNRs by 1.23%. Table 5 evaluates the proposed SE models in terms of PESQ for three seen noise types with IRM as the estimated training target. For PESQ, the proposed LSTM model with combined feature sets and attention skips outscored the other models that have no skips or skips without an attention mechanism. With the proposed models, we achieved better perceptual speech quality than the counterparts and the noisy speech. For example, in Table 5, LSTM-AttenSkips-IRM improved the PESQ by 0.34 (20.98%) over unprocessed speech at -8 dB factory noise. Similarly, LSTM-AttenSkips-IRM improved the PESQ by 0.54 (26.21%) over unprocessed speech at -4 dB babble noise. Moreover, at 0 dB car noise, LSTM-AttenSkips-IRM improved the PESQ by 1.04 (39.1%) over the noisy speech. In contrast to LSTM-WithSkips-IRM, the proposed model with attention skips improved the PESQ by 0.09 (3.04%) at 4 dB car noise. This indicates that at favorable SNRs (SNR ≥ 4 dB) the two models perform almost similarly. In addition, the proposed model with attention skips improved the PESQ by 0.14 (5.28%) over LSTM-NoSkips-IRM at 4 dB babble noise. Again, LSTM-AttenSkips-IRM outscored the other models and increased the average PESQ score over the unprocessed noisy speech as well as SNRs by 3.07%. The results indicate that LSTM-AttenSkips achieved better PESQ and STOI values. The average PESQ and STOI improvements (PESQi and STOIi) in background noises are depicted in Figs 3 and 4, respectively.
In other sets of experiments, we used the LibriSpeech dataset and the Ideal Binary Mask (IBM) to evaluate the proposed SE models. LibriSpeech is obtained from audiobooks and is composed of 1000 hours of speech sampled at 16 kHz. In the experiments, we selected only clean utterances and mixed them with five noise types (airport, babble, street, cafeteria, and car noise) at the same SNRs. The average PESQ and STOI values using the five noises are given in Table 6. LSTM-WithSkips-IRM and LSTM-WithSkips-IBM increased the average STOI by 16.44% and 14.9% over unprocessed noisy speech. Further, LSTM-AttenSkips-IRM and LSTM-AttenSkips-IBM increased the average PESQ scores by 0.78 (33.19%) and 0.71 (31.14%) over unprocessed noisy speech. We used the VoiceBank dataset to further evaluate the proposed SE models. In these experiments, we again selected only clean utterances and mixed them with seven noise types (airport, babble, street, cafeteria, car, sports field, and well-visited city park noise) at the same SNRs. The average STOI and PESQ scores for the different noises are given in Table 7. LSTM-AttenSkips-IRM and LSTM-AttenSkips-IBM increased the average STOI by 17.21% and 15.4% over unprocessed noisy speech. In addition, LSTM-AttenSkips-IRM and LSTM-AttenSkips-IBM increased the average PESQ by 0.81 (35.22%) and 0.75 (34.31%) over unprocessed noisy speech.

Generalization performance
To examine the generalization of the proposed SE models, Table 8 provides the PESQ and STOI scores in two unseen noise types (factory2 and cafeteria). The proposed SE models outscored the baseline and the competing networks by significant margins in unseen noises. During analysis, it is observed that the proposed LSTM-WithSkips-IRM and LSTM-WithSkips-IBM obtained the highest intelligibility (STOI) and perceptual quality (PESQ) scores, since the network architecture is modified to obtain better results. As the proposed models are trained with robust acoustic feature sets and architectural modifications, their performance does not change drastically between seen and unseen noisy conditions. The average STOI values increased from 63.1% to 78.0% and 76.8% with LSTM-WithSkips-IRM and LSTM-WithSkips-IBM, improving the STOI by 14.9% and 13.7% over unprocessed speech. At low SNRs such as -4 dB and -8 dB, LSTM-WithSkips-IRM and LSTM-WithSkips-IBM increased STOI by 1.90% and 1.80% over the baseline LSTMs (LSTM with IRM and LSTM with IBM). Further, the average PESQ values increased from 1.50 to 2.22 (32.43%) and 2.17 (31.90%) with LSTM-WithSkips-IRM and LSTM-WithSkips-IBM, improving the PESQ significantly over the UNP in unseen noisy conditions. The proposed LSTM models increased STOI by 1.80% and 2.90% over the baseline LSTMs. The proposed models increased PESQ by 0.10 (4.54%) and 0.16 (7.27%) over the baseline LSTMs. The proposed SE models achieved the best performance in unseen noises. The computational load of the proposed model is measured with trainable parameters and FLOPs (floating-point operations), which are useful metrics for assessing computational complexity and optimizing performance on specific hardware platforms. The parameter count and FLOPs for the proposed LSTM model are 26.47M and 127.72 G, whereas those for the baseline LSTM are 53.93M and 245.67 G, respectively, indicating the better performance of the proposed model in terms of model complexity and trainable parameters.
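The figures in this paragraph can be cross-checked with a few lines of arithmetic. In this sketch (helper names are hypothetical), the STOI gains are absolute differences in percentage points, and the reported 32.43% PESQ figure is reproduced when the gain is divided by the enhanced score rather than by the unprocessed score; that normalization is inferred from the numbers and is not stated explicitly in the text.

```python
def absolute_gain(before, after):
    """Difference in score (percentage points for STOI given in %)."""
    return after - before

def gain_relative_to_after(before, after):
    """Gain as a percentage of the enhanced (after) score."""
    return 100.0 * (after - before) / after

# Average STOI (%) in unseen noises: unprocessed vs. enhanced.
stoi_gain_irm = absolute_gain(63.1, 78.0)
stoi_gain_ibm = absolute_gain(63.1, 76.8)

# Average PESQ in unseen noises: unprocessed 1.50 vs. enhanced 2.22.
pesq_gain_irm = gain_relative_to_after(1.50, 2.22)

# Model size and cost of the proposed model relative to the baseline.
param_ratio = 26.47 / 53.93
flop_ratio = 127.72 / 245.67

print(round(stoi_gain_irm, 1), round(stoi_gain_ibm, 1))
print(round(pesq_gain_irm, 2))
print(round(param_ratio, 3), round(flop_ratio, 3))
```

Under these figures, the proposed model uses roughly half the trainable parameters and roughly half the FLOPs of the baseline LSTM.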

Comparisons with other DL methods
This section examines the performance in terms of average STOI and PESQ values obtained by the proposed models and the competing DL models. The experimental results indicate that the proposed LSTM models improved speech quality, intelligibility, noise suppression, and speech distortion, and also outperformed the baseline LSTM [16], DNN [42], CNN [43], GAN (3-layer ReLU MLP) [44], CNN-GRU [45], and FCNN [46]. Table 8 indicates the generalization capabilities of the proposed LSTM and other DL models. All DL models are trained using a similar dataset comprising male and female speakers. The experimental values are averaged over all SNRs (-8 dB, -4 dB, 0 dB, and 4 dB) and noises. The results in Table 9 indicate that the proposed LSTM models increased intelligibility and perceptual speech quality. The LSTM-AttenSkips-IRM and LSTM-AttenSkips-IBM have increased STOI by 4.4% and ... To visualize the spectral regions of the speech processed by the deep learning models and the proposed LSTM models, we show a spectro-temporal analysis.

Automatic Speech Recognition (ASR)
The SE evaluations show that the proposed LSTM models greatly suppressed the background noise and recovered high-quality and intelligible speech. As a result, we expect better speech recognition performance in challenging noisy backgrounds. The proposed SE models are applied at the front end to achieve better ASR results. We used the Kaldi toolkit [47], which uses a GMM-HMM system, and trained deep neural networks with Mel-frequency filter-bank features. The training system is motivated by Tachioka [48]. We evaluated ASR performance in terms of word error rates (WERs). We randomly selected 2000 speech utterances from the TIMIT and LibriSpeech datasets to train the proposed LSTM-based speech enhancement models. With the trained LSTM models, we performed speech enhancement and then synthesized time-domain utterances to create new training and testing datasets. We trained the ASR models using the new training dataset and tested them using the new testing dataset. As given in Table 10, the ASR systems trained with the utterances processed by LSTM-AttenSkips performed better. The WERs gradually decreased as the SNR levels became more favorable. On average, a WER of 19.13% is achieved with the utterances processed by the proposed LSTM-AttenSkips, demonstrating that the proposed SE can be employed as a front end to boost ASR performance.
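For readers unfamiliar with the metric, the WER is the word-level edit distance (substitutions, deletions, and insertions) between the recognized and reference transcripts, divided by the number of reference words. The following self-contained Python sketch illustrates the computation; it is not Kaldi's scoring code, and the function name and example sentences are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)

reference = "she had your dark suit in greasy wash water"
hypothesis = "she had dark suit in greasy wash water all"
wer = word_error_rate(reference, hypothesis)
print(f"WER = {wer:.2f}%")
```

In this example one reference word is deleted and one extra word is inserted, giving two errors over nine reference words.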

Conclusion
In this paper, we propose a speech enhancement algorithm that is based on recurrent neural networks trained with robust acoustic feature sets. An hourglass LSTM model is proposed that successfully captures long-term temporal dependencies by reducing feature resolutions. We used skip connections between the non-adjacent symmetrical layers to prevent gradient decay over the layers. Moreover, an attention mechanism is adopted in the skips to highlight the important features and spectral regions. A combined robust feature set is extracted from the magnitude of the noisy speech to robustly train the proposed models for better performance. Two masks, IRM and IBM, are estimated independently. The results support the following conclusions about the proposed SE algorithm.
By using combined feature learning, the model includes additional information that enables it to better learn the non-linear relation between noisy and clean speech, which is confirmed by the results in Tables 4-8. The proposed LSTM models successfully captured long-term temporal dependencies and reduced the feature resolution by using an hourglass architecture, which is confirmed by the comparison in Table 8. Memory overflow is avoided by using the proposed architecture. The skips and the attention gate in the skips considerably mitigated gradient decay over the layers and also highlighted the important features and spectral regions. The addition of attention gates in the skips obtains better results, as indicated by Tables 6 and 7 on two different databases. With the hourglass strategy, the proposed models performed better than the baseline in terms of trainable parameters (18.89M with the proposed model and 46.18M with the baseline). The proposed models performed better and outscored the recent deep learning models in different noises, as indicated by Table 9. The proposed models also outperformed the related deep-learning methods in unseen noises, as confirmed by Table 8.

The learned parameters estimate the time-frequency mask (IRM), which serves as the training target, during the testing phase. The calculated magnitude mask |M(t, f)| is then multiplied by the magnitude of the noisy speech |Y(t, f)| to suppress the background noise and obtain an estimate of the underlying clean speech magnitude |S(t, f)|. During waveform reconstruction, the predicted magnitude and the noisy phase are combined to generate the enhanced speech. Fig 1 depicts the block diagram of the proposed speech enhancement.
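The masking step can be illustrated with a short NumPy sketch. The IRM definition with a square-root exponent, the 0 dB local criterion for the IBM, the small constant `eps`, and the synthetic spectra are common choices assumed here for illustration; only the multiplication of the mask with the noisy magnitude and the reuse of the noisy phase are taken from the description above. The inverse STFT that would produce the waveform is omitted.

```python
import numpy as np

def ideal_ratio_mask(S_mag, D_mag, beta=0.5, eps=1e-12):
    """Assumed IRM: (|S|^2 / (|S|^2 + |D|^2)) ** beta, values in [0, 1]."""
    s2, d2 = S_mag ** 2, D_mag ** 2
    return (s2 / (s2 + d2 + eps)) ** beta

def ideal_binary_mask(S_mag, D_mag, lc_db=0.0, eps=1e-12):
    """Assumed IBM: 1 where the local SNR exceeds lc_db, else 0."""
    local_snr_db = 10.0 * np.log10((S_mag ** 2 + eps) / (D_mag ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)

def apply_mask(mask, Y_complex):
    """Masked noisy magnitude recombined with the noisy phase."""
    return mask * np.abs(Y_complex) * np.exp(1j * np.angle(Y_complex))

# Synthetic complex spectra standing in for clean speech S and noise D.
rng = np.random.default_rng(0)
shape = (257, 50)                         # (frequency bins, frames)
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
D = 0.5 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
Y = S + D                                 # additive noise in the STFT domain

irm = ideal_ratio_mask(np.abs(S), np.abs(D))
ibm = ideal_binary_mask(np.abs(S), np.abs(D))
S_hat = apply_mask(irm, Y)                # enhanced complex spectrum
print(irm.min() >= 0.0, irm.max() <= 1.0, S_hat.shape)
```

In the actual system the mask is predicted by the trained LSTM from noisy features, since the clean and noise magnitudes used to compute the ideal masks are available only during training.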

Fig 5. The average STOI and PESQ improvements (STOIi and PESQi) of deep learning models over noisy speech. https://doi.org/10.1371/journal.pone.0291240.g005

Fig 6 shows the spectrograms of the utterances. The underlying clean utterance (depicted in Fig 6a) is contaminated with babble noise at 0 dB to create a noisy utterance (depicted in Fig 6b). Babble noise, which originates when many people talk simultaneously, is a difficult noisy situation because the noise signal has attributes similar to those of the underlying clean speech. The enhanced speech produced by the LSTM-IBM is illustrated in Fig 6c, where the background babble noise is considerably reduced. The enhanced speech produced by the LSTM-IRM (depicted in Fig 6d) shows less residual noise and speech distortion than the LSTM-IBM. Fig 6e illustrates the speech enhanced by the LSTM-AttenSkips-IBM, in which minimal speech distortion and residual noise are noticeable. Fig 6f depicts the speech enhanced by the LSTM-AttenSkips-IBM. We can observe that the proposed model reduced the background noise, leaving minimal residual noise, and that the speech is not distorted, as confirmed by the spectrogram of the noisy speech enhanced by the proposed LSTM-AttenSkips.

Table 2. Brief comparison in terms of features, training objective, DNN type, and loss function. doi.org/10.1371/journal.pone.0291240.t002

At each time step, 11 frames of features are concatenated as the network input. The input to the model is causal. However, as demonstrated in Table 1, the network's computing process varies. There are different time steps in different layers: the calculation of the first time step of the second layer requires the output of the second time step of the first layer, and the calculation of the first time step of the third layer requires the output of the second, third, and fourth time steps of the first layer. Therefore, when calculating the first time step of the output layer, the future time steps of the first layer must be used. The deep model hyperparameters are listed in Table 3, where Units indicates the number of neurons.