Extracting speech spectrogram of speech signal based on generalized S-transform

Li Jiashen; Zhang Xianwu

doi:10.1371/journal.pone.0317362

Abstract

In speech signal processing, time-frequency analysis is commonly employed to extract the spectrogram of speech signals. While many algorithms exist to achieve this with high-quality results, they often lack the flexibility to adjust the resolution of the extracted spectrograms. However, applications such as speech recognition and speech separation frequently require spectrograms of varying resolutions, so the flexibility of an algorithm in providing different resolutions is crucial for these applications. This paper introduces the generalized S-transform and explains its fundamental theory and algorithmic implementation. By adjusting parameters, the proposed method flexibly produces spectrograms with different resolutions, offering a novel and effective approach to obtaining speech signal spectrograms. The algorithm enhances the traditional Stockwell transform (S-transform) by incorporating a low-pass filtering function and introducing two adjustable parameters. These parameters modify the Gaussian window function of the basic S-transform, resulting in a generalized S-transform with customizable time-frequency resolution. Finally, this paper presents simulation experiments using both synthesized signals and real speech data, comparing the generalized S-transform with several commonly used spectrogram extraction algorithms. The experiments demonstrate that the generalized S-transform is feasible and effective, particularly when it is combined with the generalized fundamental frequency profile. The results indicate that this method is viable and effective for obtaining spectrograms of speech signals and has potential applications in speech feature extraction and speech recognition. The pure speech dataset used in the experiments is sourced partly from a downloadable database and partly from a recorded speech set.

Citation: Jiashen L, Xianwu Z (2025) Extracting speech spectrogram of speech signal based on generalized S-transform. PLoS ONE 20(1): e0317362. https://doi.org/10.1371/journal.pone.0317362

Editor: Nattapol Aunsri, Mae Fah Luang University, THAILAND

Received: July 18, 2024; Accepted: December 26, 2024; Published: January 13, 2025

Copyright: © 2025 Jiashen, Xianwu. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All files are available from the DOI:10.6084/m9.figshare.27019942 database.

Funding: This work was supported in part by the Natural Science Foundation of Xinjiang Uygur Autonomous Region under Grant 2022D01C61, and in part by the “Tianchi Doctoral Program” Project of Xinjiang Uygur Autonomous Region under Grant TCBS202046. The funder had a role in the study design, data collection and analysis, publication decision, or manuscript writing; the funder is also a co-author of this article.

Competing interests: The authors have declared that no competing interests exist.

1 Introduction

As a type of non-stationary signal, speech signals are often effectively analyzed and processed using time-frequency analysis. For example, [1] presents an algorithm based on time-frequency analysis to extract the fundamental period of a speech signal. Time-frequency analysis represents a one-dimensional time signal as a two-dimensional time-frequency density function, which aims to reveal the frequency components within the signal and how these components vary over time. In speech signal processing, the visual representation that simultaneously displays time, frequency, and energy distribution is commonly referred to as a speech spectrogram [2].

Speech spectrograms, which provide a graphical display of speech feature information in both the time and frequency domains, are widely used in applications such as speech recognition and speech feature extraction. For instance, a new method for extracting features from speech spectrograms is proposed in [3]. Additionally, speech spectrograms have been combined with convolutional neural networks in [4] to improve speech intelligibility in individuals with speech impairments and serve as a tool for clinical symptom management. Common algorithms used in speech signal processing for extracting spectrograms include the Short-Time Fourier Transform (STFT) and Wavelet Transform [5]. The STFT is employed in [6–8] to extract speech spectrograms, which are then combined with deep learning for clinical diagnosis and speech quality enhancement. Wavelet transform is used in [9, 10] to extract speech spectrograms and is combined with other methods for speech enhancement research. However, STFT uses fixed-size sliding windows, which limits its ability to accurately analyze low-frequency signals with periods longer than the time window, and it offers relatively poor time-frequency resolution at high frequencies. Although the wavelet transform can adaptively reflect both low and high-frequency components, it is not closely related to the Fourier spectrum [2].

To address these limitations, R.G. Stockwell et al. proposed a new time-frequency analysis method known as the Stockwell transform (S-transform) [11–13], which combines the strengths and mitigates the weaknesses of STFT and wavelet transform. The S-transform introduces the multi-resolution analysis characteristics of the wavelet transform while maintaining a direct relationship with the Fourier spectrum. As a result, it has been widely used in the analysis of ground-penetrating radar, seismic wave data, and power systems [14–19].

Because the window function in the S-transform is fixed, its applicability is limited. To enhance the flexibility and time-frequency resolution of the S-transform, Zhang Xianwu et al. modified the window function and proposed the generalized S-transform [19–21]. This variant has been applied in practice to seismic data analysis, mechanical noise analysis, and other signal processing areas. For example, [20] utilized the flexibility of the generalized S-transform for ground-penetrating radar layer identification. In this study, we leverage the ability of the generalized S-transform to flexibly adjust resolution, applying it to the extraction of speech spectrograms. By adjusting the size of the window function through the introduced parameters, we address the limitation of the fixed window function in the standard S-transform. To our knowledge, the generalized S-transform has not previously been applied to speech signal processing for obtaining speech spectrograms. This paper introduces the generalized S-transform as a time-frequency analysis method for flexible extraction of speech spectrograms. The extracted spectrogram can be used for tasks such as speech feature parameter extraction [22], endpoint detection [23], and more. With the advancement of artificial intelligence, spectrograms are increasingly used in deep learning models for speech recognition [24], music recognition [25], and related research. Extracting speech spectrograms using the generalized S-transform therefore holds significant theoretical and practical value for speech analysis.

2 Rationale

The generalized S-transform, a time-frequency analysis method, was introduced earlier. Below, we will derive the computational formulas for both the S-transform and the generalized S-transform in detail.

2.1 S-transform

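R.G. Stockwell et al. combined the advantages of the STFT and the wavelet transform by replacing the window function of the STFT with a Gaussian window function [11–14], i.e.:

$$w(t-\tau) = \frac{1}{a\sqrt{2\pi}}\, e^{-\frac{(t-\tau)^{2}}{2a^{2}}} \tag{1}$$

where a in Eq (1) is the scale factor, and t and τ denote time. Let $a = 1/|f|$; then the S-transform of h(t) is obtained as follows:

$$S(\tau,f) = \int_{-\infty}^{\infty} h(t)\,\frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{(t-\tau)^{2}f^{2}}{2}}\, e^{-i2\pi ft}\,dt \tag{2}$$

where t and τ denote time and f denotes frequency, all real numbers. Define the time window function of the S-transform as G(t, f):

$$G(t,f) = \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{t^{2}f^{2}}{2}} \tag{3}$$

The time window function G(t, f) of the S-transform has to satisfy the following condition [21]:

$$\int_{-\infty}^{\infty} G(t,f)\,dt = 1 \tag{4}$$

Under the condition that Eq (4) is satisfied, integrating the S-transform over τ gives:

$$\int_{-\infty}^{\infty} S(\tau,f)\,d\tau = H(f) \tag{5}$$

H(f) in Eq (5) is the Fourier transform of h(t), so h(t) can be recovered from S(τ,f):

$$h(t) = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} S(\tau,f)\,d\tau\right] e^{i2\pi ft}\,df \tag{6}$$

To improve computational efficiency, the S-transform is usually implemented in the frequency domain, where the forward S-transform is:

$$S(\tau,f) = \int_{-\infty}^{\infty} H(\eta+f)\, e^{-\frac{2\pi^{2}\eta^{2}}{f^{2}}}\, e^{i2\pi\eta\tau}\,d\eta, \quad f \neq 0 \tag{7}$$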
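As a minimal sketch of the frequency-domain implementation described above, the following NumPy code evaluates the standard discrete S-transform of Stockwell et al. [11] one frequency row at a time. The function name `s_transform`, the use of integer frequency indices, and the convention of setting the zero-frequency row to the signal mean are assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def s_transform(h):
    """Discrete S-transform of a real 1-D signal, computed via the FFT.

    Returns a complex array of shape (N//2 + 1, N). Row k corresponds to
    the frequency k / (N * dt); columns are time samples.
    """
    h = np.asarray(h, dtype=float)
    n = h.size
    H = np.fft.fft(h)
    # Signed frequency-shift indices in FFT order: 0, 1, ..., -2, -1.
    m = np.fft.fftfreq(n, d=1.0 / n)
    S = np.zeros((n // 2 + 1, n), dtype=complex)
    # Assumed convention: the zero-frequency row is the signal mean.
    S[0, :] = h.mean()
    for k in range(1, n // 2 + 1):
        # Fourier transform of the Gaussian window for frequency index k.
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / k**2)
        # np.roll(H, -k)[m] equals H[(m + k) mod N], i.e. the shifted spectrum.
        S[k, :] = np.fft.ifft(np.roll(H, -k) * gauss)
    return S
```

Summing any row of the result over time returns the corresponding Fourier coefficient of the signal, which is the discrete counterpart of the invertibility property used in this section.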

2.2 Generalized S-transform (GST)

The generalized S-transform (GST), as proposed in this paper, extends the traditional S-transform by incorporating a low-pass filtering function. Additionally, it introduces adjustment parameters within the time window function. These parameters play a crucial role in adjusting both the time and frequency resolutions of the GST. By varying their magnitudes, researchers can effectively control the trade-off between time resolution and frequency resolution in the transformed signal [20, 21].

Define the time window function of the generalized S-transform as W(t,f,λ_G,λ_L): (8)

In Eq (8), λ_G and λ_L are the two adjustment parameters. The parameter λ_G adjusts the width of the time window: the larger λ_G is, the narrower the time window becomes. This increases the temporal resolution of the generalized S-transform but simultaneously reduces its frequency resolution. Conversely, the frequency resolution of the generalized S-transform is adjusted through the parameter λ_L: the smaller λ_L is, the higher the frequency resolution becomes [20].

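The time-window function of the GST, W(t,f,λ_G,λ_L), satisfies the time-window condition of the S-transform [13]:

$$\int_{-\infty}^{\infty} W(t,f,\lambda_G,\lambda_L)\,dt = 1 \tag{9}$$

Then the GST is obtained as follows:

$$GST(\tau,f) = \int_{-\infty}^{\infty} h(t)\, W(\tau-t,f,\lambda_G,\lambda_L)\, e^{-i2\pi ft}\,dt \tag{10}$$

Under the condition that Eq (9) is satisfied, H(f) can be obtained:

$$\int_{-\infty}^{\infty} GST(\tau,f)\,d\tau = H(f) \tag{11}$$

H(f) in Eq (11) is the Fourier transform of the signal h(t), so the inverse transform of the GST is:

$$h(t) = \int_{-\infty}^{\infty} \left[\int_{-\infty}^{\infty} GST(\tau,f)\,d\tau\right] e^{i2\pi ft}\,df \tag{12}$$

To improve computational efficiency, the generalized S-transform can be implemented in the frequency domain, where the transform equation is:

$$GST(\tau,f) = \int_{-\infty}^{\infty} H(\eta+f)\, L(\eta,f)\, e^{i2\pi\eta\tau}\,d\eta \tag{13}$$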

In Eq (13), L(η,f) is given by: (14)

To ensure that the forward and inverse GST are fully invertible, the GST must satisfy the following at f = 0: (15)
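To make the role of the two parameters concrete, the sketch below implements one possible frequency-domain GST. It rests on loudly stated assumptions that are not taken from the equations of this paper: the Gaussian factor is assumed to be exp(−2π²η²/(λ_G f)²), and the low-pass filter is assumed to be an ideal rectangular filter that keeps |η| ≤ λ_L·|f|. The names `gst`, `lam_g`, and `lam_l` are hypothetical. These choices are only meant to reproduce the qualitative behavior described in the text (larger λ_G gives a narrower time window, smaller λ_L gives higher frequency resolution, and λ_G = 1 with λ_L = +∞ reduces to the S-transform).

```python
import numpy as np

def gst(h, lam_g=1.0, lam_l=np.inf):
    """Illustrative generalized S-transform of a real 1-D signal via the FFT.

    ASSUMPTIONS (not from the paper): Gaussian spectral window scaled by
    lam_g, multiplied by an ideal low-pass filter with cutoff lam_l * k.
    Returns a complex array of shape (N//2 + 1, N).
    """
    h = np.asarray(h, dtype=float)
    n = h.size
    H = np.fft.fft(h)
    m = np.fft.fftfreq(n, d=1.0 / n)  # signed shift indices in FFT order
    out = np.zeros((n // 2 + 1, n), dtype=complex)
    out[0, :] = h.mean()  # assumed convention for the zero-frequency row
    for k in range(1, n // 2 + 1):
        # Larger lam_g widens this spectral Gaussian, i.e. narrows the time window.
        gauss = np.exp(-2.0 * np.pi**2 * m**2 / (lam_g * k) ** 2)
        # ASSUMED ideal low-pass: smaller lam_l keeps fewer neighbouring frequencies.
        lowpass = (np.abs(m) <= lam_l * k).astype(float)
        out[k, :] = np.fft.ifft(np.roll(H, -k) * gauss * lowpass)
    return out
```

With the defaults `lam_g=1.0` and `lam_l=np.inf`, the low-pass mask is all ones and the code reduces to the standard S-transform. Because both factors equal 1 at η = 0, summing a row over time still returns the Fourier coefficient of the signal, so the sketch keeps the invertibility property under these assumptions.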

3 Comparative analysis

The generalized S-transform enhances the original S-transform by introducing a low-pass filter and two adjustable parameters. These parameters allow for the modification of the window function size, thereby altering the resolution of the generalized S-transform. This approach retains the multi-resolution analysis capabilities of the wavelet transform while maintaining a direct relationship with the Fourier spectrum. Below, we provide a theoretical validation of the generalized S-transform and explore its application in speech signal processing.

To examine the characteristics of the generalized S-transform for analyzing time-varying signals, a time-varying signal h(t) is synthesized [20].

The signal h(t) primarily contains three frequency components: 100 Hz, 200 Hz, and 300 Hz. The starting and stopping times for each frequency component are 0 ∼ 449 ms, 274 ∼ 723 ms, and 550 ∼ 999 ms, respectively. The signal h(t) is sampled with a sampling interval of 1 ms, and the total sampling duration is 999 ms. Fig 1 presents a schematic diagram of the sampled signal h(t) [20].
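The synthesized signal described above can be sketched as follows. The text specifies only the frequencies, the start and stop times, and the 1 ms sampling interval; the unit amplitudes, the cosine form with zero initial phase, and the function name `synth_signal` are assumptions made for illustration.

```python
import numpy as np

def synth_signal():
    """Three-component test signal sampled every 1 ms over 0..999 ms.

    ASSUMPTIONS: unit-amplitude cosines with zero phase; the paper gives
    only the frequencies and the active intervals.
    """
    dt = 1e-3                       # sampling interval: 1 ms
    idx = np.arange(1000)           # sample index equals time in ms
    t = idx * dt                    # time in seconds
    # (frequency in Hz, start in ms, stop in ms), intervals inclusive
    segments = [(100.0, 0, 449), (200.0, 274, 723), (300.0, 550, 999)]
    h = np.zeros(idx.size)
    for freq, start, stop in segments:
        active = (idx >= start) & (idx <= stop)
        h[active] += np.cos(2.0 * np.pi * freq * t[active])
    return t, h
```

Between 274 ms and 449 ms the 100 Hz and 200 Hz components are both active, and between 550 ms and 723 ms the 200 Hz and 300 Hz components are both active, which are the two intervals where frequency overlap is examined in the following subsections.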

Fig 1. Schematic of the synthesized time-varying signal h(t) after sampling.

https://doi.org/10.1371/journal.pone.0317362.g001

3.1 S-transform of the synthesized signal

The S-transform is performed on the signal h(t), and the results are shown in Fig 2. In Fig 2, the frequency components of the signal and their corresponding times align with the synthesized signal h(t). As the frequency of the signal increases, the frequency resolution of the S-transform decreases, while the time resolution increases [20].
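The frequency dependence of the resolution follows from the width of the Gaussian window. As a worked illustration, assume the standard S-transform window of Stockwell et al., a Gaussian in time with standard deviation 1/|f|, and write σ_t and σ_η (symbols introduced here for illustration) for the standard deviation of the window in time and of its Fourier transform in frequency:

```latex
\sigma_t(f) = \frac{1}{|f|}, \qquad
\sigma_\eta(f) = \frac{|f|}{2\pi}, \qquad
\sigma_t(f)\,\sigma_\eta(f) = \frac{1}{2\pi}

\begin{array}{c|c|c}
f & \sigma_t & \sigma_\eta \\ \hline
100\ \mathrm{Hz} & 10\ \mathrm{ms} & \approx 15.9\ \mathrm{Hz} \\
200\ \mathrm{Hz} & 5\ \mathrm{ms} & \approx 31.8\ \mathrm{Hz} \\
300\ \mathrm{Hz} & \approx 3.3\ \mathrm{ms} & \approx 47.7\ \mathrm{Hz}
\end{array}
```

The time window narrows and the spectral window widens in proportion to f, while their product stays constant, which is consistent with the trend described for Fig 2.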

Fig 2. S-transform results for the synthesized time-varying signal.

https://doi.org/10.1371/journal.pone.0317362.g002

3.2 Generalized S-transform of synthesized signals

3.2.1 Effect of the tuning parameter λ_G on the generalized S-transform.

To analyze the effect of the parameter λ_G on the generalized S-transform, we first fix λ_L at 10 and then perform the generalized S-transform on the signal h(t) for different values of λ_G. Fig 3 shows the impact of λ_G on the generalized S-transform. From Fig 3, it can be seen that, with λ_L held constant, the time resolution of the generalized S-transform increases as λ_G increases. The time resolution of the generalized S-transform is higher than that of the standard S-transform when λ_G > 1. However, the frequency resolution of the generalized S-transform decreases as λ_G increases. At λ_G = 2, due to the high time resolution and low frequency resolution, the frequency components overlap during the periods 274 ms ∼ 449 ms and 550 ms ∼ 723 ms, making it impossible to distinguish them in the time-frequency domain after the generalized S-transform [20].

Fig 3. Generalized S-transform results for the signal h(t) with λ_L fixed at 10 and λ_G varied.

Panels (a-d) show the results for different values of λ_G: (a) λ_G = 0.1, λ_L = 10; (b) λ_G = 0.5, λ_L = 10; (c) λ_G = 1, λ_L = 10; (d) λ_G = 2, λ_L = 10.

https://doi.org/10.1371/journal.pone.0317362.g003

3.2.2 Effect of tuning parameter λ_L on the generalized S-transform.

By fixing λ_G at 2 and adjusting λ_L, we perform the generalized S-transform on the signal h(t) for different values of λ_L. The results are shown in Fig 4. In both Fig 3d and Fig 4a, the frequency components overlap during the periods 274 ms ∼ 449 ms and 550 ms ∼ 723 ms. As λ_L decreases, the frequency resolution of the generalized S-transform increases and the overlapping frequency components are gradually separated; when λ_L ≤ 0.2, they are separated and can be recognized in the time-frequency domain. The transform results in Fig 3d (λ_L = 10) and Fig 4a (λ_L = 1) are the same, indicating that, with λ_G held constant, the generalized S-transform results no longer change once λ_L exceeds a certain value. When λ_G = 1 and λ_L = +∞, the generalized S-transform and S-transform results are identical for any signal [20].

Fig 4. Generalized S-transform results for the signal h(t) with λ_G fixed at 2 and λ_L varied.

Panels (a-d) show the results for different values of λ_L: (a) λ_G = 2, λ_L = 1; (b) λ_G = 2, λ_L = 0.4; (c) λ_G = 2, λ_L = 0.2; (d) λ_G = 2, λ_L = 0.15.

https://doi.org/10.1371/journal.pone.0317362.g004

4 Generalized S-transform processing of speech signals

A speech signal is inherently non-stationary, but it can be considered quasi-stationary over short periods. Therefore, speech signal processing systems typically segment the signal into short-time frames of 10 ∼ 40 ms for analysis [2]. In this experiment, we used speech data from a Chinese public dataset, with a duration of approximately 2 seconds, a sampling frequency of 8000 Hz, and a single channel. The content is the phrase ‘blue sky and white clouds’ (S1 Video). We selected a frame length of 30 ms and a frame shift of 15 ms for the analysis, as shown in Fig 5. Each frame of the speech signal was processed using the S-transform and the generalized S-transform. Fig 6 shows the resulting speech spectrogram extracted using the S-transform. The temporal resolution of the spectrogram is judged by the clarity of curves parallel to the vertical axis (frequency axis) and perpendicular to the horizontal axis (time axis). Similarly, the frequency resolution is judged by the clarity of curves parallel to the horizontal axis (time axis) and perpendicular to the vertical axis (frequency axis). As shown in Fig 6, the spectrogram obtained by the S-transform exhibits higher temporal resolution in the high-frequency regions and higher frequency resolution in the low-frequency regions. However, there is some frequency overlap in the low-frequency region, and the fundamental frequency and the second formant are not distinctly separated.
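The framing step can be sketched as follows, using the values stated above (8000 Hz sampling, 30 ms frames, 15 ms shift, which give 240-sample frames with a 120-sample hop). The function name `frame_signal` and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30.0, shift_ms=15.0):
    """Split a 1-D signal into overlapping frames (one frame per row).

    ASSUMPTION: samples that do not fill a complete final frame are dropped.
    """
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000.0))   # 240 samples at 8 kHz
    hop = int(round(fs * shift_ms / 1000.0))         # 120 samples at 8 kHz
    if x.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (x.size - frame_len) // hop
    # Row i holds samples i*hop .. i*hop + frame_len - 1.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]
```

For a 2-second recording at 8000 Hz (16000 samples), this yields 132 frames of 240 samples each, and each frame can then be passed to the chosen time-frequency transform.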

Fig 5. Time domain diagram of speech signal.

https://doi.org/10.1371/journal.pone.0317362.g005

Fig 6. Spectrogram of the speech signal extracted by the S-transform.

https://doi.org/10.1371/journal.pone.0317362.g006

When applying the generalized S-transform to speech data, it is essential to select appropriate adjustment parameters λ_G and λ_L. A practical approach is to first set λ_L = 1 and then adjust the parameter λ_G based on the transformation results. As discussed in the previous section, a larger λ_G increases the temporal resolution of the generalized S-transform. To adjust λ_G, we can initially set it to 0.01 and then incrementally increase λ_G by 0.01 in each step, analyzing the corresponding results. Fig 7 illustrates the speech spectrograms obtained with λ_G values of 0.01, 0.25, 0.5, and 0.95, respectively. As λ_G increases, the temporal resolution of the speech spectrogram improves, while the frequency resolution decreases. This demonstrates that the temporal resolution of the spectrogram can be effectively controlled by adjusting the value of λ_G.

Fig 7. Spectrograms with λ_L fixed at 1 and λ_G varied: as λ_G increases, the time resolution of the high-frequency part of the spectrogram gradually increases and the frequency resolution gradually decreases.

Panels (a-d) show the spectrograms extracted with different values of λ_G: (a) λ_G = 0.01, λ_L = 1; (b) λ_G = 0.25, λ_L = 1; (c) λ_G = 0.5, λ_L = 1; (d) λ_G = 0.95, λ_L = 1.

https://doi.org/10.1371/journal.pone.0317362.g007

When selecting λ_G, increasing it gradually improves the time resolution but simultaneously reduces the frequency resolution of the generalized S-transform. To balance this trade-off, λ_L should be reduced appropriately once λ_G has been determined. The values of λ_G and λ_L are considered optimal when the different frequency components in the transform result are well separated. In the analysis of speech signals using the generalized S-transform, λ_G is set to 0.25 and λ_L to 1, and λ_L is then gradually decreased in steps of 0.01. The results of the generalized S-transform for different λ_L values are compared in Fig 8, which displays the spectrograms for λ_L = 1, 0.5, 0.1, and 0.01, respectively. As λ_L decreases, the time resolution of the high-frequency portion of the spectrogram decreases while the frequency resolution increases. When λ_G = 0.25 and λ_L = 0.01, the frequency components in the spectrogram of Fig 8(d) are well separated, making it a suitable representation of the corresponding speech signal.

Fig 8. Spectrograms with λ_G fixed at 0.25 and λ_L varied: as λ_L decreases, the time resolution of the high-frequency part of the spectrogram gradually decreases and the frequency resolution gradually increases.

Panels (a-d) show the spectrograms extracted with different values of λ_L: (a) λ_G = 0.25, λ_L = 1; (b) λ_G = 0.25, λ_L = 0.5; (c) λ_G = 0.25, λ_L = 0.1; (d) λ_G = 0.25, λ_L = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g008

However, experiments indicate that when λ_G is set to 0.05, adjusting λ_L does not significantly affect the resolution of the generalized S-transform, as shown in Fig 9. Similarly, when λ_L is set to 0.01, adjusting λ_G leaves the resolution largely unchanged, as shown in Fig 10. The spectrograms in Figs 9 and 10 closely resemble the spectrogram in Fig 8(d), obtained with λ_G = 0.25 and λ_L = 0.01. Therefore, the generalized S-transform parameters λ_G = 0.05 and λ_L = 0.01 are selected for the speech data, as depicted in Fig 11. The ability to adjust λ_G and λ_L allows the generalized S-transform to produce spectrograms with different resolutions, making it highly adaptable for speech signal processing.

Fig 9. Speech spectrograms extracted by the generalized S-transform with λ_G fixed at 0.05 and λ_L varied.

Panels (a-d) show the spectrograms extracted with different values of λ_L: (a) λ_G = 0.05, λ_L = 0.01; (b) λ_G = 0.05, λ_L = 0.1; (c) λ_G = 0.05, λ_L = 1; (d) λ_G = 0.05, λ_L = 10.

https://doi.org/10.1371/journal.pone.0317362.g009

Fig 10. Speech spectrograms extracted by the generalized S-transform with λ_L fixed at 0.01 and λ_G varied.

Panels (a-d) show the spectrograms extracted with different values of λ_G: (a) λ_G = 0.01, λ_L = 0.01; (b) λ_G = 0.1, λ_L = 0.01; (c) λ_G = 1, λ_L = 0.01; (d) λ_G = 10, λ_L = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g010

Fig 11. Spectrogram extracted by the generalized S-transform with the adjustment parameters λ_G = 0.05, λ_L = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g011

The S-transform results (Fig 6) show the distribution of the fundamental frequency of the speech signal. In the region of the resonance peaks (formants), however, the time resolution is high and the frequency resolution is low, so the fundamental frequency and the second resonance peak are not clearly separated.

By comparison, Fig 11 shows that the generalized S-transform allows a finer distinction between the fundamental frequency and each resonance peak of the speech signal, with a marked improvement in the frequency resolution of both. The figure shows that the fundamental frequency of the speech signal lies between 100 Hz and 200 Hz. The signal strength at each time point is represented by the grayscale in the speech spectrogram. The curves parallel to the fundamental frequency contour correspond to the resonance peak information in the same time period. The speech spectrogram effectively reflects the dynamic spectral characteristics of the speech signal, providing a visual representation of the speech.

Under the same experimental conditions, time-frequency analyses of the speech signal were performed using the Short-Time Fourier Transform (STFT) and the wavelet transform (using a Morlet mother wavelet with a center frequency parameter of 50 and a bandwidth parameter of 3). The resulting spectrograms are shown in Figs 12 and 13, respectively. The spectrogram obtained from the generalized S-transform in Fig 11 is similar to that from the STFT in Fig 12, with the fundamental frequency primarily distributed between 100 Hz and 200 Hz. Within the effective speech range, the patterns of speech sound distribution are consistent, and the resonance peaks are distinct and similarly distributed.

Fig 12. Speech spectrogram extracted by the short-time Fourier transform.

https://doi.org/10.1371/journal.pone.0317362.g012

Fig 13. Speech spectrogram extracted by the wavelet transform (Morlet mother wavelet with a center frequency parameter of 50 and a bandwidth parameter of 3).

https://doi.org/10.1371/journal.pone.0317362.g013

In contrast, Fig 13, which shows the spectrogram obtained via wavelet transform, demonstrates higher fundamental frequency resolution but lower time resolution. In the distribution of high-frequency resonance peaks, the time resolution is higher while the frequency resolution is lower. When compared with the fundamental frequency distribution in Fig 11, the two are similar, and the acoustic texture distribution within the effective speech is comparable, corresponding to the time-domain distribution of the speech signal in Fig 5. Comparative analysis with the spectrograms obtained from the STFT and wavelet transform indicates that the generalized S-transform is a feasible method for extracting speech signal spectrograms.

5 Evaluating the effectiveness of GST in acquiring speech spectrograms

The experiments above demonstrate that the GST is a feasible method for extracting speech spectrograms. To validate the effectiveness of the GST in obtaining accurate speech spectrograms, the cepstrum method is employed to extract the fundamental frequency curve of the speech, which serves as a reference index. The effectiveness of the spectrogram is then verified against this reference. The formulas for extracting the fundamental frequency curve using the cepstrum method are as follows:

The Fourier transform of the signal x(n) is given by:

$$X(\omega) = \mathrm{FT}[x(n)] \tag{16}$$

Taking the logarithm of the magnitude of the Fourier transform and then computing the inverse Fourier transform yields the cepstrum [2]:

$$\hat{x}(n) = \mathrm{FT}^{-1}\left[\ln|X(\omega)|\right] \tag{17}$$

The cepstrum sequence $\hat{x}(n)$ of x(n) is the inverse Fourier transform of the logarithm of the magnitude spectrum of x(n), where FT and FT⁻¹ denote the Fourier transform and inverse Fourier transform, respectively.

The same method is applied to the speech signal, where the speech signal x(n) is obtained from the excitation pulse u(n) filtered by the vocal tract response v(n). The speech signal can be modeled as the convolution:

$$x(n) = u(n) * v(n) \tag{18}$$

The Fourier transform of this model is:

$$X(\omega) = U(\omega)\,V(\omega) \tag{19}$$

Solving for the cepstrum yields:

$$\hat{x}(n) = \mathrm{FT}^{-1}\left[\ln|U(\omega)|\right] + \mathrm{FT}^{-1}\left[\ln|V(\omega)|\right] \tag{20}$$

From Eqs (16) to (20), it is evident that the cepstrum of the excitation pulse, FT⁻¹ln(|U(ω)|), and the cepstrum of the vocal tract response, FT⁻¹ln(|V(ω)|), can be separated by the cepstral algorithm. This separation allows the excitation signal u(n) to be recovered from FT⁻¹ln(|U(ω)|) in the cepstrum domain. The resulting fundamental frequency curve of the speech signal is shown in Fig 14.

Fig 14. Cepstrum method for extracting the fundamental frequency curve of a speech signal.

The top graph shows the time-domain waveform of the speech signal, the middle graph shows the fundamental period in sampling points, and the bottom graph shows the fundamental frequency calculated from the sampling points of the middle graph.

https://doi.org/10.1371/journal.pone.0317362.g014
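The cepstral pitch estimate for a single frame can be sketched as follows: the log-magnitude spectrum is transformed back to the quefrency domain, and the strongest peak inside a plausible pitch range gives the fundamental period in samples. The function name `cepstral_f0`, the Hamming window, the 70 to 400 Hz search range, and the absence of a voiced/unvoiced decision are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def cepstral_f0(frame, fs, fmin=70.0, fmax=400.0):
    """Estimate the fundamental frequency of one frame from its real cepstrum.

    ASSUMPTIONS: Hamming window, search range fmin..fmax, and no
    voiced/unvoiced decision (a value is returned for every frame).
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(frame.size)
    spectrum = np.fft.fft(windowed)
    # Small constant avoids log(0) for empty spectral bins.
    log_mag = np.log(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.ifft(log_mag).real
    # Quefrency (in samples) limits corresponding to fmax and fmin.
    q_lo = int(fs / fmax)
    q_hi = min(int(fs / fmin), frame.size // 2)
    # The peak position is the fundamental period in samples.
    period = q_lo + int(np.argmax(cepstrum[q_lo:q_hi + 1]))
    return fs / period
```

Applying this function to each frame of the speech signal gives a sequence of fundamental period values (in sampling points) and the corresponding fundamental frequencies, analogous to the middle and bottom graphs of Fig 14.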

The fundamental frequency curve is obtained using the cepstrum method and mapped onto the spectrogram. To verify the effectiveness of the generalized S-transform (GST) in obtaining the spectrogram, it is compared with the STFT and wavelet transform algorithms, using the fundamental frequency curve as a reference. In this comparison, the STFT uses a fixed Hamming window with a width of 30 ms, while the wavelet transform employs the Morlet wavelet with a center frequency parameter of 3 and a bandwidth parameter of 3. The results of each algorithm are shown in Figs 15 to 17.
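For reference, an STFT spectrogram with a fixed 30 ms Hamming window can be sketched as follows. The window type and width come from the text; the 15 ms hop (borrowed from the frame shift used earlier), the decibel scaling, and the function name `stft_spectrogram` are assumptions made for illustration.

```python
import numpy as np

def stft_spectrogram(x, fs=8000, win_ms=30.0, hop_ms=15.0):
    """Magnitude spectrogram (in dB) from an STFT with a fixed Hamming window.

    ASSUMPTIONS: 15 ms hop and dB scaling; only the 30 ms Hamming window
    is stated in the text. Returns (freqs, times, spectrogram) where the
    spectrogram has shape (n_freqs, n_frames).
    """
    x = np.asarray(x, dtype=float)
    win_len = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    window = np.hamming(win_len)
    n_frames = 1 + (x.size - win_len) // hop
    idx = np.arange(win_len)[None, :] + hop * np.arange(n_frames)[:, None]
    # One windowed FFT per frame; rfft keeps only non-negative frequencies.
    spec = np.fft.rfft(x[idx] * window, axis=1)
    mag_db = 20.0 * np.log10(np.abs(spec) + 1e-12)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2.0) / fs  # frame centres
    return freqs, times, mag_db.T
```

Because the window length is the same at every frequency, the time and frequency resolution of this spectrogram are fixed, in contrast to the frequency-dependent window of the S-transform and the adjustable window of the GST.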

Fig 15. Speech spectrogram extracted by the generalized S-transform, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g015

Fig 16. Speech spectrogram extracted by the STFT, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g016

Fig 17. Speech spectrogram extracted by the wavelet transform, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g017

Fig 15 shows the spectrogram of the fundamental frequency obtained by GST (with selected parameter values of λ_G = 0.2 and λ_L = 0.1, providing high resolution of the fundamental frequency). Based on the alignment of the fundamental frequency curves and the spectrogram’s distribution, it can be confirmed that the speech spectrogram extracted by the generalized S-transform proposed in this paper is effective. Figs 16 and 17 illustrate the spectrograms obtained by the STFT and wavelet transform, respectively. The comparison of these figures, under the same fundamental frequency distribution curve, shows that the distribution of the fundamental frequency is consistent across the spectrograms extracted by all three methods. This verifies that the speech spectrograms extracted by the generalized S-transform method proposed in this paper are both feasible and effective for speech signals.

6 Conclusion

In this paper, we introduce a low-pass filter into the time window function of the S-transform and incorporate two adjustment parameters to control the time and frequency resolution of the generalized S-transform. By selecting appropriate adjustment parameters, the generalized S-transform can be effectively tailored for processing speech signals. Experimental comparisons demonstrate that the generalized S-transform offers high flexibility and feasibility in extracting speech spectrograms. In the future, our research group will explore the application of the generalized S-transform to noisy speech and to training with large datasets.

Supporting information

S1 Video. The content of the speech.

https://doi.org/10.1371/journal.pone.0317362.s001

(WAV)

References

1. Kwong S, Gang W, Lee CH. A pitch detection algorithm based on time-frequency analysis. In: [Proceedings] Singapore ICCS/ISITA ‘92; 1992. p. 432–436 vol.2.
2. Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Journal of Applied Mechanics. 2010;30(1):445–447.
3. Mulimani M, Koolagudi SG. Acoustic Event Classification Using Spectrogram Features. In: TENCON 2018—2018 IEEE Region 10 Conference; 2018. p. 1460–1464.
4. H M C, Karjigi V, Sreedevi N. Investigation of Different Time-Frequency Representations for Intelligibility Assessment of Dysarthric Speech. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2020;28(12):2880–2889. pmid:33141673
5. Chen D, Xiao T, Hu W, Wu Q. Pitches Detection of Mixed Speech using Synchrosqueezing Wavelet Transform. In: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI); 2021. p. 254–260.
6. Elfaki A, Asnawi AL, Jusoh AZ, Ismail AF, Ibrahim SN, Mohamed Azmin NF, et al. Using the Short-Time Fourier Transform and ResNet to Diagnose Depression from Speech Data. In: 2021 IEEE International Conference on Computing (ICOCO); 2021. p. 372–376.
7. Kaneko T, Tanaka K, Kameoka H, Seki S. ISTFTNET: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform. In: ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2022. p. 6207–6211.
8. Radha K, Rao DV, Sai KVK, Krishna RT, Muneera A. Detecting Autism Spectrum Disorder from Raw Speech in Children using STFT Layered CNN Model. In: 2024 International Conference on Green Energy, Computing and Sustainable Technology (GECOST); 2024. p. 437–441.
9. Kornsing S, Srinonchat J. Enhancement Speech Compression Technique Using Modern Wavelet Transforms. In: 2012 International Symposium on Computer, Consumer and Control; 2012. p. 393–396.
10. Li R, Bao C, Xia B, Jia M. Speech enhancement using the combination of adaptive wavelet threshold and spectral subtraction based on wavelet packet decomposition. In: 2012 IEEE 11th International Conference on Signal Processing. vol. 1; 2012. p. 481–484.
11. Stockwell RG, Mansinha L, Lowe RP. Localization of the complex spectrum: the S transform. IEEE Transactions on Signal Processing. 1996;44(4):998–1001.
12. Stockwell RG. A basis for efficient representation of the S-transform. Digital Signal Processing. 2007;17(1):371–393.
13. Pinnegar R, Mansinha L. The S-transform with windows of arbitrary and varying shape. GEOPHYSICS. 2003;68.
14. Liu W, Zhai Z, Fang Z. A Multisynchrosqueezing-Based S-Transform for Time-Frequency Analysis of Seismic Data. Pure and Applied Geophysics. 2024; p. 1–17.
15. Aunsri N, Jakkaew P, Kuptametee C. Investigation and evaluation of cross-term reduction in masked Wigner-Ville distributions using S-transforms. PLOS ONE. 2024;19(11):e0310721. pmid:39504330
16. Wei D, Shen J. Synchrosqueezing Fractional S-transform: Theory, Implementation and Applications. Circuits Syst Signal Process. 2023;43(3):1572–1596.
17. Wang Y, Peng Z, He Y. Time-frequency representation for seismic data using sparse S transform. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC); 2016. p. 1923–1926.
18. Thangaraj K, Muruganandham J, Selvaumar S, Jagan R. Analysis of harmonics using S-Transform. In: 2016 International Conference on Emerging Trends in Engineering, Technology and Science (ICETETS); 2016. p. 1–5.
19. Liu N, Gao J, Zhang B, Li F, Wang Q. Time–Frequency Analysis of Seismic Data Using a Three Parameters S Transform. IEEE Geoscience and Remote Sensing Letters. 2018;15(1):142–146.
20. ZHANG Xian-Wu FGY GAO Yun-Ze. Application of generalized S transform with lowpass filtering to layer recognition of Ground Penetrating Radar. Chinese Journal of Geophysics (in Chinese). 2013;56(1):309–316.
21. Rao X, Xu Z, Guan H, Ding Y, Yang X, Yang Y. Cable Defect Location by Using Frequency Domain Reflectometry with Synchrosqueezing Generalized S-Transform. In: 2023 Panda Forum on Power and Energy (PandaFPE); 2023. p. 1178–1182.
22. Wang GY, Zhang YM, Sun ML, Wang X, Zhang Y. Speech signal feature parameters extraction algorithm based on PCNN for isolated word recognition. In: 2016 International Conference on Audio, Language and Image Processing (ICALIP); 2016. p. 679–682.
23. Wu D, Tao Z, Wu Y, Shen C, Xiao Z, Zhang X, et al. Speech endpoint detection in noisy environment using Spectrogram Boundary Factor. In: 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI); 2016. p. 964–968.
24. Prasomphan S. Detecting human emotion via speech recognition by using speech spectrogram. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2015. p. 1–10.
25. M R N, Mohan B S S. Music Genre Classification using Spectrograms. In: 2020 International Conference on Power, Instrumentation, Control and Computing (PICC); 2020. p. 1–5.
