Extracting speech spectrogram of speech signal based on generalized S-transform

  • Li Jiashen,

    Roles Data curation, Formal analysis, Methodology, Writing – original draft, Writing – review & editing

    Affiliation College of Computer Science and Technology, Xinjiang University, Urumqi, Xinjiang, China

  • Zhang Xianwu

    Roles Project administration, Supervision, Writing – review & editing

    zxw@xju.edu.cn

    Affiliation College of Computer Science and Technology, Xinjiang University, Urumqi, Xinjiang, China

Abstract

In speech signal processing, time-frequency analysis is commonly employed to extract the spectrogram of speech signals. While many algorithms achieve this with high-quality results, they often lack the flexibility to adjust the resolution of the extracted spectrograms. However, applications such as speech recognition and speech separation frequently require spectrograms of varying resolutions, so an algorithm's ability to provide different resolutions is crucial for them. This paper introduces the generalized S-transform and explains its fundamental theory and algorithmic implementation. By adjusting parameters, the proposed method flexibly produces spectrograms with different resolutions, offering a novel and effective approach to obtaining speech signal spectrograms. The algorithm enhances the traditional Stockwell transform (S-transform) by incorporating a low-pass filtering function and introducing two adjustable parameters. These parameters modify the Gaussian window function of the basic S-transform, resulting in a generalized S-transform with customizable time-frequency resolution. Finally, this paper presents simulation experiments using both synthesized signals and real speech data, comparing the generalized S-transform with several commonly used spectrogram extraction algorithms. The experiments demonstrate that the generalized S-transform is feasible and effective, particularly when it is combined with the generalized fundamental frequency profile. The results indicate that this method is viable and effective for obtaining spectrograms of speech signals, and has potential applications in speech feature extraction and speech recognition. The clean speech data used in the experiments come partly from a downloadable database and partly from a recorded speech set.

1 Introduction

As a type of non-stationary signal, speech signals are often effectively analyzed and processed using time-frequency analysis. For example, [1] presents an algorithm based on time-frequency analysis to extract the fundamental period of a speech signal. Time-frequency analysis represents a one-dimensional time signal as a two-dimensional time-frequency density function, which aims to reveal the frequency components within the signal and how these components vary over time. In speech signal processing, the visual representation that simultaneously displays time, frequency, and energy distribution is commonly referred to as a speech spectrogram [2].

Speech spectrograms, which provide a graphical display of speech feature information in both the time and frequency domains, are widely used in applications such as speech recognition and speech feature extraction. For instance, a new method for extracting features from speech spectrograms is proposed in [3]. Additionally, speech spectrograms have been combined with convolutional neural networks in [4] to improve speech intelligibility in individuals with speech impairments and serve as a tool for clinical symptom management. Common algorithms used in speech signal processing for extracting spectrograms include the Short-Time Fourier Transform (STFT) and the wavelet transform [5]. The STFT is employed in [6–8] to extract speech spectrograms, which are then combined with deep learning for clinical diagnosis and speech quality enhancement. The wavelet transform is used in [9, 10] to extract speech spectrograms and is combined with other methods for speech enhancement research. However, the STFT uses fixed-size sliding windows, which limits its ability to accurately analyze low-frequency signals with periods longer than the time window, and it offers relatively poor time-frequency resolution at high frequencies. Although the wavelet transform can adaptively reflect both low- and high-frequency components, it is not closely related to the Fourier spectrum [2].

To address these limitations, R.G. Stockwell et al. proposed a new time-frequency analysis method known as the Stockwell transform (S-transform) [11–13], which combines the strengths and mitigates the weaknesses of the STFT and the wavelet transform. The S-transform introduces the multi-resolution analysis characteristics of the wavelet transform while maintaining a direct relationship with the Fourier spectrum. As a result, it has been widely used in the analysis of ground-penetrating radar, seismic wave data, and power systems [14–19].

Because the window function of the S-transform is fixed, its applicability is limited. To enhance the flexibility and time-frequency resolution of the S-transform, Zhang Xianwu et al. modified the window function and proposed the generalized S-transform [19–21]. This variant has been applied in practice to seismic data analysis, mechanical noise, and other signal processing areas. For example, [20] used the flexibility of the generalized S-transform for ground-penetrating radar layer identification. In this study, we leverage the ability of the generalized S-transform to flexibly adjust resolution and apply it to speech analysis. By adjusting the size of the window function through the introduced parameters, we address the limitation of the fixed window function in the standard S-transform. To date, no scholars have applied the generalized S-transform to speech signal processing for obtaining speech spectrograms. This paper introduces the generalized S-transform as a time-frequency analysis method for flexible extraction of speech spectrograms. The extracted spectrogram can be used for tasks such as speech feature parameter extraction [22] and endpoint detection [23]. With the advancement of artificial intelligence, spectrograms are increasingly used in deep learning models for speech recognition [24], music recognition [25], and related research. Extracting speech spectrograms using the generalized S-transform therefore has both theoretical and practical value for speech analysis.

2 Rationale

The generalized S-transform, a time-frequency analysis method, was introduced earlier. Below, we will derive the computational formulas for both the S-transform and the generalized S-transform in detail.

2.1 S-transform

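R.G. Stockwell et al. combined the advantages of the STFT and the wavelet transform by replacing the window function of the STFT with a Gaussian window function [11–14], i.e.:

$$ g(\tau - t) = \frac{1}{a\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2}}{2a^{2}}} \tag{1} $$

where a in Eq (1) is the scale factor, and t and τ denote time. Let a = 1/|f|; the S-transform of h(t) is then obtained in its standard form [11]:

$$ S(\tau, f) = \int_{-\infty}^{\infty} h(t)\, \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{(\tau - t)^{2} f^{2}}{2}}\, e^{-i 2\pi f t}\, dt \tag{2} $$

where t and τ denote time and f denotes frequency, all real numbers. Define the time window function of the S-transform as G(t, f):

$$ G(t, f) = \frac{|f|}{\sqrt{2\pi}}\, e^{-\frac{t^{2} f^{2}}{2}} \tag{3} $$

The time window function G(t, f) of the S-transform has to satisfy the following normalization condition [21]:

$$ \int_{-\infty}^{\infty} G(t, f)\, dt = 1 \tag{4} $$

When Eq (4) is satisfied, integrating the S-transform over τ gives:

$$ \int_{-\infty}^{\infty} S(\tau, f)\, d\tau = H(f) \tag{5} $$

H(f) in Eq (5) is the Fourier transform of h(t), so h(t) can be recovered from S(τ, f):

$$ h(t) = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} S(\tau, f)\, d\tau \right] e^{i 2\pi f t}\, df \tag{6} $$

To improve computational efficiency, the S-transform is usually implemented in the frequency domain, and the frequency-domain forward S-transform is:

$$ S(\tau, f) = \int_{-\infty}^{\infty} H(\eta + f)\, e^{-\frac{2\pi^{2}\eta^{2}}{f^{2}}}\, e^{i 2\pi \eta \tau}\, d\eta, \quad f \neq 0 \tag{7} $$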
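As a rough illustration of the frequency-domain route in Eq (7), the following Python sketch computes a discrete S-transform with NumPy. It assumes the standard discrete form from [11]: for each frequency bin k, the spectrum is shifted by k bins, multiplied by the Gaussian exp(−2π²η²/k²), and inverse transformed over η. The function name `s_transform`, the `dt` argument, and the choice to set the zero-frequency voice to the signal mean are assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def s_transform(h, dt=1.0):
    """Discrete S-transform computed in the frequency domain (illustrative sketch).

    Returns (S, freqs). S has shape (N//2 + 1, N): row k is the voice at
    frequency freqs[k] = k / (N * dt), and columns index time samples.
    """
    h = np.asarray(h, dtype=float)
    n = h.size
    H = np.fft.fft(h)
    # Signed frequency offsets eta in units of bins, in FFT order: 0, 1, ..., -2, -1
    eta = np.fft.fftfreq(n, d=1.0 / n)
    S = np.zeros((n // 2 + 1, n), dtype=complex)
    # Assumption: the f = 0 voice is set to the signal mean (a common convention)
    S[0, :] = h.mean()
    for k in range(1, n // 2 + 1):
        # Fourier transform of the Gaussian time window at frequency bin k
        gauss = np.exp(-2.0 * np.pi**2 * eta**2 / k**2)
        # np.roll(H, -k) realises H(eta + f); the inverse FFT runs over eta
        S[k, :] = np.fft.ifft(np.roll(H, -k) * gauss)
    freqs = np.arange(n // 2 + 1) / (n * dt)
    return S, freqs
```

For a signal of 1000 samples at a 1 ms sampling interval, `freqs` advances in 1 Hz steps. Summing any row of `S` over time returns the Fourier coefficient at that frequency, which is the discrete counterpart of Eq (5).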

2.2 Generalized S-transform (GST)

The generalized S-transform (GST), as proposed in this paper, extends the traditional S-transform by incorporating a low-pass filtering function and introducing two adjustment parameters within the time window function. These parameters adjust the time and frequency resolutions of the GST: by varying their magnitudes, one can control the trade-off between time resolution and frequency resolution in the transformed signal [20, 21].

Define the time window function of the generalized S-transform as W(t, f, λG, λL): (8)

In Eq (8), λG and λL are two separate adjustment parameters. The parameter λG adjusts the width of the time window: the larger λG is, the narrower the time window becomes. This increases the temporal resolution of the generalized S-transform but simultaneously reduces its frequency resolution. Conversely, the frequency resolution of the generalized S-transform is adjusted by modifying the parameter λL: the smaller λL is, the higher the frequency resolution becomes [20].
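The trade-off controlled by λG can be made concrete with a small numerical sketch. It does not reproduce the window of Eq (8); it assumes, purely for illustration, that λG scales the Gaussian of the S-transform as exp(−λG² f² t²/2) and it ignores the low-pass factor controlled by λL. Under that assumption the window has a time standard deviation of 1/(λG|f|) and a frequency standard deviation of λG|f|/(2π). The function name `window_widths` is hypothetical.

```python
import numpy as np

def window_widths(f, lam_g):
    """Time and frequency standard deviations of an ASSUMED Gaussian window.

    Assumption (not the paper's Eq (8)): the window is proportional to
    exp(-lam_g**2 * f**2 * t**2 / 2), with no low-pass factor.
    """
    sigma_t = 1.0 / (lam_g * abs(f))          # seconds
    sigma_f = lam_g * abs(f) / (2.0 * np.pi)  # Hz
    return sigma_t, sigma_f

# Compare three settings of lam_g at f = 100 Hz
for lam_g in (0.5, 1.0, 2.0):
    sigma_t, sigma_f = window_widths(100.0, lam_g)
    print(f"lam_g={lam_g}: sigma_t={sigma_t * 1e3:.2f} ms, sigma_f={sigma_f:.2f} Hz")
```

In this sketch, doubling λG halves the time spread and doubles the frequency spread, while their product stays fixed at 1/(2π). This matches the qualitative behavior described in the text: a larger λG gives higher time resolution and lower frequency resolution.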

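The time window function of the GST, W(t, f, λG, λL), satisfies the same normalization condition as the time window function of the S-transform [13]:

$$ \int_{-\infty}^{\infty} W(t, f, \lambda_G, \lambda_L)\, dt = 1 \tag{9} $$

The GST is then obtained as follows:

$$ GST(\tau, f, \lambda_G, \lambda_L) = \int_{-\infty}^{\infty} h(t)\, W(\tau - t, f, \lambda_G, \lambda_L)\, e^{-i 2\pi f t}\, dt \tag{10} $$

When Eq (9) is satisfied, H(f) can be obtained from the GST:

$$ \int_{-\infty}^{\infty} GST(\tau, f, \lambda_G, \lambda_L)\, d\tau = H(f) \tag{11} $$

H(f) in Eq (11) is the Fourier transform of the signal h(t), so the inverse transform of the GST is:

$$ h(t) = \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} GST(\tau, f, \lambda_G, \lambda_L)\, d\tau \right] e^{i 2\pi f t}\, df \tag{12} $$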

To improve computational efficiency, the generalized S-transform can be implemented in the frequency domain, and the frequency-domain transform equation is: (13)

where L(η, f) in Eq (13) is: (14)

To ensure that the forward and inverse GST are fully invertible, the GST is required to satisfy the following when f = 0: (15)

3 Comparative analysis

The generalized S-transform enhances the original S-transform by introducing a low-pass filter and two adjustable parameters. These parameters allow for the modification of the window function size, thereby altering the resolution of the generalized S-transform. This approach combines the multi-resolution analysis capabilities of the wavelet transform while maintaining a direct relationship with the Fourier spectrum. Below, we provide a theoretical validation of the generalized S-transform and explore its application in speech signal processing.

To examine the characteristics of the generalized S-transform for analyzing time-varying signals, a time-varying signal h(t) is synthesized [20].

The signal h(t) primarily contains three frequency components: 100 Hz, 200 Hz, and 300 Hz. The starting and stopping times for each frequency component are 0 ∼ 449 ms, 274 ∼ 723 ms, and 550 ∼ 999 ms, respectively. The signal h(t) is sampled with a sampling interval of 1 ms, and the total sampling duration is 999 ms. Fig 1 presents a schematic diagram of the sampled signal h(t) [20].
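The test signal described above can be synthesized in a few lines. The sketch below is only an approximation of the setup: the paper states the frequencies, time spans, and sampling interval, while unit amplitudes, cosine phase, and the function name `synth_signal` are assumptions.

```python
import numpy as np

def synth_signal():
    """Three-component test signal sampled every 1 ms over 0..999 ms.

    Assumptions not stated in the paper: unit amplitude and cosine phase
    for every component.
    """
    n = 1000                       # samples at 0, 1, ..., 999 ms
    t_ms = np.arange(n)            # sample times in milliseconds
    t = t_ms * 1e-3                # sample times in seconds
    # (frequency in Hz, start in ms, stop in ms), both ends inclusive
    segments = [(100.0, 0, 449), (200.0, 274, 723), (300.0, 550, 999)]
    h = np.zeros(n)
    for freq, start, stop in segments:
        active = (t_ms >= start) & (t_ms <= stop)
        h[active] += np.cos(2.0 * np.pi * freq * t[active])
    return t, h

t, h = synth_signal()
```

With these spans, the 100 Hz and 200 Hz components overlap from 274 ms to 449 ms, and the 200 Hz and 300 Hz components overlap from 550 ms to 723 ms. These are the intervals in which a transform must resolve two components at once.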

Fig 1. Schematic of the synthesized time-varying signal h(t) after sampling.

https://doi.org/10.1371/journal.pone.0317362.g001

3.1 S-transform of the synthesized signal

The S-transform is performed on the signal h(t), and the results are shown in Fig 2. In Fig 2, the frequency components of the signal and their corresponding times align with the synthesized signal h(t). As the frequency of the signal increases, the frequency resolution of the S-transform decreases, while the time resolution increases [20].

Fig 2. S-transform of the synthesized time-varying signal.

https://doi.org/10.1371/journal.pone.0317362.g002

3.2 Generalized S-transform of synthesized signals

3.2.1 Effect of the tuning parameter λG on the generalized S-transform.

To analyze the effect of the parameter λG on the generalized S-transform, we fix λL at 10 and perform the generalized S-transform on the signal h(t) for different values of λG. Fig 3 shows the results. With λL held constant, the time resolution of the generalized S-transform increases as λG increases, and it exceeds that of the standard S-transform when λG > 1. However, the frequency resolution of the generalized S-transform decreases as λG increases. At λG = 2, because of the high time resolution and low frequency resolution, the frequency components overlap during the periods 274 ms ∼ 449 ms and 550 ms ∼ 723 ms, making it impossible to distinguish them in the time-frequency domain after the generalized S-transform [20].

Fig 3. Generalized S-transform of the signal h(t) with λL fixed at 10 and λG varied.

Panels (a–d) show the results for: (a) λG = 0.1, λL = 10; (b) λG = 0.5, λL = 10; (c) λG = 1, λL = 10; (d) λG = 2, λL = 10.

https://doi.org/10.1371/journal.pone.0317362.g003

3.2.2 Effect of tuning parameter λL on the generalized S-transform.

By fixing λG at 2 and adjusting λL, we perform the generalized S-transform on the signal h(t) for different values of λL. The results are shown in Fig 4. In both Fig 3d and Fig 4a, the frequency components overlap during the periods 274 ms ∼ 449 ms and 550 ms ∼ 723 ms. As λL decreases, the frequency resolution of the generalized S-transform increases; when λL ≤ 0.2, the overlapping frequency components are separated and can be recognized in the time-frequency domain. The transform results in Figs 3d and 4a are the same, indicating that the generalized S-transform results do not change when λG is increased to a certain value while λL remains constant. When λG = 1 and λL = +∞, the generalized S-transform and S-transform results are identical for any signal [20].

Fig 4. Generalized S-transform of the signal h(t) with λG fixed at 2 and λL varied.

Panels (a–d) show the results for: (a) λG = 2, λL = 1; (b) λG = 2, λL = 0.4; (c) λG = 2, λL = 0.2; (d) λG = 2, λL = 0.15.

https://doi.org/10.1371/journal.pone.0317362.g004

4 Generalized S-transform processing of speech signals

A speech signal is inherently non-stationary, but it can be considered quasi-stationary over short periods. Therefore, speech signal processing systems typically segment the signal into short-time frames of 10 ∼ 40 ms for analysis [2]. In this experiment, we used speech data from a Chinese public dataset, with a duration of approximately 2 seconds, a sampling frequency of 8000 Hz, and a single channel. The content is the phrase ‘blue sky and white clouds’ (S1 Video). We selected a frame length of 30 ms and a frame shift of 15 ms for the analysis, as shown in Fig 5. Each frame of the speech signal was processed using the S-transform and the generalized S-transform. Fig 6 shows the speech spectrogram extracted using the S-transform. The temporal resolution of the spectrogram is judged by the clarity of structures parallel to the vertical (frequency) axis, and the frequency resolution by the clarity of structures parallel to the horizontal (time) axis. As shown in Fig 6, the spectrogram obtained by the S-transform exhibits higher temporal resolution in the high-frequency regions and higher frequency resolution in the low-frequency regions. However, there is some frequency overlap in the low-frequency region, and the fundamental frequency and the second formant are not distinctly separated.

Fig 6. Speech spectrogram extracted by the S-transform.

https://doi.org/10.1371/journal.pone.0317362.g006
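The framing step used in this experiment (30 ms frames with a 15 ms shift at 8000 Hz) can be sketched as follows. The function name `frame_signal` and the decision to drop trailing samples that do not fill a whole frame are assumptions; the paper does not say how the last partial frame is handled.

```python
import numpy as np

def frame_signal(x, fs=8000, frame_ms=30, shift_ms=15):
    """Split a 1-D signal into overlapping frames (rows of the result).

    Assumption: trailing samples that do not fill a complete frame are
    dropped rather than zero-padded.
    """
    x = np.asarray(x, dtype=float)
    frame_len = int(round(fs * frame_ms / 1000))  # 240 samples at 8000 Hz
    hop = int(round(fs * shift_ms / 1000))        # 120 samples at 8000 Hz
    if x.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (x.size - frame_len) // hop
    # Row i holds samples i*hop .. i*hop + frame_len - 1
    idx = hop * np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    return x[idx]
```

For a 2 second recording at 8000 Hz (16000 samples), this yields 132 frames of 240 samples each, with consecutive frames sharing half of their samples.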

When applying the generalized S-transform to speech data, it is essential to select appropriate adjustment parameters λG and λL. A practical approach is to first set λL = 1 and then adjust the parameter λG based on the transformation results. As discussed in the previous section, a larger λG increases the temporal resolution of the generalized S-transform. To adjust λG, we can initially set it to 0.01 and then incrementally increase λG by 0.01 in each step, analyzing the corresponding results. Fig 7 illustrates the speech spectrograms obtained with λG values of 0.01, 0.25, 0.5, and 0.95, respectively. As λG increases, the temporal resolution of the speech spectrogram improves, while the frequency resolution decreases. This demonstrates that the temporal resolution of the spectrogram can be effectively controlled by adjusting the value of λG.

Fig 7. Spectrograms extracted with λL fixed at 1 and λG varied. As λG increases, the time resolution of the high-frequency part of the spectrogram gradually increases and the frequency resolution gradually decreases.

Panels (a–d) show the spectrograms for: (a) λG = 0.01, λL = 1; (b) λG = 0.25, λL = 1; (c) λG = 0.5, λL = 1; (d) λG = 0.95, λL = 1.

https://doi.org/10.1371/journal.pone.0317362.g007

In the process of selecting λG, increasing λG gradually improves time resolution but simultaneously reduces the frequency resolution of the generalized S-transform. To balance this trade-off, after determining the adjustment parameter λG, λL should be reduced appropriately. The values of λG and λL are chosen so that the different components in the transform result are well separated. In the analysis of speech signals using the generalized S-transform, λG is set to 0.25 and λL to 1, and λL is then gradually decreased in steps of 0.01. The results of the generalized S-transform for different λL values are compared in Fig 8, which displays the spectrograms for λL = 1, 0.5, 0.1, and 0.01, respectively. As λL decreases, the time resolution of the high-frequency portion of the spectrogram decreases while the frequency resolution increases. When λG = 0.25 and λL = 0.01, the frequency components in the spectrogram of Fig 8(d) are well separated, making it a suitable representation of the corresponding speech signal.

Fig 8. Spectrograms extracted with λG fixed at 0.25 and λL varied. As λL decreases, the time resolution of the high-frequency part of the spectrogram gradually decreases and the frequency resolution gradually increases.

Panels (a–d) show the spectrograms for: (a) λG = 0.25, λL = 1; (b) λG = 0.25, λL = 0.5; (c) λG = 0.25, λL = 0.1; (d) λG = 0.25, λL = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g008

However, experiments indicate that when λG is set to 0.05, adjusting λL does not significantly affect the resolution of the generalized S-transform, as shown in Fig 9. Similarly, when λL is set to 0.01, adjusting λG leaves the resolution largely unchanged, as shown in Fig 10. The spectrograms in Figs 9 and 10 closely resemble the spectrogram in Fig 8(d), for which λG = 0.25 and λL = 0.01. Therefore, the generalized S-transform parameters λG = 0.05 and λL = 0.01 are selected for speech data, as depicted in Fig 11. This flexibility in adjusting the parameters λG and λL allows the generalized S-transform to produce spectrograms with different resolutions, making it highly adaptable for speech signal processing.

Fig 9. Speech spectrograms extracted by the generalized S-transform with λG fixed at 0.05 and λL varied.

Panels (a–d) show the spectrograms for: (a) λG = 0.05, λL = 0.01; (b) λG = 0.05, λL = 0.1; (c) λG = 0.05, λL = 1; (d) λG = 0.05, λL = 10.

https://doi.org/10.1371/journal.pone.0317362.g009

Fig 10. Speech spectrograms extracted by the generalized S-transform with λL fixed at 0.01 and λG varied.

Panels (a–d) show the spectrograms for: (a) λG = 0.01, λL = 0.01; (b) λG = 0.1, λL = 0.01; (c) λG = 1, λL = 0.01; (d) λG = 10, λL = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g010

Fig 11. Spectrogram extracted by the generalized S-transform with the adjustment parameters λG = 0.05, λL = 0.01.

https://doi.org/10.1371/journal.pone.0317362.g011

The S-transform result (Fig 6) shows the distribution of the fundamental frequency, but in the formant region the time resolution is high and the frequency resolution is low, so the fundamental frequency and the second formant are not clearly separated.

Comparison with Fig 11 shows that the generalized S-transform result allows a finer distinction between the fundamental frequency and each formant of the speech signal, with a marked improvement in the frequency resolution of both. The figure shows that the fundamental frequency of the speech signal lies between 100 Hz and 200 Hz. The signal strength at each time point is represented by the grayscale in the speech spectrogram. The curves running parallel to the fundamental frequency track correspond to the formant information over the same time period. The speech spectrogram effectively reflects the dynamic spectral characteristics of the speech signal, providing a visual representation of the speech.

Under the same experimental conditions, time-frequency analyses of the speech signal were performed using the Short-Time Fourier Transform (STFT) and the wavelet transform (using a Morlet mother wavelet with a center frequency parameter of 50 and a bandwidth parameter of 3). The resulting spectrograms are shown in Figs 12 and 13, respectively. The spectrogram obtained from the generalized S-transform in Fig 11 is similar to that from the STFT in Fig 12, with the fundamental frequency primarily distributed between 100 Hz and 200 Hz. Within the effective speech range, the patterns of speech sound distribution are consistent, and the formants are distinct and similarly distributed.

Fig 12. Speech spectrogram extracted by the short-time Fourier transform.

https://doi.org/10.1371/journal.pone.0317362.g012

Fig 13. Speech spectrogram extracted by the wavelet transform (Morlet mother wavelet with a center frequency parameter of 50 and a bandwidth parameter of 3).

https://doi.org/10.1371/journal.pone.0317362.g013

In contrast, Fig 13, which shows the spectrogram obtained via the wavelet transform, demonstrates higher frequency resolution at the fundamental frequency but lower time resolution. For the high-frequency formants, the time resolution is higher while the frequency resolution is lower. Its fundamental frequency distribution is similar to that in Fig 11, and the acoustic texture distribution within the effective speech is comparable, corresponding to the time-domain distribution of the speech signal in Fig 5. Comparative analysis with the spectrograms obtained from the STFT and the wavelet transform indicates that the generalized S-transform is a feasible method for extracting speech signal spectrograms.

5 Evaluating the effectiveness of GST in obtaining speech spectrograms

The experiment demonstrates that the GST is a feasible method for extracting speech spectrograms. To validate the effectiveness of the GST in obtaining accurate speech spectrograms, the cepstrum method is employed to extract the fundamental frequency curve of the speech, which serves as a reference. The effectiveness of the spectrogram is then verified against this reference. The fundamental frequency curve is extracted with the cepstrum method as follows.

The Fourier transform of the signal x(n) is given by:

$$ X(\omega) = \mathrm{FT}[x(n)] \tag{16} $$

Taking the logarithm of the magnitude of the Fourier transform and then applying the inverse Fourier transform yields the cepstrum [2]:

$$ \hat{x}(n) = \mathrm{FT}^{-1}\left[\ln|X(\omega)|\right] \tag{17} $$

That is, the cepstrum sequence of x(n) is the inverse Fourier transform of the logarithm of the magnitude spectrum of x(n), where FT and FT−1 denote the Fourier transform and inverse Fourier transform, respectively.

The same method is applied to the speech signal, where the speech signal x(n) is obtained from the excitation pulse u(n) filtered by the vocal tract response v(n). The speech signal can be modeled as:

$$ x(n) = u(n) * v(n) \tag{18} $$

where * denotes convolution. The Fourier transform of this model is:

$$ X(\omega) = U(\omega)\, V(\omega) \tag{19} $$

Solving for the cepstrum yields:

$$ \hat{x}(n) = \mathrm{FT}^{-1}\left[\ln|U(\omega)|\right] + \mathrm{FT}^{-1}\left[\ln|V(\omega)|\right] \tag{20} $$

From Eqs (16) to (20), it is evident that the cepstrum of the excitation pulse, FT−1ln(|U(ω)|), and the cepstrum of the vocal tract response, FT−1ln(|V(ω)|), are additive in the cepstral domain and can therefore be separated by the cepstral algorithm. This separation allows the excitation signal u(n) to be recovered from its cepstrum FT−1ln(|U(ω)|). The resulting fundamental frequency curve of the speech signal is shown in Fig 14.

Fig 14. Cepstrum method for extracting the fundamental frequency curve of a speech signal.

The top graph shows the time-domain speech signal, the middle graph shows the fundamental period in sampling points, and the bottom graph shows the fundamental frequency calculated from the sampling points of the middle graph.

https://doi.org/10.1371/journal.pone.0317362.g014
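The cepstrum-based pitch extraction described above can be sketched for a single voiced frame as follows. This is a minimal illustration rather than the procedure used to produce Fig 14: the Hamming window, the 80 to 400 Hz search range, the small constant `eps` that guards the logarithm, and the function name `cepstral_f0` are all assumptions.

```python
import numpy as np

def cepstral_f0(frame, fs, fmin=80.0, fmax=400.0, eps=1e-10):
    """Estimate the fundamental frequency (Hz) of one voiced frame.

    Assumptions: Hamming window, search range fmin..fmax, and eps added
    before the logarithm to avoid log(0).
    """
    frame = np.asarray(frame, dtype=float)
    windowed = frame * np.hamming(frame.size)
    # ln|X(w)|: log magnitude spectrum of the frame
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + eps)
    # Real cepstrum: inverse Fourier transform of the log magnitude spectrum
    cepstrum = np.fft.irfft(log_mag, n=frame.size)
    q_min = int(fs / fmax)  # shortest pitch period searched, in samples
    q_max = int(fs / fmin)  # longest pitch period searched, in samples
    # The excitation shows up as a peak at the pitch period (in samples)
    period = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return fs / period
```

Applying this function to each voiced frame and plotting the estimates against frame time gives a fundamental frequency curve of the kind shown in the bottom graph of Fig 14. Unvoiced or silent frames would need a separate voicing decision, which this sketch omits.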

The fundamental frequency curve is obtained using the cepstrum method and mapped onto the spectrogram. To verify the effectiveness of the generalized S-transform (GST) in obtaining the spectrogram, it is compared with the STFT and wavelet transform algorithms, using the fundamental frequency curve as a reference. In this comparison, the STFT uses a fixed Hamming window with a width of 30 ms, while the wavelet transform employs the Morlet wavelet with a center frequency parameter of 3 and a bandwidth parameter of 3. The results of each algorithm are shown in Figs 15 to 17.

Fig 15. Speech spectrogram extracted by the generalized S-transform, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g015

Fig 16. Speech spectrogram extracted by the STFT, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g016

Fig 17. Speech spectrogram extracted by the wavelet transform, with the fundamental frequency curve superimposed.

https://doi.org/10.1371/journal.pone.0317362.g017

Fig 15 shows the spectrogram of the fundamental frequency obtained by GST (with selected parameter values of λG = 0.2 and λL = 0.1, providing high resolution of the fundamental frequency). Based on the alignment of the fundamental frequency curves and the spectrogram’s distribution, it can be confirmed that the speech spectrogram extracted by the generalized S-transform proposed in this paper is effective. Figs 16 and 17 illustrate the spectrograms obtained by the STFT and wavelet transform, respectively. The comparison of these figures, under the same fundamental frequency distribution curve, shows that the distribution of the fundamental frequency is consistent across the spectrograms extracted by all three methods. This verifies that the speech spectrograms extracted by the generalized S-transform method proposed in this paper are both feasible and effective for speech signals.

6 Conclusion

In this paper, we introduce a low-pass filter into the time window function of the S-transform and incorporate two adjustment parameters to control the time and frequency resolution of the generalized S-transform. By selecting appropriate adjustment parameters, the generalized S-transform can be effectively tailored for processing speech signals. Experimental comparisons demonstrate that the generalized S-transform offers high flexibility and feasibility in extracting speech spectrograms. In the future, our research group will explore the application of the generalized S-transform to noisy speech and to training with large datasets.

Supporting information

References

  1. Kwong S, Gang W, Lee CH. A pitch detection algorithm based on time-frequency analysis. In: [Proceedings] Singapore ICCS/ISITA ‘92; 1992. p. 432–436 vol.2.
  2. Rabiner LR, Schafer RW. Theory and Applications of Digital Speech Processing. Journal of Applied Mechanics. 2010;30(1):445–447.
  3. Mulimani M, Koolagudi SG. Acoustic Event Classification Using Spectrogram Features. In: TENCON 2018 - 2018 IEEE Region 10 Conference; 2018. p. 1460–1464.
  4. H M C, Karjigi V, Sreedevi N. Investigation of Different Time-Frequency Representations for Intelligibility Assessment of Dysarthric Speech. IEEE Transactions on Neural Systems and Rehabilitation Engineering. 2020;28(12):2880–2889. pmid:33141673
  5. Chen D, Xiao T, Hu W, Wu Q. Pitches Detection of Mixed Speech using Synchrosqueezing Wavelet Transform. In: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI); 2021. p. 254–260.
  6. Elfaki A, Asnawi AL, Jusoh AZ, Ismail AF, Ibrahim SN, Mohamed Azmin NF, et al. Using the Short-Time Fourier Transform and ResNet to Diagnose Depression from Speech Data. In: 2021 IEEE International Conference on Computing (ICOCO); 2021. p. 372–376.
  7. Kaneko T, Tanaka K, Kameoka H, Seki S. ISTFTNET: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2022. p. 6207–6211.
  8. Radha K, Rao DV, Sai KVK, Krishna RT, Muneera A. Detecting Autism Spectrum Disorder from Raw Speech in Children using STFT Layered CNN Model. In: 2024 International Conference on Green Energy, Computing and Sustainable Technology (GECOST); 2024. p. 437–441.
  9. Kornsing S, Srinonchat J. Enhancement Speech Compression Technique Using Modern Wavelet Transforms. In: 2012 International Symposium on Computer, Consumer and Control; 2012. p. 393–396.
  10. Li R, Bao C, Xia B, Jia M. Speech enhancement using the combination of adaptive wavelet threshold and spectral subtraction based on wavelet packet decomposition. In: 2012 IEEE 11th International Conference on Signal Processing. vol. 1; 2012. p. 481–484.
  11. Stockwell RG, Mansinha L, Lowe RP. Localization of the complex spectrum: the S transform. IEEE Transactions on Signal Processing. 1996;44(4):998–1001.
  12. Stockwell RG. A basis for efficient representation of the S-transform. Digital Signal Processing. 2007;17(1):371–393.
  13. Pinnegar R, Mansinha L. The S-transform with windows of arbitrary and varying shape. GEOPHYSICS. 2003;68.
  14. Liu W, Zhai Z, Fang Z. A Multisynchrosqueezing-Based S-Transform for Time-Frequency Analysis of Seismic Data. Pure and Applied Geophysics. 2024; p. 1–17.
  15. Aunsri N, Jakkaew P, Kuptametee C. Investigation and evaluation of cross-term reduction in masked Wigner-Ville distributions using S-transforms. PLOS ONE. 2024;19(11):e0310721. pmid:39504330
  16. Wei D, Shen J. Synchrosqueezing Fractional S-transform: Theory, Implementation and Applications. Circuits Syst Signal Process. 2023;43(3):1572–1596.
  17. Wang Y, Peng Z, He Y. Time-frequency representation for seismic data using sparse S transform. In: 2016 2nd IEEE International Conference on Computer and Communications (ICCC); 2016. p. 1923–1926.
  18. Thangaraj K, Muruganandham J, Selvaumar S, Jagan R. Analysis of harmonics using S-Transform. In: 2016 International Conference on Emerging Trends in Engineering, Technology and Science (ICETETS); 2016. p. 1–5.
  19. Liu N, Gao J, Zhang B, Li F, Wang Q. Time–Frequency Analysis of Seismic Data Using a Three Parameters S Transform. IEEE Geoscience and Remote Sensing Letters. 2018;15(1):142–146.
  20. ZHANG Xian-Wu FGY GAO Yun-Ze. Application of generalized S transform with lowpass filtering to layer recognition of Ground Penetrating Radar. Chinese Journal of Geophysics (in Chinese). 2013;56(1):309–316.
  21. Rao X, Xu Z, Guan H, Ding Y, Yang X, Yang Y. Cable Defect Location by Using Frequency Domain Reflectometry with Synchrosqueezing Generalized S-Transform. In: 2023 Panda Forum on Power and Energy (PandaFPE); 2023. p. 1178–1182.
  22. Wang GY, Zhang YM, Sun ML, Wang X, Zhang Y. Speech signal feature parameters extraction algorithm based on PCNN for isolated word recognition. In: 2016 International Conference on Audio, Language and Image Processing (ICALIP); 2016. p. 679–682.
  23. Wu D, Tao Z, Wu Y, Shen C, Xiao Z, Zhang X, et al. Speech endpoint detection in noisy environment using Spectrogram Boundary Factor. In: 2016 9th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI); 2016. p. 964–968.
  24. Prasomphan S. Detecting human emotion via speech recognition by using speech spectrogram. In: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA); 2015. p. 1–10.
  25. M R N, Mohan B S S. Music Genre Classification using Spectrograms. In: 2020 International Conference on Power, Instrumentation, Control and Computing (PICC); 2020. p. 1–5.