
Enhancing bowel sound recognition with self-attention and self-supervised pre-training

  • Yansuo Yu ,

    Roles Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Visualization, Writing – original draft, Writing – review & editing

    yuyansuo@bipt.edu.cn

    Affiliation Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China

  • Mingwu Zhang,

    Roles Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization

    Affiliation Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China

  • Zhennian Xie,

    Roles Supervision, Writing – review & editing

    Affiliation Xiyuan Hospital, Chinese Academy of Traditional Chinese Medicine, Beijing, China

  • Qiang Liu

    Roles Project administration, Supervision, Writing – review & editing

    Affiliation Academy of Artificial Intelligence, Beijing Institute of Petrochemical Technology, Beijing, China

Abstract

Bowel sounds, a reflection of the gastrointestinal tract’s peristalsis, are essential for diagnosing and monitoring gastrointestinal conditions. However, the absence of an effective, non-invasive method for assessing digestion through auscultation has resulted in a reliance on time-consuming and laborious manual analysis by clinicians. This study introduces an innovative deep learning-based method designed to automate and enhance the recognition of bowel sounds. Our approach integrates the Branchformer architecture, which leverages the power of self-attention and convolutional gating for robust feature extraction, with a self-supervised pre-training strategy. Specifically, the Branchformer model employs parallel processing of self-attention and convolutional gated Multi-layer Perceptron branches to capture both global and local dependencies in audio signals, thereby enabling effective characterization of complex bowel sound patterns. Furthermore, a self-supervised pre-training strategy is employed, leveraging a large corpus of unlabeled audio data to learn general sound wave representations, followed by fine-tuning on a limited set of bowel sound data to optimize the model’s recognition performance for specific tasks. Experimental results on public bowel sound datasets demonstrate the superior recognition performance of the proposed method compared to existing baseline models, particularly under data-limited conditions, thereby confirming the effectiveness of the self-supervised pre-training strategy. This work provides an efficient and automated solution for clinical bowel sound monitoring, facilitating early diagnosis and treatment of gastrointestinal disorders.

Introduction

Bowel sounds (BS), the direct reflection of gastrointestinal motility, are sporadic noises originating from the gas and fluid movement within the intestines. These intermittent sounds hold substantial significance in assessing the health status of the human gastrointestinal tract [1, 2]. The genesis mechanism of these bowel sounds is intrinsically tied to the dynamic interplay of gas and fluid within the gut, thereby embodying its diverse physiological and pathological states. In the realm of clinical practice, the surveillance of these bowel sounds plays a critical role in diagnosing and monitoring numerous gastrointestinal diseases, such as intestinal obstruction, inflammatory bowel disease, and dyspepsia. However, the study and clinical application of bowel sounds are limited due to their seemingly random nature, wide dynamic range, and potential dietary influences. Conventionally, the gathering and scrutiny of bowel sounds depend on subjective auscultation by medical professionals [3, 4]. This approach, while instinctual and pervasive, is vulnerable to interference from environmental noise, constrained by the doctor’s personal experience, and limited by the brevity of the data collection period. Consequently, it might compromise the accuracy and reliability of the outcomes.

The research background of bowel sound recognition technology dates back to early acoustic analysis methods [5, 6], which primarily relied on traditional signal processing techniques such as Fourier transform to analyze bowel sound signals. Over time, researchers began to explore machine learning algorithms [7], such as Support Vector Machine (SVM) [8] and Naive Bayesian [9], to enhance the automation level of recognition. For instance, Yin Y et al. [8] proposed a bowel sound recognition method based on SVM, which employs Legendre polynomial fitting on the logarithmic amplitude spectrum to effectively filter out non-essential information. This method ingeniously applies Principal Component Analysis (PCA) to dimensionally reduce acoustic features within a specific frequency range, and further optimizes the SVM model with the least squares method and radial basis function kernel, significantly enhancing recognition accuracy. Ulusar [9] utilized a Naive Bayes algorithm for bowel sound pattern classification, coupled with minimum statistics and spectral subtraction techniques for noise reduction, achieving satisfactory results in actual audio recording tests. However, these traditional methods [10, 11] depend on specialized knowledge for feature extraction and struggle to adapt to the complex variations in bowel sound signals.

The advent of Deep Neural Networks (DNN) has marked a significant paradigm shift [12–15] in the domain of bowel sound recognition. Convolutional Neural Networks (CNN), in particular, have harnessed the power of spatial feature extraction in audio signals, capitalizing on their local receptive fields and weight sharing efficiencies [16–18]. Zhao et al. [17] explored a variety of CNN architectures for BS detection within 5-second audio segments, albeit without sufficient detail for method replication. A subsequent study [19] employed a CNN tailored to 1-second segments, achieving promising results with 1-minute audio clips from 28 participants under conditions of maximum Signal-to-Noise Ratio (SNR). Recurrent Neural Networks (RNN), especially Long Short-Term Memory (LSTM) networks [20], have garnered attention for their proficiency in capturing temporal dynamics within BS signals, a testament to their prowess in time-series data analysis. Liu et al. [21] exemplified this by training an LSTM on Mel-Frequency Cepstral Coefficients (MFCC), using non-overlapping windows to identify BS sections exceeding 100 milliseconds. However, they noted a decline in performance amidst varying environmental noise. Moreover, the fusion of CNN and RNN in hybrid models [22, 23] has propelled recognition performance to new heights. These models offer an innovative pathway for the automatic classification and analysis of bowel sounds, adeptly integrating both spatial and temporal features to enhance diagnostic accuracy and reliability.

Research on bowel sound recognition technology has gained increasing attention in recent years, yet it continues to face numerous challenges. On one hand, current models are deficient in automatic feature extraction and handling of long-term dependencies. Although models such as CNNs [24] and LSTMs [21] have achieved certain successes in specific tasks, they are limited in capturing the subtle changes and long-term patterns of bowel sound signals. Moreover, the training process of these models [25, 26] often requires extensive manual intervention and feature engineering, which not only increases the complexity of research but also limits the models’ generalization capabilities. On the other hand, there is a relative scarcity of high-quality bowel sound data. The collection of bowel sound data is complex and costly, and the data quality is influenced by various factors, such as the precision of sensors, changes in the subject’s body position, and dietary habits. These factors lead to inconsistent data quality, adding to the difficulty of analysis and recognition.

Recently, the attention mechanism [27, 28] has emerged as a pivotal force driving technological advancements within the field of deep learning. It plays a significant role across various sub-domains such as computer vision [29, 30], speech recognition [31], and natural language processing [32–34]. The attention mechanism [27], a technique within artificial neural networks that simulates human cognitive attention [35], adjusts the network’s focus on different parts of the input data, thereby endowing the neural network with the capability to concentrate on key features within the data. This mechanism amplifies the weight of the more important aspects of the data while diminishing the influence of others, allowing the network’s focus to be dynamically allocated according to the context’s requirements. Concurrently, the rise of self-supervised pre-training methods [32, 33, 36] has effectively addressed the issue of insufficient labeled data. These methods leverage a vast amount of unlabeled data to extract features with broad applicability, significantly enhancing the model’s generalization capabilities. Notably, pre-trained models like Bidirectional Encoder Representations from Transformers (BERT) [33] have mitigated reliance on extensive manually annotated datasets, bolstering the model’s adaptability to specific tasks after fine-tuning, thus sparking a transformation in the field of natural language processing.

Building upon these technological breakthroughs, this study introduces an innovative method for bowel sound recognition. The method integrates advanced attention mechanisms with self-supervised pre-training strategies, aiming to surmount the challenges of bowel sound recognition under data-limited conditions. Firstly, a bowel sound recognition method based on Branchformer [37, 38], which leverages the parallel processing capabilities of its self-attention and Multi-layer Perceptron with convolutional gating (cgMLP) modules to effectively capture both global and local features of audio signals, thus enabling a deeper understanding of bowel sounds. Secondly, a self-supervised pre-training strategy [33, 34] is introduced. This strategy involves pre-training on a large amount of unlabeled audio data to learn a universal representation of sound waves [39, 40], followed by fine-tuning on a limited set of high-quality bowel sound datasets to optimize the model’s recognition performance for specific tasks. This integrated approach is expected to improve the accuracy and efficiency of bowel sound analysis, offering new insights and tools for medical diagnostics.

The research contributions of this paper are primarily reflected in the following aspects:

  • A bowel sound recognition method based on Branchformer is proposed, which effectively captures both global and local features of audio signals through its parallel branch architecture, enhancing the model’s understanding of bowel sound signals.
  • A self-supervised pre-training strategy is introduced, which addresses the issue of insufficient high-quality bowel sound data by pre-training on unlabeled audio data, thereby improving the model’s generalization capabilities.
  • Fine-tuning on a limited set of high-quality bowel sound datasets optimizes the model’s performance for specific recognition tasks, enabling the training of high-performance bowel sound recognition models even when labeled data is scarce.
  • The effectiveness of the proposed methods is validated through experiments, demonstrating the significant advantages of the Branchformer model and self-supervised pre-training strategy over existing technologies in the task of bowel sound recognition.

The structure of this paper is as follows: The second section will provide a detailed introduction to the proposed bowel sound recognition method based on Branchformer and the specific implementation of the self-supervised pre-training strategy; The third section will present the experimental results and compare them with existing technologies; Finally, the fourth section will summarize the paper and discuss future research directions.

Materials and methods

The comprehensive methodology for training the bowel sound recognition model, as shown in Fig 1, was conducted over an eighteen-month period and includes data collection, preprocessing, and model training stages. The model training procedure is structured into sequential steps: acquisition of raw data, preprocessing, acoustic feature extraction, and model development. During the initial analysis phase, two categories of samples are identified: bowel sound events and non-bowel sound events, both annotated by medical professionals. The preprocessing phase is meticulously executed through four critical steps—segmentation, normalization, pre-emphasis, and window framing—to enhance signal fidelity and mitigate interference with the deep learning model. Following preprocessing, the focus shifts to the extraction of acoustic features from each bowel sound signal frame, a pivotal phase in the development of the recognition model. With the requisite features secured, the model training phase commences, utilizing a partitioned dataset comprising training, testing, and validation subsets to ensure robust training, model validation, and performance assessment.

The selection of parameters, including sampling rate, frame length, frame shift, and window function, plays a crucial role in effectively capturing and representing the intricate characteristics of bowel sound signals. The sampling rate of 44.1kHz was chosen to preserve the high-fidelity information present in the original audio recordings, ensuring accurate representation of the subtle nuances within the bowel sound signals. Additionally, evaluating the model’s performance across varying sampling rates (8kHz and 22.05kHz) allowed for assessing its robustness and adaptability to diverse audio sources. The frame length of 25ms strikes an optimal balance between capturing sufficient signal information and maintaining computational efficiency, aligning with established practices in audio signal processing. The frame shift, or hop length, of 10ms facilitates a reasonable overlap between consecutive frames, enabling the model to effectively capture temporal dependencies while mitigating excessive redundancy. Furthermore, the Hamming or Hanning window function was employed to reduce spectral leakage and minimize discontinuities at frame boundaries, thereby minimizing artifacts in the frequency domain representation, a common consideration in audio processing applications.
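The relationship between these parameters can be made concrete with a short framing sketch (a minimal numpy illustration using the 44.1 kHz sampling rate, 25 ms frame length, and 10 ms frame shift stated above; the helper name `frame_signal` is our own, not from the paper):

```python
import numpy as np

def frame_signal(signal, sr=44100, frame_ms=25, hop_ms=10, window="hamming"):
    """Split a 1-D audio signal into overlapping, windowed frames."""
    frame_len = int(sr * frame_ms / 1000)   # 25 ms -> 1102 samples at 44.1 kHz
    hop_len = int(sr * hop_ms / 1000)       # 10 ms -> 441 samples at 44.1 kHz
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    win = np.hamming(frame_len) if window == "hamming" else np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + frame_len] * win
        for i in range(n_frames)
    ])
    return frames  # shape: (n_frames, frame_len)

# A 2-second clip at 44.1 kHz, matching the recordings in the dataset
x = np.random.randn(2 * 44100)
frames = frame_signal(x)
print(frames.shape)  # (198, 1102)
```

The Hamming window here tapers each frame toward zero at its edges, which is what suppresses the spectral leakage mentioned above.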

Dataset preparation

The public bowel sound dataset used in this study comes from [23], covering a total of 19 participants’ bowel sound records and compiling 1,605 audio files, each with a duration of 2 seconds. These audio files were collected using a device equipped with a dedicated contact microphone and were saved in mono-channel WAV format with a sampling rate of 44.1kHz and a sampling depth of 24 bits. In addition, each audio file is accompanied by a corresponding CSV format label file containing information on bowel sound event locations and highest and lowest frequencies. We processed the audio data with the librosa [41] library in Python and aligned it with the labeled data, resulting in a total of 6,378 data samples. Among these samples, 3,699 were marked as bowel sound fragments, and 2,679 were marked as non-bowel sound fragments. This classification helps to balance the categories in the dataset, thus avoiding bias or skew. In addition, to maintain training consistency and reduce potential impacts due to volume differences, the amplitudes of all samples were normalized to the range [-1, 1]. This enhances the adaptability of the model, allowing it to better handle data recorded under different conditions and thereby improving its generalization ability.
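Amplitude normalization of this kind can be sketched as simple peak normalization (an illustrative snippet; the function name is ours, and `librosa.load` already returns floating-point samples on a comparable scale):

```python
import numpy as np

def peak_normalize(samples, eps=1e-9):
    """Scale a waveform so its maximum absolute amplitude is 1,
    mapping every sample into [-1, 1]."""
    peak = np.max(np.abs(samples))
    return samples / (peak + eps)

x = np.array([0.02, -0.5, 0.25, -0.1])   # a toy, quietly recorded clip
y = peak_normalize(x)
print(y.min(), y.max())  # peak magnitude is now (almost exactly) 1
```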

Framework of the Branchformer model

Branchformer is an innovative encoder architecture primarily designed for speech recognition and understanding tasks [37, 38]. It captures both local and global context information through parallel multi-layer perceptrons (MLPs) and attention mechanisms. The main advantages of this architecture lie in its flexibility, interpretability, customizability, and efficient computational performance. Compared to traditional RNNs, it can better parallelize data processing and more effectively capture long-distance dependencies. Unlike CNNs, it extends its focus beyond local dependencies to encompass global context, facilitated by the self-attention mechanism, which is crucial for processing continuous speech data. Additionally, Branchformer surpasses Transformers [27] by enhancing local feature extraction via the Multi-layer Perceptron with convolutional gating (cgMLP) module and by incorporating interchangeable self-attention variants to enhance flexibility and efficiency.

The overall architecture of the Branchformer encoder, as illustrated in Fig 2, begins with a frontend module designed to process the raw audio sequence and extract low-level acoustic features. Subsequently, a convolutional subsampling module is engaged to temporally downsample the feature sequence. The encoder is composed of N sequentially stacked, identical Branchformer blocks, each adept at capturing both global and local feature dynamics. Within each Branchformer block, three key components function in concert: a global extractor branch, a local extractor branch, and a merging module that integrates their outputs. The global feature extractor operates via a self-attention mechanism tailored to identify long-range dependencies, whereas the local feature extractor operates through a cgMLP to capture nuanced local interactions. This dual-branch structure allows the model to dynamically balance the importance of local and global contexts across layers, enhancing overall performance.

Fig 2. Overall architecture of the Branchformer encoder.

A stack of identical Branchformer blocks is used to capture local and global dependencies.

https://doi.org/10.1371/journal.pone.0311503.g002

Global extractor branch.

The global extractor branch within the Branchformer architecture, as shown in Fig 3(a), is designed to capture the global context of the input sequence through a self-attention mechanism. This branch operates on the premise of the Transformer’s pre-norm [42] setup, where a layer norm (LN) [43], multi-head self-attention (MHSA) [27], and dropout [44] are applied sequentially. The input to this branch is a feature sequence X ∈ ℝ^(T×d), which is transformed into query, key, and value matrices Q, K, V. The self-attention mechanism calculates the scaled dot product between queries and keys, resulting in a weighted combination of values that represents the global context Y_G ∈ ℝ^(T×d). This process is encapsulated in the following equation:

Y_G = Dropout(MHSA(LN(X))), where Attention(Q, K, V) = Softmax(QKᵀ / √d_k) V (1)
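A single-head version of this scaled dot-product attention can be sketched in numpy (a toy illustration only, not the multi-head implementation; the random projection matrices stand in for learned weights):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (T, d) global context

rng = np.random.default_rng(0)
T, d = 6, 8                                       # toy sequence length and width
X = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
Y_G = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)
print(Y_G.shape)  # (6, 8)
```

Because every query attends to every key, each output frame mixes information from the whole sequence, which is what gives this branch its global receptive field.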

Fig 3. Global extractor and local extractor in Branchformer.

Global extractor employs attention to capture global context, while local extractor utilizes the MLP with convolutional gating to extract local context.

https://doi.org/10.1371/journal.pone.0311503.g003

Local extractor branch.

The local extractor branch, as illustrated in Fig 3(b), focuses on the local context within the sequence, achieved through a module known as cgMLP [45]. This module uses depth-wise convolution and linear gating to process the input features and capture local dependencies. The cgMLP consists of channel projections, a Gaussian Error Linear Unit (GELU) activation [46] function, a convolutional spatial gating unit (CSGU) [45], and additional channel projections. The local context Y_L is extracted through a series of operations as follows:

Z = GELU(LN(X) U) (2)
[Z₁, Z₂] = Split(Z) (3)
Z̃ = Z₁ ⊙ DwConv(LN(Z₂)) (4)
Y_L = Dropout(Z̃ V) (5)

where Z, Z̃ are intermediate hidden features, and U, V denote the trainable weights of two linear projections.
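Ignoring layer norm and dropout, the convolutional spatial gating step can be sketched as follows (a toy numpy version; in the real CSGU the depth-wise convolution kernel is learned, whereas here a fixed 3-tap smoothing kernel stands in for it):

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def csgu(Z, kernel=np.array([0.25, 0.5, 0.25])):
    """Convolutional spatial gating: split channels in half, then gate one
    half with a depth-wise (per-channel) temporal convolution of the other."""
    Z1, Z2 = np.split(Z, 2, axis=-1)             # split along channel dim
    conv = np.stack([np.convolve(Z2[:, c], kernel, mode="same")
                     for c in range(Z2.shape[1])], axis=1)
    return Z1 * conv                             # element-wise gating

rng = np.random.default_rng(1)
T, d, d_h = 6, 8, 16                             # hidden width d_h is halved by the split
X = rng.standard_normal((T, d))
U = rng.standard_normal((d, d_h))                # up-projection
V = rng.standard_normal((d_h // 2, d))           # down-projection
Y_L = csgu(gelu(X @ U)) @ V                      # (T, d) local context
print(Y_L.shape)  # (6, 8)
```

The short convolution kernel means each output frame only mixes a few neighboring frames, complementing the global reach of the attention branch.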

Merging branches.

The Branchformer encoder merges the outputs from both the global and local branches using either concatenation or weighted average methods. The concatenation method is the default, where the outputs Y_G and Y_L are combined and then projected back to the original dimension d using a linear projection W ∈ ℝ^(2d×d). This is represented by the following equation:

Y = Dropout(Concat(Y_G, Y_L) W) (6)

Alternatively, a weighted average method can be employed, where the model dynamically generates weights for each branch, providing a more interpretable representation of how global and local dependencies are integrated. This method allows for a flexible and customizable approach to merging the branches’ outputs.
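Both merging options amount to a few lines of linear algebra (sketch with random placeholder weights; in the model, W_merge is learned and the scalar branch weights are generated dynamically):

```python
import numpy as np

rng = np.random.default_rng(2)
T, d = 6, 8
Y_G = rng.standard_normal((T, d))   # global-branch output
Y_L = rng.standard_normal((T, d))   # local-branch output

# Default concatenation merge: project the 2d-wide stack back to width d
W_merge = rng.standard_normal((2 * d, d))
Y_concat = np.concatenate([Y_G, Y_L], axis=-1) @ W_merge   # (T, d)

# Weighted-average merge: per-branch weights produced by the model
w_g, w_l = 0.6, 0.4                 # illustrative values, not learned here
Y_avg = w_g * Y_G + w_l * Y_L       # (T, d)
print(Y_concat.shape, Y_avg.shape)
```

The weighted-average variant is the more interpretable of the two, since the pair (w_g, w_l) directly exposes how much each layer relies on global versus local context.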

Complexity analysis.

The complexity of the Branchformer architecture is analyzed with respect to sequence length T and feature dimension d. The attention-based branch, which may incorporate self-attention or the Fastformer mechanism, has a complexity of O(T²d) for standard self-attention and O(Td) for Fastformer. The cgMLP branch operates with linear complexity O(Td), making it efficient for processing sequences of varying lengths. Moreover, the Branchformer model employs strategies aimed at reducing computational complexity, such as the incorporation of branch dropout during the training phase. This approach expedites inference processes without significantly compromising the model’s accuracy.

Strategy of self-supervised pre-training

Given the limited availability of high-quality bowel sound data, this study adopts a self-supervised pre-training strategy to enhance the model’s understanding of bowel sound signals. The training process of the bowel sound recognition model based on self-supervised pre-training is depicted in Fig 4.

Fig 4. Training process diagram of the bowel sound recognition model based on self-supervised pre-training.

https://doi.org/10.1371/journal.pone.0311503.g004

Initially, in the pre-training phase, the model is trained on unlabeled audio data. By designing various self-supervised tasks, such as predicting future frames or reconstructing masked segments, the model learns a universal representation of sound. Two pre-training models were selected for this study: HuBERT [39] and wav2vec 2.0 [40]. Both are self-supervised speech pre-training models capable of learning useful representations from raw audio. These pre-trained models are then fine-tuned on a limited set of high-quality bowel sound datasets to adapt to specific recognition tasks. This strategy not only improves the model’s ability to recognize bowel sound signals but also alleviates the reliance on a large amount of labeled data. Building upon the pre-trained models, this paper further implements two fine-tuning strategies: full-parameter fine-tuning and low-parameter fine-tuning. Full-parameter fine-tuning involves optimizing all parameters of the pre-trained model to better adapt to the specific bowel sound recognition task. Although this method can achieve higher recognition accuracy, it comes with higher computational costs. To address this issue, this paper also explores a low-parameter fine-tuning method, which optimizes only a portion of the model’s parameters to lighten the model while maintaining high recognition performance.

  • HuBERT [39], as shown in Fig 5, is a novel self-supervised speech representation learning model that employs a clustering-based approach to generate pseudo-labels from raw audio and utilizes these pseudo-labels for self-supervised training. This method enables HuBERT to effectively learn complex patterns and structures within audio signals, thereby excelling in tasks such as speech recognition. By undergoing pre-training and fine-tuning, HuBERT adapts to specific speech processing tasks, thereby enhancing recognition accuracy and efficiency.
  • wav2vec 2.0 [40], as shown in Fig 6, is an advanced self-supervised speech pre-training model that effectively captures important features of speech by training on a large amount of unlabeled audio data. The model’s training process consists of two main stages: pre-training and fine-tuning. During the pre-training phase, the model learns to encode raw audio waveforms and acquires useful representations by addressing masking tasks within the audio. The fine-tuning phase then employs a small amount of labeled data to adjust the model for specific speech recognition tasks.
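The masked-prediction objective that both models share can be illustrated by the masking step alone (a toy numpy sketch of span masking over feature frames; the span length and start probability are illustrative values, not those used by HuBERT or wav2vec 2.0):

```python
import numpy as np

def mask_spans(n_frames, span=4, p_start=0.2, rng=None):
    """Return a boolean mask marking frames hidden from the model.
    Each frame starts a masked span of `span` frames with prob p_start."""
    rng = rng or np.random.default_rng(0)
    starts = rng.random(n_frames) < p_start
    mask = np.zeros(n_frames, dtype=bool)
    for i in np.flatnonzero(starts):
        mask[i:i + span] = True     # spans may overlap and merge
    return mask

rng = np.random.default_rng(3)
features = rng.standard_normal((50, 8))   # 50 frames of 8-dim features
mask = mask_spans(len(features), rng=rng)
corrupted = features.copy()
corrupted[mask] = 0.0                     # masked frames are blanked out
# Pre-training task: predict the content of features[mask] from `corrupted`
print(int(mask.sum()), "of", len(mask), "frames masked")
```

The model only ever sees `corrupted`; learning to recover the hidden frames forces it to model the temporal structure of the audio, which is the representation later reused for bowel sound recognition.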
Fig 5. Illustration of HuBERT, which predicts hidden cluster assignments of the masked frames.

https://doi.org/10.1371/journal.pone.0311503.g005

Fig 6. Illustration of wav2vec 2.0, which jointly learns contextualized acoustic representations and an inventory of discretized acoustic units.

https://doi.org/10.1371/journal.pone.0311503.g006

Results

To evaluate the performance of the bowel sound recognition model, this paper employs accuracy as the experimental evaluation metric. The formula for calculating accuracy is as follows:

Accuracy = (TP + TN) / (P + N) (7)

where TP represents the number of samples that belong to the bowel sound category and are correctly identified as such by the recognition model, TN represents the number of samples that belong to the non-bowel sound category and are correctly identified as such by the recognition model, P represents the total number of samples that belong to the bowel sound category, and N represents the total number of samples that belong to the non-bowel sound category.
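In code, the metric reduces to counting correct predictions over all samples (hypothetical label vectors for illustration, with label 1 = bowel sound and 0 = non-bowel sound):

```python
def accuracy(y_true, y_pred):
    """(TP + TN) / (P + N): fraction of correctly classified samples."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return (tp + tn) / len(y_true)

y_true = [1, 1, 0, 0, 1, 0, 1, 0]   # ground-truth annotations
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]   # model outputs: 3 TP, 3 TN, 1 FP, 1 FN
print(accuracy(y_true, y_pred))  # 0.75
```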

Performance comparison of various prediction models

In this study, a series of comprehensive experimental schemes were devised to deeply evaluate the performance of various deep learning models in the task of bowel sound recognition. The recognition performance of each model was systematically assessed under various conditions, including processing multiple acoustic features, different sampling rates, frame lengths, frame shifts, window functions, and feature lengths. The acoustic features utilized in the experiments included Mel-frequency cepstral coefficients (MFCC), Linear Predictive Coding (LPC), and Mel-Spectrum. In this series of experiments, we selected four representative deep learning models, namely Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), a combined model of CNN and LSTM (CNN+LSTM), and ResNet34 [47].

As illustrated in Table 1 and Fig 7, the Branchformer model demonstrated consistent outperformance relative to other models across a variety of sampling rates and feature types, thereby showcasing its superiority. Utilizing Mel-spectrogram features, the Branchformer model secured remarkable accuracy scores, varying from 0.7012 to 0.7498, corresponding to sampling rates of 8 kHz, 22.05 kHz, and 44.1 kHz, respectively. This performance advantage was statistically significant, as indicated by the distinct separation of boxplot whiskers and the associated low p-values. While the ResNet34 model delivered commendable results within the Mel-spectrogram feature range of 0.6807 to 0.6973, it was outperformed by the Branchformer. The trend observed was that increased sampling rates generally enhanced accuracy for the majority of models and features, presumably because of the greater detail captured in the signal. The Branchformer’s consistent excellence across all sampling rates confirms its robustness and proficiency in managing the acoustic characteristics of bowel sounds.

Fig 7. Box plot of recognition accuracy among different model architectures when utilizing different acoustic features of bowel sound signals across different sampling rates.

Horizontal bars depict Mann–Whitney U tests for significance of differences in accuracy between Branchformer and other models. **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.

https://doi.org/10.1371/journal.pone.0311503.g007

Table 1. Comparative analysis of recognition accuracy for different model architectures using different acoustic features of bowel sound signals across different sampling rates.

Note that sr indicates the sampling rate.

https://doi.org/10.1371/journal.pone.0311503.t001

As depicted in Table 2 and Fig 8, the Branchformer model demonstrated consistent outperformance compared to other models across various feature types and frame lengths, with notably higher accuracy. Specifically, utilizing Mel-spectrogram features, the Branchformer model secured accuracy scores between 0.7584 and 0.7759 across all frame lengths, a statistically significant advantage as indicated by the distinct separation of boxplot whiskers and the corresponding low p-values. While the ResNet34 model provided commendable results within the Mel-spectrogram feature range of 0.6852 to 0.6984, it was outperformed by the Branchformer. The Mel-spectrogram features generally outperformed MFCC and LPC in capturing the nuances of bowel sounds, enhancing accuracy across the majority of models. The impact of frame length on model accuracy was pronounced, yet the Branchformer maintained remarkable stability and precision across all frame lengths, particularly in conjunction with Mel-spectrogram features.

Fig 8. Box plot of recognition accuracy among different model architectures when utilizing different acoustic features of bowel sound signals across different frame lengths.

Horizontal bars depict Mann–Whitney U tests for significance of differences in accuracy between Branchformer and other models. **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.

https://doi.org/10.1371/journal.pone.0311503.g008

Table 2. Comparative analysis of recognition accuracy for different model architectures using different acoustic features of bowel sound signals across different frame lengths.

Note that fl indicates the frame length.

https://doi.org/10.1371/journal.pone.0311503.t002

As illustrated in Table 3 and Fig 9, the Branchformer model demonstrated consistent outperformance relative to other methods across various feature types and frame shift sizes, thereby showcasing its superior accuracy. Utilizing Mel-Spectrogram features, the Branchformer model attained the highest accuracy scores, varying from 0.7053 to 0.7358, for frame shifts of 441, 529, and 661, respectively. This performance advantage was statistically significant, as indicated by the non-overlapping boxplot whiskers and the corresponding low p-values. Although the ResNet34 model provided commendable results within the Mel-Spectrogram feature range of 0.6795 to 0.6879, it was surpassed by the Branchformer. A trend was observed where smaller frame shifts generally led to higher accuracy across the majority of models and features, potentially due to the improved capture of continuous sound characteristics through enhanced frame-to-frame correlation. The Branchformer model maintained remarkable stability and precision across all tested frame shift sizes, which underscores its robustness in processing bowel sound data.

Fig 9. Box plot of recognition accuracy among different model architectures when utilizing different acoustic features of bowel sound signals across different frame shifts.

Horizontal bars depict Mann–Whitney U tests for significance of differences in accuracy between Branchformer and other models. **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.

https://doi.org/10.1371/journal.pone.0311503.g009

Table 3. Comparative analysis of recognition accuracy for different model architectures using different acoustic features of bowel sound signals across different frame shifts.

Note that fs indicates the frame shift.

https://doi.org/10.1371/journal.pone.0311503.t003

As depicted in Table 4 and Fig 10, the Branchformer model demonstrated superior performance over alternative methods when evaluated across varying feature types and lengths. Utilizing Mel-Spectrogram features, it achieved a range of exceptional accuracy scores between 0.7352 and 0.7484 for feature lengths of 131, 164, and 199, respectively. This performance advantage was statistically significant, as illustrated by the non-overlapping boxplot whiskers and the corresponding low p-values. In comparison, the ResNet34 model achieved a lower accuracy range of 0.6781 to 0.6859 with Mel-Spectrogram features. An observed trend indicated that increased feature lengths were associated with higher accuracy for the majority of models, potentially due to the capture of more intricate sound variations. The Branchformer model displayed remarkable adaptability across all feature lengths, with a notable increase in performance for longer features, thereby highlighting its robustness in processing complex acoustic characteristics of bowel sounds.

Fig 10. Box plot of recognition accuracy among different model architectures when utilizing different acoustic features of bowel sound signals across different feature lengths.

Horizontal bars depict Mann–Whitney U tests for significance of differences in accuracy between Branchformer and other models. **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.

https://doi.org/10.1371/journal.pone.0311503.g010

Table 4. Comparative analysis of recognition accuracy for different model architectures using different acoustic features of bowel sound signals across different feature lengths.

Note that tl indicates the feature length.

https://doi.org/10.1371/journal.pone.0311503.t004

As detailed in Table 5 and illustrated in Fig 11, the Branchformer model outperformed the alternative methods across the window functions and feature types evaluated. With Mel-Spectrogram features, it attained accuracies from 0.7135 to 0.7458 across the Hanning, Hamming, and Rectangular windows. This advantage was statistically significant, as indicated by the non-overlapping boxplot whiskers and the corresponding low p-values. In contrast, the ResNet34 model reached a comparatively lower range of 0.6324 to 0.6887 with Mel-Spectrogram features. The Hamming and Hanning windows typically yielded higher accuracy than the Rectangular window for most models, underscoring the importance of appropriate window selection. The Branchformer adapted well across all window functions, highlighting its robustness to diverse acoustic features.
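The three window functions differ in how they taper each frame before the FFT, which governs spectral leakage. A short NumPy illustration (the window length and test tone are arbitrary choices of ours):

```python
import numpy as np

N = 1024
windows = {
    "rectangular": np.ones(N),     # no tapering: sharp edges, most leakage
    "hanning":     np.hanning(N),  # tapers smoothly to 0 at both ends
    "hamming":     np.hamming(N),  # tapers to ~0.08; narrower main lobe
}

# A sine whose frequency falls between FFT bins, to expose spectral leakage
t = np.arange(N)
x = np.sin(2 * np.pi * 10.5 * t / N)

for name, w in windows.items():
    spectrum = np.abs(np.fft.rfft(x * w))
    # Energy far from the tone (bins 100+) measures leakage into distant bins
    leakage = spectrum[100:].sum() / spectrum.sum()
    print(f"{name:12s} relative far-field leakage = {leakage:.4f}")
```

The tapered windows suppress leakage into distant frequency bins, which is consistent with the Hamming and Hanning windows outperforming the Rectangular window here.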

Fig 11. Box plot of recognition accuracy among different model architectures when utilizing different acoustic features of bowel sound signals across different window functions.

Horizontal bars depict Mann–Whitney U tests for significance of differences in accuracy between Branchformer and other models. **p-value ≤ 0.01, ***p-value ≤ 0.001, ****p-value ≤ 0.0001.

https://doi.org/10.1371/journal.pone.0311503.g011

Table 5. Comparative analysis of recognition accuracy for different model architectures using different acoustic features of bowel sound signals across different window functions.

https://doi.org/10.1371/journal.pone.0311503.t005

Performance comparison of self-supervised pre-trained models

To evaluate the efficacy of self-supervised pre-trained models in bowel sound recognition, we investigated three fine-tuning approaches. Full-parameter fine-tuning (FP) updates all weights of the pre-trained model. Low-parameter fine-tuning comes in two variants: LP1 freezes the feature extraction layers, removes the last layer of the classification head, and appends a fully connected layer; LP2 likewise freezes the feature extraction layers but removes the last three layers of the classification head before appending a fully connected layer.
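In PyTorch terms, the three strategies can be sketched as follows. This uses a small stand-in network rather than the actual HuBERT or wav2vec 2.0 checkpoints, and the layer sizes and class count are illustrative assumptions; only the freezing and head-replacement logic mirrors the FP/LP1/LP2 definitions above:

```python
import torch
import torch.nn as nn

def build_model():
    """Stand-in for a pre-trained network: a feature extractor
    followed by a multi-layer classification head."""
    extractor = nn.Sequential(nn.Conv1d(1, 32, 5), nn.ReLU(),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten())
    head = nn.Sequential(nn.Linear(32, 64), nn.ReLU(),
                         nn.Linear(64, 32), nn.ReLU(),
                         nn.Linear(32, 10))
    return extractor, head

def fine_tune(strategy, n_classes=2):
    extractor, head = build_model()
    if strategy == "FP":                    # FP: all parameters stay trainable
        return nn.Sequential(extractor, head)
    for p in extractor.parameters():        # LP1/LP2: freeze the extractor
        p.requires_grad = False
    n_drop = 1 if strategy == "LP1" else 3  # head layers removed (LP1 vs LP2)
    layers = list(head.children())[:-n_drop]
    out_dim = next(m.out_features for m in reversed(layers)
                   if isinstance(m, nn.Linear))
    layers.append(nn.Linear(out_dim, n_classes))  # new task-specific FC layer
    return nn.Sequential(extractor, nn.Sequential(*layers))

model = fine_tune("LP1")
x = torch.randn(4, 1, 400)   # batch of 4 dummy waveform snippets
print(model(x).shape)        # logits over n_classes
```

With the extractor frozen, only the remaining head layers and the new fully connected layer receive gradient updates, which is what makes the LP strategies attractive when labeled bowel sound data are scarce.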

The experimental results presented in Table 6 demonstrate that both HuBERT [39] and wav2vec 2.0 [40], which utilize transfer learning, exhibit strong performance in bowel sound recognition, particularly when employing low-parameter fine-tuning. Full parameter fine-tuning of these models yielded slightly inferior results, possibly due to the limited quantity of bowel sound data available, which may have hindered the models’ ability to fully extract relevant features. However, the implementation of low-parameter fine-tuning strategies LP1 and LP2 revealed that wav2vec 2.0 achieved marginally superior accuracy compared to HuBERT. This discrepancy may be attributed to wav2vec 2.0’s enhanced capability in capturing bowel sound features during the fine-tuning process. Notably, while HuBERT initially underperformed relative to wav2vec 2.0 in full parameter fine-tuning, it demonstrated significant improvement following low-parameter fine-tuning, suggesting greater adaptability and stability.

Table 6. Comparative analysis of recognition accuracy for different speech pre-trained models under different fine-tuning strategies on bowel sound data.

https://doi.org/10.1371/journal.pone.0311503.t006

Although full-parameter fine-tuning performed slightly worse, both HuBERT and wav2vec 2.0 improved significantly after low-parameter fine-tuning, with wav2vec 2.0 slightly ahead. This suggests that, for specialized tasks such as bowel sound recognition with limited data, model accuracy can be improved by choosing the fine-tuning strategy appropriately. These results also confirm the feasibility and effectiveness of applying pre-trained speech models to bowel sound recognition.

Discussion

The pursuit of a reliable and non-invasive gastrointestinal assessment technique through auscultation has been a formidable challenge, largely due to the labor-intensive and variable nature of manual analysis. Traditional methods are not only prone to inconsistencies but also heavily dependent on the expertise and environmental conditions of clinical practitioners. The advent of bowel sound recognition technology, while promising, has encountered its own set of challenges. These include deficiencies in automatic feature extraction and the management of long-term dependencies within models such as CNNs and LSTMs. Furthermore, the scarcity and variability of high-quality bowel sound data, influenced by sensor precision, body position, and dietary habits, have added layers of complexity to the development of reliable automated systems. This study takes on these challenges by integrating advanced attention mechanisms with self-supervised pre-training strategies, aiming to bolster bowel sound recognition under data-limited conditions.

In the domain of bowel sound recognition, conventional models like CNN, LSTM, and ResNet34 encounter intrinsic challenges due to the intricate nature of bowel sound signals. These signals’ multi-frequency and non-linear attributes can surpass the feature extraction capabilities of standard CNNs and LSTMs. Moreover, the models may struggle with the long-term dependencies inherent in bowel sound signals, especially when confronted with the risk of vanishing or exploding gradients that can arise from processing extended sequences. CNNs, proficient in capturing local features, often fall short in incorporating the broader context essential for recognizing bowel sound patterns holistically. Similarly, LSTMs, despite their sequential processing prowess, may fail to fully capture the nuanced and intricate patterns present within the signals. The quality of training data is also a critical factor; insufficient or low-quality data hampers the models’ feature discernment for accurate identification. The presence of environmental noise and inter-individual physiological variation adds to the complexity, hindering the models’ generalization and precise recognition of bowel sounds. When compounded with the risks of overfitting or underfitting, these challenges highlight an urgent requirement for more advanced models capable of accommodating the complexities of bowel sound data.

Our adoption of the Branchformer architecture for bowel sound recognition represents a substantial methodological advance. This deep learning model is adept at capturing both global and local features of audio signals, offering a sophisticated alternative to the CNN and LSTM models that have historically dominated audio signal processing research. The Branchformer's architecture is well suited to the intricate temporal dynamics of bowel sounds, providing a more nuanced approach to feature extraction and pattern recognition. Its consistent outperformance across a range of experimental conditions (varying sampling rates, frame lengths, frame shifts, feature lengths, and window functions) underscores its robustness and adaptability. This stands in contrast to previous studies, in which models often faltered at capturing the subtleties and long-term patterns of bowel sound signals. The model's ability to maintain high accuracy across these variables suggests a potential paradigm shift in bowel sound recognition technology and the emergence of a more reliable and sensitive diagnostic tool.

Our results align with the growing body of research that underscores the importance of deep learning in medical signal processing. However, they also highlight a departure from the status quo by demonstrating the Branchformer model’s exceptional capability in bowel sound analysis. The literature [17, 23] has reported variable success with CNNs, LSTMs, and hybrid models, but none have achieved the level of accuracy we observed with the Branchformer model. This discrepancy may be attributed to the methodological advancements in our study, particularly the use of a novel model architecture that is better suited to the complex characteristics of bowel sounds.

The incorporation of self-supervised pre-training strategies in our study, as evidenced by the performance of HuBERT and wav2vec 2.0, introduces a significant advancement. These models, initially trained on vast amounts of unlabeled data, demonstrated an impressive capacity to learn generalized sound representations. The subsequent fine-tuning on bowel sound datasets, especially with low parameter fine-tuning strategies, accentuates the potential of transfer learning in enhancing model accuracy. This approach not only mitigates the challenge of limited labeled data but also underscores the adaptability of these models to specialized recognition tasks. The performance gains observed with low parameter fine-tuning highlight the strategy’s potential in scenarios where extensive labeled data may not be available. This is particularly relevant in specialized medical fields where data acquisition can be challenging. Our study’s findings suggest that with the right pre-training and fine-tuning strategies, it is possible to achieve high recognition accuracy even with limited data, thus expanding the applicability of deep learning models in such contexts.

The present study, while yielding promising results in bowel sound recognition, is not without its limitations. The experimental conditions, though meticulously designed, may not fully account for the breadth of clinical environments in which these models will ultimately be implemented. The diversity of real-world clinical settings, with varying levels of ambient noise, patient demographics, and diagnostic protocols, presents a complex backdrop against which the robustness of our models must be tested. Furthermore, the generalizability of our findings across different populations—spanning various age groups, ethnicities, and health statuses—remains an open question. It is imperative that future research endeavors diversify the dataset and broaden the scope of experimental conditions to more accurately reflect the clinical diversity and variability encountered in practice. Moreover, the current study’s dataset, while sufficient for our initial explorations, may benefit from further expansion to include a wider array of bowel sound characteristics. This expansion could involve the collection of data from a more diverse patient population, potentially revealing previously unobserved patterns and nuances in bowel sound signals. Such an enriched dataset would not only enhance the model’s predictive power but also its adaptability to different clinical scenarios.

The Branchformer model’s demonstrated efficacy in bowel sound identification opens new horizons for clinical practice, particularly in the realm of non-invasive gastrointestinal diagnostics. Its robust performance across a range of conditions suggests a strong potential for application in a variety of clinical settings, from general practice to specialized gastroenterology units. However, the successful translation of this model from the lab to the clinic will require careful consideration of its integration into existing clinical workflows and its impact on diagnostic accuracy and patient care. Future work should prioritize the expansion of the dataset to capture a more comprehensive representation of bowel sound variability. Cross-validation across diverse demographic groups will be essential to ensure the model’s broad applicability and to identify any potential biases or limitations in its predictive capabilities. Furthermore, the practical application of these models in real-world clinical settings will necessitate rigorous testing and feedback mechanisms to refine their performance and utility.

Conclusion

This paper introduces a novel bowel sound recognition approach, which seamlessly combines the Branchformer architecture with a self-supervised pre-training strategy. Our experiments underscore the method’s considerable efficacy in bowel sound recognition, particularly in scenarios characterized by a paucity of data. The Branchformer model’s ability to concurrently process global and local features significantly bolsters its capacity for discerning bowel sound signals. The incorporation of a self-supervised pre-training strategy not only mitigates the issue of data scarcity but also tailors the model’s performance to specific tasks via fine-tuning. Furthermore, our study elucidates the influence of key parameters—such as sampling rate, frame length, frame shift, and the choice of window function—on the model’s accuracy, offering a foundation for subsequent research endeavors. While our findings are encouraging, the scope for refinement remains. Future endeavors may delve into alternative feature representations and pre-training methodologies to augment the model’s discriminative power and its applicability across diverse datasets.

References

1. Liu CJ, Huang SC, Chen HI. Oscillating gas bubbles as the origin of bowel sounds: A combined acoustic and imaging study. Chin J Physiol. 2010;53:245–253. pmid:21793334
2. Fox MR, Kahrilas PJ, Roman S, Gyawali CP, Scott SM, Rao SS, et al. Clinical measurement of gastrointestinal motility and function: who, when and which test? Nature Reviews Gastroenterology & Hepatology. 2018;15(9):568–579. pmid:29872118
3. Durup-Dickenson M, Christensen MK, Gade J. Abdominal auscultation does not provide clear clinical diagnoses. Surgical Gastroenterology. 2013;18(32):2.
4. Felder S, Margel D, Murrell Z, Fleshner P. Usefulness of bowel sound auscultation: a prospective evaluation. Journal of Surgical Education. 2014;71(5):768–773. pmid:24776861
5. Mansy H, Sandler R. Detection and analysis of gastrointestinal sounds in normal and small bowel obstructed rats. Medical and Biological Engineering and Computing. 2000;38:42–48. pmid:10829389
6. Nowak JK, Nowak R, Radzikowski K, Grulkowski I, Walkowiak J. Automated bowel sound analysis: an overview. Sensors. 2021;21(16):5294. pmid:34450735
7. Ogino Y, Satoh Y, Sakata O. Forecasting bowel sound occurrence frequency by SARIMA model. In: 2019 23rd International Computer Science and Engineering Conference (ICSEC). IEEE; 2019. p. 219–223.
8. Yin Y, Jiang H, Yang W, Wang Z. Intestinal motility assessment based on Legendre fitting of logarithmic bowel sound spectrum. Electronics Letters. 2016;52(16):1364–1366.
9. Ulusar UD. Recovery of gastrointestinal tract motility detection using Naive Bayesian and minimum statistics. Computers in Biology and Medicine. 2014;51:223–228. pmid:24971526
10. Mohnani A, Mohnani A, Eisenfeld L. Electronic stethoscope system for monitoring and analyzing bowel sounds in real-time. Connecticut Medicine. 2018;82(7).
11. Wang G, Wang M, Liu H, Zhao S, Liu L, Wang W. Changes in bowel sounds of inpatients undergoing general anesthesia. BioMedical Engineering OnLine. 2020;19:1–13. pmid:32731903
12. Dimoulas C, Kalliris G, Papanikolaou G, Petridis V, Kalampakas A. Bowel-sound pattern analysis using wavelets and neural networks with application to long-term, unsupervised, gastrointestinal motility monitoring. Expert Systems with Applications. 2008;34(1):26–41.
13. Kim KS, Seo JH, Song CG. Non-invasive algorithm for bowel motility estimation using a back-propagation neural network model of bowel sounds. BioMedical Engineering OnLine. 2011;10:1–10. pmid:21831291
14. Yin Y, Yang W, Jiang H, Wang Z. Bowel sound based digestion state recognition using artificial neural network. In: 2015 IEEE Biomedical Circuits and Systems Conference (BioCAS). IEEE; 2015. p. 1–4.
15. Lin BS, Sheu MJ, Chuang CC, Tseng KC, Chen JY. Enhancing bowel sounds by using a higher order statistics-based radial basis function network. IEEE Journal of Biomedical and Health Informatics. 2013;17(3):675–680. pmid:24592467
16. LeCun Y, Bengio Y, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks. 1995;3361(10):1995.
17. Zhao K, Jiang H, Wang Z, Chen P, Zhu B, Duan X. Long-term bowel sound monitoring and segmentation by wearable devices and convolutional neural networks. IEEE Transactions on Biomedical Circuits and Systems. 2020;14(5):985–996. pmid:32833642
18. Wang N, Testa A, Marshall BJ. Development of a bowel sound detector adapted to demonstrate the effect of food intake. BioMedical Engineering OnLine. 2022;21:1–12. pmid:34983542
19. Zhao K, Jiang H, Yuan T, Zhang C, Jia W, Wang Z. A CNN based human bowel sound segment recognition algorithm with reduced computation complexity for wearable healthcare system. In: 2020 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE; 2020. p. 1–5.
20. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780. pmid:9377276
21. Liu J, Yin Y, Jiang H, Kan H, Zhang Z, Chen P, et al. Bowel sound detection based on MFCC feature and LSTM neural network. In: 2018 IEEE Biomedical Circuits and Systems Conference (BioCAS). IEEE; 2018. p. 1–4.
22. Sitaula C, He J, Priyadarshi A, Tracy M, Kavehei O, Hinder M, et al. Neonatal bowel sound detection using convolutional neural network and Laplace hidden semi-Markov model. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2022;30:1853–1864.
23. Ficek J, Radzikowski K, Nowak JK, Yoshie O, Walkowiak J, Nowak R. Analysis of gastrointestinal acoustic activity using deep neural networks. Sensors. 2021;21(22):7602. pmid:34833679
24. Kutsumi Y, Kanegawa N, Zeida M, Matsubara H, Murayama N. Automated bowel sound and motility analysis with CNN using a smartphone. Sensors. 2022;23(1):407. pmid:36617005
25. Sheu MJ, Lin PY, Chen JY, Lee CC, Lin BS. Higher-order-statistics-based fractal dimension for noisy bowel sound detection. IEEE Signal Processing Letters. 2014;22(7):789–793.
26. Allwood G, Du X, Webberley KM, Osseiran A, Marshall BJ. Advances in acoustic signal processing techniques for enhanced bowel sound analysis. IEEE Reviews in Biomedical Engineering. 2018;12:240–253. pmid:30307875
27. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
28. Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48–62.
29. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. 2020.
30. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision; 2021. p. 10012–10022.
31. Chan W, Jaitly N, Le Q, Vinyals O. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2016. p. 4960–4964.
32. Radford A, Narasimhan K, Salimans T, Sutskever I, et al. Improving language understanding by generative pre-training; 2018. Available from: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
33. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
34. Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. 2019.
35. Corbetta M, Shulman GL. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience. 2002;3(3):201–215. pmid:11994752
36. Baronetto A, Graf LS, Fischer S, Neurath MF, Amft O. Segment-based spotting of bowel sounds using pretrained models in continuous data streams. IEEE Journal of Biomedical and Health Informatics. 2023;27(7):3164–3174. pmid:37155392
37. Peng Y, Dalmia S, Lane I, Watanabe S. Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding. In: International Conference on Machine Learning. PMLR; 2022. p. 17627–17643.
38. Gulati A, Qin J, Chiu CC, Parmar N, Zhang Y, Yu J, et al. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100. 2020.
39. Hsu WN, Bolte B, Tsai YHH, Lakhotia K, Salakhutdinov R, Mohamed A. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 2021;29:3451–3460.
40. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems. 2020;33:12449–12460.
41. McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, et al. librosa: Audio and music signal analysis in Python. In: SciPy; 2015. p. 18–24.
42. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV. Springer; 2016. p. 630–645.
43. Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:1607.06450. 2016.
44. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research. 2014;15(1):1929–1958.
45. Sakuma J, Komatsu T, Scheibler R. MLP-based architecture with variable length input for automatic speech recognition; 2022. Available from: https://openreview.net/forum?id=RA-zVvZLYIy.
46. Hendrycks D, Gimpel K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415. 2016.
47. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. p. 770–778.