
Benchmarking machine learning for bowel sound pattern classification – From tabular features to pretrained models

  • Zahra Mansour,

    Roles Data curation, Software, Writing – original draft

    Affiliations Division AI4Health, Department for Health Services Research, Faculty of Medicine and Health Sciences, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany, Fraunhofer IDMT, Institute Part HSA, Oldenburg, Germany

  • Verena Nicole Uslar,

    Roles Resources, Writing – review & editing

    Affiliation University Clinic for Visceral Surgery, Faculty of Medicine and Health Sciences, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

  • Dirk Weyhe,

    Roles Resources, Writing – review & editing

    Affiliation University Clinic for Visceral Surgery, Faculty of Medicine and Health Sciences, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

  • Danilo Hollosi,

    Roles Funding acquisition, Project administration, Supervision, Writing – review & editing

    Affiliation Fraunhofer IDMT, Institute Part HSA, Oldenburg, Germany

  • Nils Strodthoff

    Roles Methodology, Project administration, Supervision, Validation, Writing – review & editing

    nils.strodthoff@uni-oldenburg.de

    Affiliation Division AI4Health, Department for Health Services Research, Faculty of Medicine and Health Sciences, Carl von Ossietzky Universität Oldenburg, Oldenburg, Germany

Abstract

The development of electronic stethoscopes and wearable recording sensors opened the door to the automated analysis of bowel sound (BS) signals. This enables a data-driven analysis of bowel sound patterns, their interrelations, and their correlation with different pathologies. This work leverages a BS dataset collected from 16 healthy subjects that was annotated according to four established BS patterns. This dataset is used to evaluate the performance of machine learning models in detecting and/or classifying BS patterns. The selection of considered models covers models using tabular features, convolutional neural networks based on spectrograms, and models pre-trained on large audio datasets. The results highlight the clear superiority of pre-trained models, particularly in detecting classes with few samples, achieving an AUC of 0.89 in distinguishing BS from non-BS using a HuBERT model and an AUC of 0.89 in differentiating bowel sound patterns using a Wav2Vec 2.0 model. These results pave the way for a more comprehensive understanding of bowel sounds in general and future machine-learning-driven diagnostic applications for gastrointestinal examinations.

Introduction

Bowel sound auscultation represents an essential non-invasive physical examination technique. Although it is part of the abdominal examination routine worldwide, its medical value remains limited and subjective [1]. This shortcoming mainly concerns the unspecified nature of the bowel sound (BS) signal, its irregular occurrence with comparably long typical time frames, and its dependency on the diet. These aspects make it challenging to identify and characterize the typical waveform or pattern of the BS signal [2].


History of BS analysis.

Analysing the BS was first introduced in the early twentieth century by Cannon [3], who described two types of BS: rhythmic sounds related to the movement of the intestine and continuous random sounds found at different locations within the bowel [3]. Since human auscultation depends on the expertise of the clinician, it cannot reliably distinguish between normal and pathological BS [1]. Subsequent studies therefore tended to automate the analysis of the BS by extracting time-domain features [4], such as amplitude, duration, signal-to-signal interval, and total number of sounds per minute [4–7], aiming to determine how those features might vary in diseases such as irritable bowel syndrome (IBS) [8–11], obstruction [12,13], and post-operative ileus [14,15].

Automatic BS analysis.

The scope of BS analysis ranges from simply enhancing the quality of the recorded BS [11], to detecting BS events in the abdominal recording [16], and in some cases to identifying different kinds of BS patterns [17].

Research gap.

Despite the progress in BS signal detection and pattern recognition, a significant research gap remains. Deep learning (DL) and transfer learning models have notably advanced audio classification tasks, particularly in the analysis of heart and lung sounds; see [18,19] for recent methodological reviews. However, their application to bowel sound classification is still limited and underexplored. Only a few recent studies have employed DL models to detect BS signals [20–22], and no study has yet explored the use of DL or transfer learning for classifying BS patterns. Addressing this gap could unlock new insights into the relationship between BS patterns and gastrointestinal function or pathology.

Summary of this study.

In this study, we investigate which machine learning models are best suited to address the challenges of bowel sound classification and to what extent pretrained models improve classification performance on small bowel sound datasets. To this end, we compare models based on tabular feature extraction, deep learning models (VGG19, ResNet50, AlexNet, and CNN-LSTM) with different spectrogram inputs (standard, log-Mel, and MFCC), and transfer-learning-based models (Wav2Vec 2.0, HuBERT, and VGGish) on two different tasks: distinguishing between non-BS and BS signals, and classifying between non-BS signals and the BS patterns (SB, MB, CRS, and HS). The methodological framework underlying this study is summarized in Fig 1.

Fig 1. Schematic diagram summarizing the course of the study.

It starts by recording the BS by placing the sensor on the subjects’ abdomen, proceeds to labeling the signal into non-BS and BS patterns (SB, MB, CRS, HS), and ends with segmenting the signal into overlapping 2-second windows, which are later used for classification with three different methods (tabular features, spectrograms, and pre-trained models).

https://doi.org/10.1371/journal.pone.0338911.g001

Materials and methods

Background


Physiological origin.

It has been established that two main sources produce BS: the contractions and relaxations of the abdominal wall smooth muscle, which push the intestines’ contents through the gastrointestinal (GI) tract, and the interaction of gas and partially digested food moving through the intestines [23]. This explanation has been confirmed by electromyography and anatomy research, which showed that the intestinal contraction activity appears as a burst or explosive peak (very short contraction) or as clustered contractions with regular release and slope duration [2]. In light of this explanation, the most commonly observed bowel sound patterns can be divided into two main groups: quick clicks that may appear as a single burst or a sequence of bursts, and longer contractions with rubbing or piping noises.

Investigated BS patterns.

Therefore, the bowel sound patterns investigated in this study combine the types described by the medical terms of abdominal auscultation with the bowel sound patterns mentioned in previous studies [24–26]. The chosen patterns align with existing studies and the physiological phenomena described by experts during abdominal auscultation. They are also exhaustive, i.e., all occurring sound events in the dataset can be categorized as one of the considered patterns. The bowel sound patterns in the time and frequency domain are presented in Fig 2 and can be categorized into four groups:

  • Single Burst (SB)/Solitary Clicks: SB is the most frequent bowel sound type. It has a damped sinusoidal-like nature and appears as pulses with significantly reduced energy. SB is likely caused by small GI contractions or splashes of the digestive content. It is characterized by a very short duration, ranging between 10–30 ms, with a frequency peak around 400 Hz, see Fig 2a.
  • Multiple Burst (MB)/Repeated Clicks: MB consists of multiple SBs with short, inconsistent silent gaps between adjacent components. It might be produced by the mixing of food and gas inside the intestine. Each component in the MB spectrogram looks quite similar, with slight differences in bandwidth and amplitude. MB is longer than SB, with a duration ranging from 40 to 1500 ms, see Fig 2b.
  • Continuous Random Sound (CRS)/Crepitating Sweeps: CRS contains a main clustered contraction, sounding like crepitating or rumbling, which is probably caused by fluid and gas being pushed through the intestine. CRS has a comparably longer duration, from 200 ms to 4000 ms without any noticeable silent gaps, and a higher frequency range of 500–1700 Hz, see Fig 2c.
  • Harmonic Sound (HS)/Whistling Sweeps: This is the least frequent bowel sound pattern, composed of three to four clear frequency components in the spectrogram, which produce piping notes with a whistling-like sound. It is produced by intestinal stenosis or gastric activity. HS duration ranges from 50 ms to 1500 ms, and the HS spectrogram contains peak frequencies at multiples of the fundamental frequency, which is usually relatively low, around 200 Hz, see Fig 2d.
Fig 2. Bowel sound patterns examples extracted from the dataset used in this study, the left column represents the signal in the time domain, and the right column describes the signal in the frequency domain.

Starting from the top, (a) Single Burst (SB), (b) Multiple Burst, (c) Continuous Random Sound (CRS), and (d) Harmonic Sound (HS).

https://doi.org/10.1371/journal.pone.0338911.g002

Related work


Signal processing.

The majority of studies investigating bowel sounds primarily focus on extracting and enhancing the BS signal using advanced signal processing techniques. Examples include the Wavelet Transform-Based Stationary-Non-Stationary Filter [11,27], Short Time Fourier Transformation [28], and Fractal Dimension [17]. Similarly, machine learning techniques, such as the use of jitter and shimmer parameters [29], neural networks [30,31], and support vector machines [32], have been employed for BS signal processing and analysis.

BS identification and classification.

While these methods provide robust tools for signal detection and enhancement, only a limited number of studies have focused on identifying and classifying BS patterns. The early work by Dimoulas et al. [33] introduced abdominal sound pattern analysis (ASPA), combining BS time-energy alterations with electrical or pressure signals. This study marked the first step toward understanding BS patterns in the context of gastrointestinal function.

Machine learning approaches.

Subsequent studies faced the difficulty that bowel sound signals are inherently complex, exhibiting non-linear and highly variable characteristics. This makes them particularly challenging for traditional machine learning methods. For example, Kumar et al. [34] achieved an accuracy of 75% using MFCC features and an SVM classifier in a noise-free environment, highlighting the limitations of classical approaches in capturing the full dynamics of bowel sounds. Similarly, since convolutional neural networks (CNNs) are effective at extracting local features, they often struggle to capture the broader temporal context needed to recognize complete bowel sound patterns. Zhao et al. [22], for instance, reported an accuracy of 76.8% using a CNN-based approach.

Pretrained models.

More recent work has explored pretrained models in the analysis of bodily sounds. Panah et al. [35] investigated Wav2Vec for heart sound classification and found that fine-tuning even with small datasets yielded robust results, suggesting that such models can reduce the need for large, annotated medical datasets. In the context of bowel sound detection, pretrained models were first introduced for BS identification by Baronetto et al., where various deep neural networks (DNNs) were developed for BS recognition [16]. Later, Yu et al. [36] demonstrated the ability of Wav2Vec to effectively model the temporal structure of bowel sounds, achieving an accuracy of 85.3%. These findings point to the potential of pretrained models to provide more nuanced and adaptive feature extraction compared to traditional or earlier deep learning methods.

Research gap.

Nevertheless, only a very limited number of studies have gone one step further to investigate not only the detection of bowel sounds but also their classification into main patterns. One of the very first studies employed a wavelet neural network to differentiate between two BS patterns and three types of interfering noises [17]. Another significant study analyzed two hours of BS recordings from ten participants, identifying five distinct BS types based on waveform and spectrogram characteristics. This study also investigated inter-subject variations in the duration and proportion of these patterns [24]. Similarly, research involving 1,140 BS recordings from 15 volunteers identified four unique BS patterns [37].

More recently, advancements in deep learning have facilitated new approaches to BS analysis. A CNN-based detector was developed to identify four types of BS and to investigate the acoustic effects of food consumption. This study proposed that total BS duration increases post-consumption, highlighting the relevance of studying BS patterns in relation to the digestion cycle [26].

To the best of our knowledge, no prior research has examined the use of transfer learning in supervised techniques for both BS identification and BS pattern classification while at the same time providing a comprehensive benchmark across different input representations.

Dataset


Recording protocol.

The research protocol underlying the data recordings was approved by the Medical Ethics Committee of the University of Oldenburg (2022-056). Before participation, subjects were informed about the purpose of the study and consented to the use of the collected data in an anonymized form. Additionally, four subjects consented to make their data publicly available. The dataset can be accessed at the following link [38].

During the recording session, each subject was asked to lie down in a supine position. The BS were then recorded using three auscultation sensors: 3M Littmann Digital CORE, Thinklabs One Digital Stethoscope, and SonicGuard sensor [39]. Each sensor was placed individually at the umbilical region of the abdominal wall, and a 10-minute recording was performed with each sensor.

Participant information.

Sixteen healthy subjects (8 male, 8 female) participated in the study carried out at PIUS Hospital in Oldenburg, Germany. To ensure diversity in BS characteristics, participants were selected with a wide range of meal timings before recording, ranging from 30 minutes to 12 hours, and covering an age distribution from 20 to 51 years (mean: 28.6 ± 5.5 years). All subjects confirmed that they were free from any digestive system-related diseases, ensuring that recorded bowel sounds were representative of normal physiological conditions.

Annotation.

At the beginning of the study, a medical expert provided training to the staff responsible for creating the labeling instructions. The recorded data were then manually labeled into non-BS signals and BS signals. The BS signals were further subdivided into bowel sound patterns: SB, MB, CRS, and HS. The distribution of the classes in the dataset showed that SB is the most frequent BS class, representing 58.53% of the BS signals, followed by MB (21.78%) and CRS (17.24%), while the least common class is HS, which accounted for only 2.45% of the events. This distribution is consistent on a subject level, as represented in Fig 3, where SB patterns are present in all subjects’ signals. Similarly, MB and CRS are observed, though to a lesser extent, while HS appears in only 56% of the subjects.

Fig 3. Distribution of bowel sound pattern (SB, MB, CRS, HS) counts by subject. The box represents the interquartile range (IQR), with the horizontal line inside the box indicating the median.

Whiskers extend to 1.5 IQR, and points outside the whiskers represent outliers. The SB group shows the highest median and variability, while the HS group has the smallest counts and minimal variability.

https://doi.org/10.1371/journal.pone.0338911.g003

Dataset preparation.

After collecting the BS recordings and annotating them according to the BS patterns, each wave signal was divided into smaller chunks by applying a sliding window with a 2-second duration and a stride of 1 s, i.e., with 50% overlap. For each segment, the relative duration of each pattern is calculated by dividing its duration by the total segment length. The segment label is then assigned based on the pattern with the highest relative duration, which is considered the dominant pattern within that segment. The distribution of the classes within the dataset before and after the segmentation is represented in the first column of Fig 4. The number of SB samples dropped sharply because of the segmentation, while the number of non-BS samples increased. This can be attributed to the short duration of SB, which is smaller than the segmentation window (2 s), making it less likely for SB to be the dominant event.
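The segmentation and dominant-pattern labeling described above can be sketched as follows; the function name, the non-BS default class, and the annotation format (start, end, class) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def segment_with_dominant_label(signal, annotations, sr, win_s=2.0, stride_s=1.0):
    """Split an audio signal into overlapping windows and assign each window
    the class whose annotated duration dominates the segment.

    signal      : 1-D numpy array of audio samples
    annotations : list of (start_s, end_s, class_name) tuples
    sr          : sampling rate in Hz
    """
    win, stride = int(win_s * sr), int(stride_s * sr)
    segments = []
    for start in range(0, len(signal) - win + 1, stride):
        t0, t1 = start / sr, (start + win) / sr
        # accumulate the overlap duration of each annotated class in the window
        durations = {}
        for (s, e, cls) in annotations:
            overlap = max(0.0, min(e, t1) - max(s, t0))
            durations[cls] = durations.get(cls, 0.0) + overlap
        # unannotated time counts toward the non-BS class
        annotated = sum(v for k, v in durations.items() if k != "non-BS")
        durations["non-BS"] = durations.get("non-BS", 0.0) + (t1 - t0) - annotated
        dominant = max(durations, key=durations.get)
        segments.append((signal[start:start + win], dominant))
    return segments
```

For example, a 2-second window covering 1.5 s of SB and 0.5 s of silence would be labeled SB, whereas a window with mostly unannotated audio would fall back to non-BS.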

Fig 4. The first column shows the distribution of the classes (non-BS, SB, MB, CRS, HS) within the dataset before and after segmentation using a 2-second overlapping window, which results in a reduced count of patterns, particularly for short-duration patterns like SB and HS that are shorter than the window size.

The second column shows the distribution of the 5 classes within the train, validation, and test sets by using random and stratified splits. The distribution of the classes is closer to the required ratio (70%, 15%, 15%) using stratified splitting.

https://doi.org/10.1371/journal.pone.0338911.g004

Dataset splits.

Before building the model, the dataset is split into the train, validation, and test sets (70%, 15%, 15%) respectively. Because of the unequal distribution of classes within the dataset, performing random splitting produces non-homogeneous class distributions across the three sets. Thus, stratified splitting is applied to produce samples where the proportion of the groups is maintained. To this end, we use the stratified splitting procedure from [40,41], which supports multi-label stratification while respecting patient assignments. The result of the stratification process is visualized in Fig 4.
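The study uses the multi-label, patient-respecting stratification procedure of [40,41]; as a simplified illustration of the underlying idea, the sketch below performs a plain per-segment stratified 70/15/15 split with scikit-learn. The feature and label arrays are synthetic placeholders, and the per-patient grouping of the actual procedure is not reproduced here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))                            # placeholder segment features
y = rng.choice(5, size=1000, p=[.4, .3, .15, .12, .03])   # imbalanced class labels

# 70% train, then split the remaining 30% in half -> 15% validation / 15% test,
# stratifying both splits so class proportions are preserved in every set
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=0)
```

With a plain random split, rare classes such as the 3% class above can end up over- or under-represented in the smaller sets; stratification keeps each set's class ratios close to the full dataset's.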

Experimental setup.

Experiments were conducted on single NVIDIA L40 GPUs. The environment used Conda, PyTorch 2.1.0 (with CUDA 11.8), and Transformers 4.39.3, along with common libraries for audio processing, data analysis, and visualization.

Considered classification models

The process undertaken in this study is visually summarized in Fig 1. We roughly distinguish three approaches: Tree-based classifiers applied to tabular features, convolutional neural networks applied to spectrograms, and finetuned pre-trained models operating on spectrograms or raw waveform input.

Classification using tabular features.

Feature sets.

Two sets of features from the openSMILE (open-source Speech and Music Interpretation by Large-space Extraction) toolkit were used. The first is the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing, a set of 62 acoustic features developed to capture emotional states from speech; it covers frequency, energy, spectral, and temporal characteristics [42]. The second set is ComParE 2016, which was tailored for the Computational Paralinguistics Challenge; it covers features related to phonation, articulation, and spectral shape, and contains 6300 acoustic features such as loudness, MFCCs (Mel-frequency cepstral coefficients), and temporal and spectral features [43].

Gradient-boosting decision trees.

Each set of features is then used separately as input to two gradient boosting frameworks, XGBoost (Extreme Gradient Boosting) [44] and CatBoost (Categorical Boosting) [45]. Both XGBoost and CatBoost are used with the following hyperparameters: max depth = 7, learning rate = 0.001, and number of iterations = 50.

Classification using spectrogram.

This approach relies on converting 1D raw audio signals into 2D visual data represented by the spectrogram, allowing the use of deep learning architectures from computer vision for audio classification. Three types of spectrograms are tested:

Standard spectrogram.

The standard spectrogram gives a visual representation of how the frequency components of the signal evolve over time. It is obtained by calculating the squared magnitude of the Short-Time Fourier Transform (STFT) of the wave segment. The STFT is given by [46]

X(t,f) = \sum_{n} x[n]\, w[n-t]\, e^{-j 2\pi f n} (1)

where w is a Hamming window applied to the signal to divide it into smaller portions in the time domain. The standard spectrogram is then computed as the power at each time-frequency point, i.e., as S(t,f) = |X(t,f)|^2.
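Eq (1) and the power spectrogram can be reproduced in a few lines of NumPy; the FFT size, hop length, and 400 Hz test tone (a typical SB peak frequency) are illustrative choices, and production code would typically use scipy.signal.stft or librosa.stft instead:

```python
import numpy as np

def power_spectrogram(x, n_fft=256, hop=128):
    """Standard spectrogram S(t, f) = |STFT(t, f)|**2 with a Hamming window."""
    w = np.hamming(n_fft)
    # windowed frames, one per hop, as in Eq. (1)
    frames = [x[i:i + n_fft] * w for i in range(0, len(x) - n_fft + 1, hop)]
    X = np.fft.rfft(np.asarray(frames), axis=1)   # one spectrum per frame
    return np.abs(X) ** 2                         # power at each (t, f) point

# 2-second test tone at 400 Hz, sampled at 8 kHz
sr = 8000
t = np.arange(2 * sr) / sr
S = power_spectrogram(np.sin(2 * np.pi * 400 * t))
```

With n_fft = 256 at 8 kHz the frequency resolution is 31.25 Hz, so the 400 Hz tone concentrates its power near bin 13.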

Log Mel-Spectrogram.

The Log Mel-Spectrogram reflects how energy is distributed over frequency and time while mimicking human auditory perception: a log transformation compresses the dynamic range of the spectrogram and highlights lower-amplitude features, which makes the data more balanced for downstream tasks. The Log Mel-Spectrogram is calculated by passing the power spectrogram |X(t,f)|^2 through a series of Mel filter banks to convert the frequency scale according to the following equation [47]:

\mathrm{Mel}(k,t) = \sum_{f} |X(t,f)|^2\, H_k(f) (2)

where H_k(f) represents the triangular filters that map linear frequencies to the Mel scale. The log of the Mel-spectrogram is then computed via

\mathrm{LogMel}(k,t) = \log\big(\mathrm{Mel}(k,t) + \epsilon\big) (3)

where \epsilon is a small constant for numerical stability.

Mel-Frequency Cepstral Coefficients (MFCC).

The last representation is the MFCC spectrogram, which is widely used in audio processing due to its ability to capture essential spectral properties of the sound and represent them in a compact form. It is calculated by computing a Mel-spectrogram (with the same STFT and filter bank configuration) as described above and then applying a discrete cosine transform (DCT) to obtain the first 13 cepstral coefficients, which provides a lower-dimensional representation that retains significant information while discarding less relevant details. It is defined by the following equation [48],

\mathrm{MFCC}_n(t) = \sum_{m=1}^{M} \log\big(\mathrm{Mel}(m,t)\big) \cos\left[\frac{\pi n (m - 0.5)}{M}\right] (4)

where M is the total number of Mel frequency bins and n indexes the cepstral coefficients.
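The chain from Eq (2) to Eq (4), i.e., Mel filter bank, log compression, and DCT, can be sketched in NumPy as follows. The filter-bank construction and the choice of 40 Mel bands are standard textbook defaults rather than parameters reported in the paper; librosa.feature.melspectrogram and librosa.feature.mfcc wrap the same pipeline:

```python
import numpy as np

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters H_k(f) mapping linear frequency bins to the Mel scale."""
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for k in range(1, n_mels + 1):
        left, center, right = bins[k - 1], bins[k], bins[k + 1]
        for f in range(left, center):             # rising slope of triangle k
            fb[k - 1, f] = (f - left) / max(center - left, 1)
        for f in range(center, right):            # falling slope of triangle k
            fb[k - 1, f] = (right - f) / max(right - center, 1)
    return fb

def log_mel_and_mfcc(power_spec, sr, n_mels=40, n_mfcc=13):
    """Eq. (2): Mel energies; Eq. (3): log compression; Eq. (4): DCT keeping
    the first 13 cepstral coefficients. power_spec has shape (time, n_fft//2+1)."""
    n_fft = 2 * (power_spec.shape[1] - 1)
    mel = power_spec @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log(mel + 1e-10)
    m = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (m + 0.5)) / n_mels)
    return log_mel, log_mel @ dct.T
```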

Each one of the mentioned spectrograms is then used separately as an input to the following four deep-learning models:

AlexNet.

The AlexNet model was introduced by Krizhevsky et al. in 2012 [49] and demonstrated the potential of deep learning in computer vision. It is a convolutional neural network designed to learn hierarchical features from raw pixel data, consisting of eight layers: five convolutional layers with large filters, each followed by a Rectified Linear Unit (ReLU) to help the network learn more complex patterns, and three fully connected layers to make the predictions.

VGG19.

The VGG19 model was originally introduced by the Visual Geometry Group (VGG) at the University of Oxford in 2014 [50]. It consists of 19 layers: 16 convolutional layers for feature extraction followed by 3 fully connected layers for classification. Each convolutional layer uses a small filter (3x3) with a stride of 1 to learn various spatial hierarchies of features, while the pooling layers use a 2x2 kernel with a stride of 2 to reduce the spatial dimensions of the feature maps.

ResNet50.

The Residual Networks architecture, proposed by He et al. in 2015 [51], introduced the concept of residual connections. ResNet typically starts with a 7x7 convolutional layer for feature extraction, followed by a series of residual blocks. In this study, we use a ResNet50 model. The AlexNet, VGG19, and ResNet50 models used in this study were initialized with weights pre-trained on the ImageNet dataset.

CNN-LSTM.

For the last model, two neural network architectures were combined: Convolutional Neural Networks (CNNs) for extracting the spatial features and Long Short-Term Memory (LSTM) networks for capturing the temporal dependencies in the data by using the memory cells, which allow the network to retain and update information over long periods [52]. In this study, the CNN-LSTM model consists of 2 CNN layers with a kernel size of 3, a stride of 1, and padding of 1, in addition to Max-pooling applied after each convolution. They are followed by an LSTM, which consists of 2 layers stacked together with 128 hidden features. The LSTM layer outputs a sequence of hidden states for each time step, and the final hidden state is then passed through the fully connected layer to produce the final classification output.

All the mentioned models were trained for 20 epochs with a learning rate of 0.0001, using cross-entropy loss and the Adam optimizer.
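A PyTorch sketch of the CNN-LSTM and the reported training setup (Adam, learning rate 1e-4, cross-entropy) may clarify the data flow; the channel counts and input size are assumptions, as the paper does not specify them:

```python
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    """Two conv layers (kernel 3, stride 1, padding 1) with max-pooling,
    a 2-layer LSTM with 128 hidden units over the time axis, and a linear
    head on the final hidden state."""
    def __init__(self, n_mels=40, n_classes=5):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(), nn.MaxPool2d(2),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), 128, num_layers=2,
                            batch_first=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, spec):                  # spec: (batch, 1, mels, time)
        z = self.cnn(spec)                    # (batch, 32, mels/4, time/4)
        z = z.permute(0, 3, 1, 2).flatten(2)  # (batch, time/4, 32 * mels/4)
        _, (h, _) = self.lstm(z)
        return self.head(h[-1])               # logits from the last hidden state

model = CNNLSTM()
# training setup as reported: Adam optimizer, lr 1e-4, cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
```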

Classification using transfer learning.

Transfer learning is a machine learning technique where a model is trained on a large dataset and learns a set of features, then it is fine-tuned and reused as a starting point for a model with a smaller but related dataset. This technique was reported to reduce computational cost and to lead to better performance in the related field of heart sound analysis, in particular for small sample sizes [53]. In this study, three models that were trained on a large audio dataset are used as a feature extractor:

Wav2Vec 2.0.

Wav2Vec 2.0 [54] is a transformer model in which a CNN encodes the raw audio into latent features; it is trained using a contrastive loss function to perform self-supervised pre-training on unlabeled speech data. Wav2Vec 2.0 masks portions of the input audio representation during training and learns to predict these masked segments based on the surrounding context.

HuBERT.

HuBERT (Hidden-Unit Bidirectional Encoder Representations from Transformers) is a transformer-based model, again operating on features extracted from the raw waveform by a shallow CNN [55]. Similar to Wav2Vec 2.0, it uses a masking strategy, but HuBERT introduces hidden units by generating initial labels in the first training phase and then refining them from the model’s output in subsequent training phases.

Both models were trained only on very large speech datasets, but the effectiveness of the learned representations has been demonstrated for general audio classification tasks [55].

VGGish.

VGGish was developed as an adaptation of the VGG network to extract features from raw audio signals [56]. It was trained in a supervised fashion on AudioSet, which contains over 2 million human-labeled 10-second audio segments covering a wide variety of sound events. While the original VGG networks process 2D image data, VGGish takes log-Mel spectrograms of the audio signal as input. VGGish consists of a series of convolutional layers, each followed by ReLU activations and max-pooling layers, followed by fully connected layers that compress the extracted features into a 128-dimensional embedding vector.
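Fine-tuning a pretrained model for 5-class BS pattern classification can be sketched with the Hugging Face transformers API. To stay self-contained, the sketch below builds a tiny randomly initialized Wav2Vec 2.0 configuration instead of downloading pretrained weights; a real experiment would load them with Wav2Vec2ForSequenceClassification.from_pretrained(...). All sizes here are illustrative:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2ForSequenceClassification

# tiny config so the example runs without network access; in practice use
# Wav2Vec2ForSequenceClassification.from_pretrained("facebook/wav2vec2-base",
#                                                   num_labels=5)
config = Wav2Vec2Config(
    hidden_size=64, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=128, num_feat_extract_layers=2,
    conv_dim=(32, 32), conv_kernel=(10, 3), conv_stride=(5, 2),
    num_labels=5,                       # non-BS, SB, MB, CRS, HS
)
model = Wav2Vec2ForSequenceClassification(config)
model.eval()

# two 1-second mono waveforms at 16 kHz (random stand-ins for BS segments)
wave = torch.randn(2, 16000)
with torch.no_grad():
    logits = model(input_values=wave).logits   # one score per class
```

Fine-tuning then proceeds as usual: attach the segment labels, minimize cross-entropy over the classification head (and optionally the transformer layers), and evaluate with AUC.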

Evaluation

We base the evaluation on the area under the receiver operating characteristic curve (AUC), a commonly used ranking metric to assess the overall discriminative power of classification algorithms without the need to set a decision threshold. For binary classification, the AUC is computed from the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR).

As we are dealing with multi-class classification, the AUC is calculated for each class against all other classes (one-vs-rest approach). The overall AUC is then computed by averaging the AUC over all individual classes. The multi-class AUC is given by:

\mathrm{AUC}_{\mathrm{overall}} = \frac{1}{C} \sum_{i=1}^{C} \mathrm{AUC}_i (5)

where C is the number of classes and \mathrm{AUC}_i refers to the AUC for class i.
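Eq (5) amounts to a macro-averaged one-vs-rest AUC. The NumPy sketch below uses the rank-statistic formulation of binary AUC (the probability that a randomly chosen positive outranks a randomly chosen negative), which is equivalent to sklearn.metrics.roc_auc_score with multi_class='ovr' and average='macro':

```python
import numpy as np

def binary_auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs where the
    positive sample scores higher; ties count as half."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg))

def macro_ovr_auc(scores, y):
    """Eq. (5): average the one-vs-rest AUC over all C classes.

    scores : (n_samples, C) array of per-class scores
    y      : (n_samples,) array of integer class labels
    """
    C = scores.shape[1]
    return np.mean([binary_auc(scores[:, c], (y == c).astype(int))
                    for c in range(C)])
```

A perfect classifier yields an overall AUC of 1.0, while constant scores yield the chance level of 0.5 for every class.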

Results and discussion

Classifying between non-BS and BS

The AUC values of the binary classification between non-BS and BS signals using the different models are listed in Fig 5.

Fig 5. The AUC values of the binary classification between non-BS and BS signals are listed as follows: tree-based models on tabular features (green; bars 1 to 4), spectrogram-based models (blue; bars 5 to 16), and transfer-learning-based models (orange; bars 17 to 19).

The figure demonstrates that pretrained models offer more reliable performance compared to other techniques, while feature-based models struggle with the bowel sound detection task.

https://doi.org/10.1371/journal.pone.0338911.g005


Tabular features.

Using tabular input features, both feature libraries (GeMAPS and ComParE) showed similar performance, with slightly better results achieved by using CatBoost (AUC = 0.7) compared to XGBoost (AUC = 0.69).

Spectrograms.

Turning to spectrogram-based models, all the DL models (VGG19, ResNet50, AlexNet, and CNN-LSTM) exhibit similar levels of performance with the lowest AUC obtained by using a Standard spectrogram with CNN-LSTM (AUC = 0.71), while the highest AUC is achieved by using MFCC with ResNet50 (AUC = 0.78). Among the spectrograms, MFCC shows better performance compared to the two other types of spectrograms in the majority of cases.

Transfer learning.

Finally, concerning classification using transfer learning, all the models that were pre-trained on a large dataset and then fine-tuned for classification between BS and non-BS demonstrate the best performance among all models. The highest AUC is reached by HuBERT, followed by Wav2Vec 2.0 and VGGish (AUC = 0.89, 0.85, and 0.85, respectively). This effectiveness can be attributed to the fact that a pretrained model starts with weights already optimized on a large dataset, requiring fewer adjustments during fine-tuning.

Classification between non-BS and BS patterns (SB, MB, CRS, HS)

In this section, the results of classifying between 5 classes (non-BS, SB, MB, CRS, and HS) are presented. The AUC of each class and the overall AUC are listed in Table 1. A bar chart showing the AUC value for each class, along with the overall AUC of the best model from each group of classifiers is presented in Fig 6.

Fig 6. The binary AUC values for each class vs the rest (NONE vs all, SB vs all, MB vs all, CRS vs all, and HS vs all) and the overall AUC, achieved by the best models within each technique group.

This diagram highlights that pretrained models provide stable performance across all bowel sound pattern classes, whereas other techniques face greater challenges, particularly with patterns that have fewer samples, such as SB and HS.

https://doi.org/10.1371/journal.pone.0338911.g006

Table 1. The binary AUC values for each class vs the rest (NONE vs all, SB vs all, MB vs all, CRS vs all, and HS vs all) and the overall AUC, using tabular features, spectrograms, and transfer learning-based models.

The best performance is achieved by the Wav2Vec 2.0 model for all classes.

https://doi.org/10.1371/journal.pone.0338911.t001


Tabular features.

Among the classifiers operating on tabular features, CatBoost showed the best performance with both feature sets (GeMAPS and ComParE), with overall AUCs of 0.72 and 0.6, respectively.

Spectrograms.

Turning to spectrogram-based classifiers and comparing the three spectrograms (standard spectrogram, log-Mel, and MFCC), MFCC appears to capture more useful and discriminative information than the other spectrogram types, leading to superior model performance. The most powerful model is CNN-LSTM, followed by ResNet50, VGG19, and AlexNet, with AUCs of 0.75, 0.74, 0.72, and 0.7, respectively, using MFCC as input. This suggests that for classifying between the different BS patterns, it is beneficial to model temporal dynamics or sequential dependencies in addition to spatial patterns. Moreover, CNN-LSTM showed more reliability in classifying the classes with the smallest sample sizes, such as SB (AUC = 0.71) using MFCC as input, in comparison to VGG19, ResNet50, and AlexNet, with AUCs of 0.58, 0.58, and 0.56 using the same input.

Transfer learning.

Similar to the binary classification task, the pre-trained models demonstrate the best performance among all evaluated models, with the highest AUC achieved by Wav2Vec 2.0, followed by HuBERT and VGGish (0.89, 0.86, and 0.79, respectively). Notably, all pre-trained models performed reliably across all classes regardless of class sample size, which makes them highly effective at addressing the challenges associated with small datasets. This observation is clearly illustrated in Fig 6, which presents the per-class AUC values achieved by the best model within each group. Even for the best-performing models, a marked drop in performance can be observed for the tabular-feature-based and spectrogram-based models on classes with smaller sample sizes (e.g., SB and HS) compared to those with larger sample sizes (e.g., NONE, MB, and CRS). This limitation is effectively resolved by the pre-trained models.
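A common way to reuse such pre-trained models on a small dataset is to freeze the encoder and train only a lightweight classification head on pooled frame embeddings. The sketch below illustrates this pattern in PyTorch with a toy stand-in encoder; loading real Wav2Vec 2.0 or HuBERT checkpoints (e.g., via torchaudio or the transformers library) follows the same interface, and this is an assumption of the illustration rather than the paper's exact setup:

```python
import torch
import torch.nn as nn

class FrozenEncoderClassifier(nn.Module):
    """Frozen pretrained encoder + trainable linear head.
    `encoder` stands in for Wav2Vec 2.0 / HuBERT: any module mapping
    raw waveforms (batch, samples) to frame embeddings (batch, frames, dim)."""
    def __init__(self, encoder, dim, n_classes=5):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():
            p.requires_grad = False  # keep pretrained weights fixed
        self.head = nn.Linear(dim, n_classes)

    def forward(self, wav):
        with torch.no_grad():
            feats = self.encoder(wav)        # (batch, frames, dim)
        return self.head(feats.mean(dim=1))  # mean-pool frames -> logits

class ToyEncoder(nn.Module):
    """Trivial stand-in that maps each sample to a `dim`-d embedding."""
    def __init__(self, dim=32):
        super().__init__()
        self.proj = nn.Linear(1, dim)
    def forward(self, wav):                  # (batch, samples)
        return self.proj(wav.unsqueeze(-1))  # (batch, samples, dim)

model = FrozenEncoderClassifier(ToyEncoder(), dim=32)
logits = model(torch.randn(2, 1600))  # 2 dummy waveforms -> (2, 5) logits
```

Only the head's parameters receive gradients, which keeps the number of trainable weights small relative to the dataset size.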

Discussion


Main finding.

Our findings provide further support for the studies mentioned in section (Related work), which suggest that traditional machine learning techniques struggle with the complex nature of bowel sound signals, whereas pretrained models offer promising tools to overcome these limitations. In our work, pretrained models, particularly Wav2Vec 2.0 and HuBERT, proved to be the most effective among those tested, delivering consistent and reliable performance in both bowel sound detection and pattern classification. A possible explanation is that pretraining on large speech corpora enables these models to learn generalizable acoustic representations, some of which, including low-level acoustic features such as pitch and tone, may transfer to non-speech bodily sounds. Unlike previous studies that primarily focused on detecting the presence of bowel sounds, our study additionally explored the classification of specific bowel sound patterns. Using a different dataset, we achieved an AUC of 0.89 in the detection task, comparable to previously reported results. To our knowledge, this is the first study to evaluate deep learning models on both bowel sound detection and pattern classification within a single framework.

Limitations.

Since our dataset included only healthy subjects, further research is needed to evaluate the model’s generalizability to more diverse populations. In particular, future studies should assess whether the model maintains its performance when applied to data from patients with gastrointestinal pathologies or to datasets containing a mix of healthy and patient recordings. This would help determine the robustness of the model in the presence of pathological variations in bowel sound characteristics.

Conclusion

Detecting bowel sound activity and patterns has been a significant challenge for over a century. The lack of high-quality data further complicates this task for traditional machine learning models. This work demonstrated the feasibility of differentiating different kinds of bowel sound patterns and provided a comparative assessment of machine-learning approaches, ranging from decision trees on tabular expert features, over CNNs on spectrograms, to models pre-trained on large audio datasets. The best-performing models originated from the latter category, emphasizing the promising role of pre-trained models in overcoming the challenges associated with small datasets. These advances pave the way for a better understanding of the value of bowel sound monitoring in gastrointestinal examinations, in particular for further studies exploring how these patterns correlate with gastrointestinal tract diseases.

While the raw dataset underlying this study cannot be shared due to consent restrictions, the full implementation of our approach, along with training scripts, is available in a corresponding code repository https://github.com/AI4HealthUOL/bowel-sound-classification.

References

  1. Felder S, Margel D, Murrell Z, Fleshner P. Usefulness of bowel sound auscultation: a prospective evaluation. J Surg Educ. 2014;71(5):768–73. pmid:24776861
  2. Nowak JK, Nowak R, Radzikowski K, Grulkowski I, Walkowiak J. Automated bowel sound analysis: an overview. Sensors (Basel). 2021;21(16):5294. pmid:34450735
  3. Cannon WB. Auscultation of the rhythmic sounds produced by the stomach and intestines. American Journal of Physiology-Legacy Content. 1905;14(4):339–53.
  4. Dalle D, Devroede G, Thibault R, Perrault J. Computer analysis of bowel sounds. Comput Biol Med. 1975;4(3–4):247–56. pmid:1139905
  5. Bray D, Reilly R, Haskin L, McCormack B. Assessing motility through abdominal sound monitoring. In: Proceedings of the 19th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. vol. 6; 1997. p. 2398–400.
  6. Ranta R, Louis-Dorr V, Heinrich C, Wolf D, Guillemin F. Principal component analysis and interpretation of bowel sounds. In: Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society. vol. 1; 2004. p. 227–30.
  7. Ranta R, Louis-Dorr V, Heinrich C, Wolf D, Guillemin F. Digestive activity evaluation by multichannel abdominal sounds analysis. IEEE Trans Biomed Eng. 2010;57(6):1507–19. pmid:20172793
  8. Craine BL, Silpa M, O’Toole CJ. Computerized auscultation applied to irritable bowel syndrome. Dig Dis Sci. 1999;44(9):1887–92. pmid:10505730
  9. Craine BL, Silpa ML, O’Toole CJ. Enterotachogram analysis to distinguish irritable bowel syndrome from Crohn’s disease. Dig Dis Sci. 2001;46(9):1974–9. pmid:11575452
  10. Craine BL, Silpa ML, O’Toole CJ. Two-dimensional positional mapping of gastrointestinal sounds in control and functional bowel syndrome patients. Dig Dis Sci. 2002;47(6):1290–6. pmid:12064804
  11. Hadjileontiadis L, Kontakos T, Liatsos C, Mavrogiannis C, Rokkas T, Panas S. Enhancement of the diagnostic character of bowel sounds using higher-order crossings. In: Proceedings. vol. 1022; 1999. p. 1027.
  12. Yoshino H, Abe Y, Yoshino T, Ohsato K. Clinical application of spectral analysis of bowel sounds in intestinal obstruction. Dis Colon Rectum. 1990;33(9):753–7. pmid:2390910
  13. Ching SS, Tan YK. Spectral analysis of bowel sounds in intestinal obstruction using an electronic stethoscope. World J Gastroenterol. 2012;18(33):4585–92. pmid:22969233
  14. Kaneshiro M, Kaiser W, Pourmorady J, Fleshner P, Russell M, Zaghiyan K. Postoperative gastrointestinal telemetry with an acoustic biosensor predicts ileus vs. uneventful GI recovery. J Gastrointest Surg. 2016;20:132–9.
  15. Spiegel B, Kaneshiro M, Russell M, Lin A, Patel A, Tashjian VC. Validation of an acoustic gastrointestinal surveillance biosensor for postoperative ileus. J Gastrointest Surg. 2014;18:1795–803.
  16. Baronetto A, Graf LS, Fischer S, Neurath MF, Amft O. Segment-based spotting of bowel sounds using pretrained models in continuous data streams. IEEE J Biomed Health Inform. 2023;27(7):3164–74. pmid:37155392
  17. Dimoulas C, Kalliris G, Papanikolaou G, Kalampakas A. Long-term signal detection, segmentation and summarization using wavelets and fractal dimension: a bioacoustics application in gastrointestinal-motility monitoring. Comput Biol Med. 2007;37(4):438–62. pmid:17026978
  18. Ren Z, Chang Y, Nguyen TT, Tan Y, Qian K, Schuller BW. A comprehensive survey on heart sound analysis in the deep learning era. IEEE Comput Intell Mag. 2024;19(3):42–57.
  19. Wanasinghe T, Bandara S, Madusanka S, Meedeniya D, Bandara M, De la Torre Díez I. Lung sound classification for respiratory disease identification using deep learning: a survey. Int J Onl Eng. 2024;20(10):115–29.
  20. Kutsumi Y, Kanegawa N, Zeida M, Matsubara H, Murayama N. Automated bowel sound and motility analysis with CNN using a smartphone. Sensors (Basel). 2022;23(1):407. pmid:36617005
  21. Zhao K, Jiang H, Yuan T, Zhang C, Jia W, Wang Z. A CNN based human bowel sound segment recognition algorithm with reduced computation complexity for wearable healthcare system. In: Proceedings of the 2020 IEEE International Symposium on Circuits and Systems (ISCAS); 2020. p. 1–5.
  22. Zhao K, Feng S, Jiang H, Wang Z, Chen P, Zhu B, et al. A binarized CNN-based bowel sound recognition algorithm with time-domain histogram features for wearable healthcare systems. IEEE Trans Circuits Syst II. 2022;69(2):629–33.
  23. Allwood G, Du X, Webberley KM, Osseiran A, Marshall BJ. Advances in acoustic signal processing techniques for enhanced bowel sound analysis. IEEE Rev Biomed Eng. 2019;12:240–53. pmid:30307875
  24. Du X, Allwood G, Webberley KM, Osseiran A, Marshall BJ. Bowel sounds identification and migrating motor complex detection with low-cost piezoelectric acoustic sensing device. Sensors (Basel). 2018;18(12):4240. pmid:30513934
  25. Dimoulas C, Kalliris G, Papanikolaou G, Petridis V, Kalampakas A. Bowel-sound pattern analysis using wavelets and neural networks with application to long-term, unsupervised, gastrointestinal motility monitoring. Expert Systems with Applications. 2008;34(1):26–41.
  26. Wang N, Testa A, Marshall BJ. Development of a bowel sound detector adapted to demonstrate the effect of food intake. BioMed Eng OnLine. 2022;21(1).
  27. Liatsos C, Hadjileontiadis LJ, Mavrogiannis C, Patch D, Panas SM, Burroughs AK. Bowel sounds analysis: a novel noninvasive method for diagnosis of small-volume ascites. Dig Dis Sci. 2003;48(8):1630–6. pmid:12924660
  28. Ficek J, Radzikowski K, Nowak JK, Yoshie O, Walkowiak J, Nowak R. Analysis of gastrointestinal acoustic activity using deep neural networks. Sensors (Basel). 2021;21(22):7602. pmid:34833679
  29. Kim KS, Seo JH, Ryu SH, Kim MH, Song CG. Estimation algorithm of the bowel motility based on regression analysis of the jitter and shimmer of bowel sounds. Comput Methods Programs Biomed. 2011;104(3):426–34. pmid:21429614
  30. Yin Y, Yang W, Jiang H, Wang Z. Bowel sound based digestion state recognition using artificial neural network. In: Proceedings of the IEEE Biomedical Circuits and Systems Conference (BioCAS); 2015. p. 1–4.
  31. Kim K, Seo J, Song C. Non-invasive algorithm for bowel motility estimation using a back-propagation neural network model of bowel sounds. Biomed Eng Online. 2011.
  32. Yin Y, Jiang H, Feng S, Liu J, Chen P, Zhu B, et al. Bowel sound recognition using SVM classification in a wearable health monitoring system. Sci China Inf Sci. 2018;61(8).
  33. Dimoulas C, Kalliris G, Papanikolaou G, Kalampakas A. Abdominal sounds pattern classification using advanced signal processing and artificial intelligence. In: Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation (CIMCA 2003). Vienna; 2003. p. 71–82.
  34. Kumar TS, Søiland E, Stavdahl Ø, Fougner AL. Pilot study of early meal onset detection from abdominal sounds. In: 2019 E-Health and Bioengineering Conference (EHB); 2019. p. 1–4.
  35. Panah DS, Hines A, McKeever S. Exploring Wav2vec 2.0 model for heart murmur detection. In: 2023 31st European Signal Processing Conference (EUSIPCO); 2023. p. 1010–4.
  36. Yu Y, Zhang M, Xie Z, Liu Q. Enhancing bowel sound recognition with self-attention and self-supervised pre-training. PLoS One. 2024;19(12):e0311503. pmid:39739653
  37. Zhang X, Zhang X, Chen W, Li C, Yu C. Improving speech depression detection using transfer learning with wav2vec 2.0 in low-resource environments. Sci Rep. 2024;14(1):9543. pmid:38664511
  38. Mansour Z, Uslar V, Weyhe D, Hollosi D, Strodthoff N. Bowel sounds signal; 2025. https://doi.org/10.6084/m9.figshare.28595741.v1
  39. Mansour Z, Uslar V, Weyhe D, Hollosi D, Strodthoff N. SonicGuard sensor-a multichannel acoustic sensor for long-term monitoring of abdominal sounds examined through a qualification study. Sensors (Basel). 2024;24(6):1843. pmid:38544106
  40. Wagner P, Strodthoff N, Bousseljot R-D, Kreiseler D, Lunze FI, Samek W, et al. PTB-XL, a large publicly available electrocardiography dataset. Sci Data. 2020;7(1):154. pmid:32451379
  41. Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. Springer; 2011.
  42. Eyben F, Wöllmer M, Schuller B. openSMILE—The Munich versatile and fast open-source audio feature extractor. In: Proceedings of the 18th ACM International Conference on Multimedia (MM); 2010. p. 1459–62.
  43. Weninger F, Eyben F, Schuller BW, Mortillaro M, Scherer KR. On the acoustics of emotion in audio: what speech, music, and sound have in common. Front Psychol. 2013;4:292. pmid:23750144
  44. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’16); 2016. p. 785–94.
  45. Prokhorenkova L, Gusev G, Vorobev A, Dorogush A, Gulin A. CatBoost: unbiased boosting with categorical features. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018); 2018.
  46. Rizal A, Handzah VAP, Kusuma PD. Heart sounds classification using short-time fourier transform and gray level difference method. ISI. 2022;27(3):369–76.
  47. Abdul ZKh, Al-Talabani AK. Mel frequency cepstral coefficient and its applications: a review. IEEE Access. 2022;10:122136–58.
  48. Abdul ZKh, Al-Talabani AK. Mel frequency cepstral coefficient and its applications: a review. IEEE Access. 2022;10:122136–58.
  49. Krizhevsky A, Sutskever I, Hinton G. ImageNet classification with deep convolutional neural networks. In: Neural Information Processing Systems. vol. 25; 2012.
  50. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations; 2015.
  51. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 770–8.
  52. Bae SH, Choi IK, Kim NS. Acoustic scene classification using parallel combination of LSTM and CNN. In: DCASE; 2016. p. 11–5.
  53. Koike T, Qian K, Kong Q, Plumbley MD, Schuller BW, Yamamoto Y. Audio for audio is better? An investigation on transfer learning models for heart sound classification. Annu Int Conf IEEE Eng Med Biol Soc. 2020;2020:74–7. pmid:33017934
  54. Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: a framework for self-supervised learning of speech representations. In: Advances in Neural Information Processing Systems. vol. 33; 2020. p. 12449–60.
  55. Hsu W-N, Bolte B, Tsai Y-HH, Lakhotia K, Salakhutdinov R, Mohamed A. HuBERT: self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Trans Audio Speech Lang Process. 2021;29:3451–60.
  56. Gemmeke JF, Ellis DPW, Freedman D, Jansen A, Lawrence W, Moore RC, et al. Audio set: an ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2017. https://doi.org/10.1109/icassp.2017.7952261