UMobileNetV2 model for semantic segmentation of gastrointestinal tract in MRI scans

Neha Sharma; Sheifali Gupta; Deepali Gupta; Punit Gupta; Sapna Juneja; Asadullah Shah; Asadullah Shaikh

doi:10.1371/journal.pone.0302880

Abstract

Gastrointestinal (GI) cancer is the leading tumour of the gastrointestinal tract and the fourth most significant cause of tumour death in men and women. A common treatment for GI cancer is radiation therapy, which involves directing a high-energy X-ray beam onto the tumor while avoiding healthy organs. To deliver high doses of X-rays, a system is needed that accurately segments the GI tract organs. This study presents a UMobileNetV2 model for semantic segmentation of the small intestine, large intestine, and stomach in MRI images of the GI tract. The model uses MobileNetV2 as an encoder in the contraction path and UNet layers as a decoder in the expansion path. The UW-Madison database, which contains MRI scans from 85 patients and 38,496 images, is used for evaluation. This automated technology can speed up cancer therapy by aiding the radiation oncologist in segmenting the organs of the GI tract. The UMobileNetV2 model is compared to three transfer learning models, Xception, ResNet 101, and NASNet Mobile, which are used as encoders in the UNet architecture. The model is analyzed using three distinct optimizers: Adam, RMS, and SGD. The UMobileNetV2 model combined with the Adam optimizer outperforms all other transfer learning models. It obtains a dice coefficient of 0.8984, an IoU of 0.8697, and a validation loss of 0.1310, demonstrating its ability to reliably segment the stomach and intestines in MRI images of gastrointestinal cancer patients.

Citation: Sharma N, Gupta S, Gupta D, Gupta P, Juneja S, Shah A, et al. (2024) UMobileNetV2 model for semantic segmentation of gastrointestinal tract in MRI scans. PLoS ONE 19(5): e0302880. https://doi.org/10.1371/journal.pone.0302880

Editor: Sally Mohammed Farghaly, Alexandria University Faculty of Nursing, EGYPT

Received: April 12, 2023; Accepted: April 14, 2024; Published: May 8, 2024

Copyright: © 2024 Sharma et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: Request URL open source UW-Madison database.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

The term "gastrointestinal tract" refers to the entire digestive system of humans and animals (from mouth to anus). The gastrointestinal (GI) tract, also known as the digestive tract, is a lengthy tubular organ that spans from the oral cavity to the rectum. Its primary function is to facilitate the breakdown and assimilation of nutrients from ingested food [1]. The gastrointestinal system is essential for food digestion: it breaks food down into smaller molecules that can be readily absorbed by the body, extracts nutrients for energy and growth, and eliminates waste products from the body [2]. It is also closely connected with the immune system and essential for maintaining overall health and well-being.

In the past 20 years, substantial advancements have been made to automatically diagnose the disorder in the digestive system and other human organs [1–5]. Gastrointestinal illness is a prevalent manifestation of these conditions [6]. Gastrointestinal cancer is the most prevalent kind of cancer in both males and females [5]. In 2019, the global incidence of gastrointestinal cancer exceeded 5 million cases. GLOBOCAN 2020 projections indicate that Gastrointestinal (GI) cancer claimed the lives of 800,000 individuals, accounting for 7.7% of all cancer-related deaths. It ranks as the fourth leading cause of cancer mortality in both men and women [7]. In the year 2020, a total of 1.1 million new instances of gastrointestinal (GI) cancer were identified, representing 5.6% of all cancer cases [8].

Treatment for gastrointestinal cancer depends on the patient's age and health and on the stage and type of cancer [7]. The most common therapies for GI cancer are surgery, radiation treatment, and chemotherapy. Radiation treatment is usually given for 15 minutes daily for a few weeks. Radiation oncologists use this technique to deliver high doses of radiation to the tumor while avoiding the stomach and intestines [6]. With more recent technology such as MRI and linear accelerator devices, commonly known as MR-Linacs, oncologists can observe potential day-to-day changes in the tumor and intestines [8–10]. To deliver high doses of X-ray radiation, a system must precisely segment the organs of the gastrointestinal tract. Such automated technology can speed up cancer therapy by aiding the radiation oncologist in segmenting the organs of the gastrointestinal tract.

Recent advancements show that deep learning algorithms are capable of segmenting GI tract organs [6–10]. Organ segmentation is essential for diagnostic and monitoring systems [11]. Deep learning algorithms, especially architectures based on convolutional neural networks (CNNs), are well suited to GI tract segmentation [12]. In recent decades, CNNs have shown encouraging results in diagnosing disorders in various human organs. The CNN model is advantageous because it extracts features hierarchically, beginning with the most basic and working up to the most abstract. Among the algorithms used efficiently for model optimization are the Dwarf Mongoose and Aquila optimizers [13,14]. Clinical procedures such as diagnosis, therapy planning, and administration can benefit from organ segmentation. In this scenario, digestive tract segmentation could benefit from a DL method, speeding up treatments and allowing patients to receive more effective care [12]. The proposed work builds a deep learning approach for automatic segmentation of the stomach and intestines of the gastrointestinal tract in MRI scans. These MRI scans were taken during radiation treatment of actual cancer patients, who underwent 1–6 scans per week depending upon the stage of cancer. The main contributions of the proposed research are as follows:

A UMobileNetV2 network is built by integrating MobileNetV2 into the contraction path of the UNet architecture as the encoder, while UNet layers are used in the expansion path as the decoder, to enhance local feature extraction when segmenting the GI tract in MRI images.
The model has been implemented on the UW-Madison GI tract dataset to segment the stomach, small intestine, and large intestine in the GI tract. The model is examined using the Adam, RMSprop, and SGD optimizers.
The model is also compared with three transfer learning models, Xception, ResNet 101, and NASNet, which are used as encoders in the UNet architecture. The approach is assessed using performance metrics such as model loss, dice coefficient, and IoU.

The remaining sections of this article are organized as follows. Section 2 presents the related work on classification and segmentation in the GI tract. Section 3 describes the methodology of this research work. Section 4 presents the results and discussion, section 5 presents a state-of-the-art comparison, and section 6 concludes the work done in this research.

2. Related work

In recent years, several researchers have worked on the classification and segmentation of the gastrointestinal system. Table 1 summarises current, significant learning-based advancements in this domain. Cogan T. et al. [15] created the MAPGI framework in 2019 for modular and automated pre-processing of gastrointestinal images. For the Kvasir dataset, the pre-processing procedures include edge elimination, filtering, and color mapping. The deep learning architectures Inception-v4, Inception-v2, and NASNet achieved accuracy scores of 0.9845, 0.9848, and 0.9735 for GI Tract segmentation. Sharif M. et al. [16] proposed an approach that merges deep convolutional and geometric features in 2019. The suggested technique was evaluated on a database of 5500 images and demonstrated classification accuracy and precision of 99.42% and 99.51%, respectively. Gamage C. et al. predicted eight classes of GI disease anomalies in 2019 using a combination of DenseNet-201, VGG-16, ResNet-18, and a CNN followed by a global average pooling layer [17]. D. E. Diamantis et al. proposed a strategy in 2019 for coping with inadequate data by employing synthetically created images. A CNN was trained using WCE images [18]. Ozturk S. suggested in 2020 a well-organized LSTM model that is merged with the CNN output [19]. Lafraxo S. et al. proposed a DL model that employs a deep convolutional network and achieves 96.89% accuracy on the Kvasir dataset [20]. Hmoud Al-Adhaileh, M. et al. used the Kvasir dataset to train GoogleNet, ResNet-50, and AlexNet deep learning-based networks in 2021. AlexNet provided the best results, with 97% accuracy [21]. Yogapriya J. et al. used classic image processing methods, a data augmentation strategy, and a deep network to categorize GI disorders in wireless endoscopic images [22]. In [23], S. Ozturk introduced a model that combines a CNN with a residual LSTM. Montalbo et al. [24] proposed the Multi-Fused Residual CNN (MFuRe-CNN) in 2022 for analyzing endoscopic images of GI illnesses using the Kvasir dataset. Gibson et al. reported a neural network design for segmentation of eight organs [25]. These organs, including the pancreas and the digestive tract (esophagus, stomach, and duodenum), are relevant for navigation in endoscopic pancreatic and biliary procedures. Wang S. et al. published a multi-scale deep network in 2020 to segment gastrointestinal (GI) lesions from endoscopic images [26]. Khan M. A. et al. proposed an approach for categorizing and diagnosing GI ulcers, polyps, and hemorrhages in 2020. In [27], a Recurrent CNN tailored for ulcer segmentation was recommended. In 2021, Garden et al. [28] established a segmentation technique, based on a direct extension of a canonical method, appropriate for identifying GI polyps [29]. According to the literature, the gastrointestinal system has been substantially researched in recent years, including both classification and segmentation. These studies made use of a variety of datasets, including endoscopic and CT scan images. The proposed study uses MRI images to provide a unique method for segmenting the stomach and intestines in the GI system.

Table 1. Literature review on gastrointestinal tract.

https://doi.org/10.1371/journal.pone.0302880.t001

3. Proposed methodology

This section discusses the methodology for segmentation and classification of the stomach, small bowel, and large bowel in MRI scans. Section A describes the input dataset. Section B describes ground truth mask generation using Run Length Encoding (RLE). Section C discusses the data augmentation applied to the dataset. Section D discusses the model for segmenting the gastrointestinal tract. Section E lists the simulation parameters. Section F gives the details of the three encoders used for the UNet model. Section G presents the performance metrics used to analyze the model and the three encoders. Fig 1 shows the flow chart of the methodology for segmenting the stomach and intestines in the GI tract.

Fig 1. Flow chart for automatic segmentation of small bowel, large bowel, and stomach in GI tract.

https://doi.org/10.1371/journal.pone.0302880.g001

Fig 1 outlines the complete process for semantic segmentation of MRI scans using the UW-Madison dataset, which comprises 38,496 MRI images. The primary goal is to accurately segment GI tract organs, namely the small bowel, large bowel, and stomach, from the input dataset. Starting from the input dataset, ground truth masks are generated through Run-Length Encoding (RLE), and data augmentation is then applied to improve model robustness. A semantic segmentation model, UMobileNet V2, is built in which MobileNet V2 is used as the encoder in the UNet model for segmenting GI organs. The core of the workflow is the comparison of the UMobileNet V2 model with three other encoders (Xception, ResNet 101, and NASNet Mobile). These models are trained with three optimizers: Adam, RMSprop, and SGD. Performance is evaluated using loss, Dice coefficient, and Intersection over Union, which allows the effectiveness of the models to be compared. The proposed technique is further run for a higher number of epochs to check its performance. The workflow concludes with visualization of the results from the best-performing model, showing how accurately it segments gastrointestinal structures within MRI scans.

A. Input dataset

The University of Wisconsin-Madison, a public research university in Madison, Wisconsin, has published a dataset of MRI scans [30]. The dataset comprises 85 individuals who underwent scans during a period ranging from 1 to 6 days. Each daily scan consists of either 144 or 80 slices, depending on the patient. In total, the dataset has 38,496 MRI images. The images in the database vary in size: 266x266, 310x360, and 276x276. All images were resized to 224x224 to make them uniform for training purposes. Fig 2A and 2B show sample MRI scans from the database.

Fig 2. Images of the UW-Madison database.

https://doi.org/10.1371/journal.pone.0302880.g002

B. Ground truth mask generation

The dataset contains 38,496 MRI slices, and each MRI slice has three annotations, for the small bowel, large bowel, and stomach, provided in RLE-encoded form in the CSV file. Hence, there are 115,488 annotations in the CSV file. Out of these 115,488 annotations, 14,085 are for the large bowel, 11,201 are for the small bowel, and 8,627 are for the stomach. The remaining 81,575 annotations are blank, that is, they contain no large bowel, small bowel, or stomach region. The ground truth mask is derived from these annotations using the RLE encoder. For example, Fig 3A shows the original slice number 82 of the day-20 scan of patient ID 123. Fig 3B shows the RLE encoding of the large intestine, Fig 3C shows the RLE encoding of the small intestine, and Fig 3D shows the RLE encoding of the stomach for the same slice.
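
The decoding step can be illustrated with a minimal sketch. It assumes the common run-length convention in which the annotation string holds space-separated "start length" pairs, starts are 1-based, and pixels are numbered row by row; the function name `rle_decode` and this layout are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def rle_decode(rle: str, height: int, width: int) -> np.ndarray:
    """Turn a run-length string into a binary mask of shape (height, width).

    Assumed format: "start length start length ...", with 1-based starts
    counted over the row-major flattened image. An empty string gives an
    all-zero mask, which corresponds to a blank annotation.
    """
    flat = np.zeros(height * width, dtype=np.uint8)
    tokens = [int(t) for t in rle.split()]
    # Even positions hold run starts, odd positions hold run lengths.
    for start, length in zip(tokens[0::2], tokens[1::2]):
        begin = start - 1                 # convert to 0-based index
        flat[begin:begin + length] = 1    # mark the run as foreground
    return flat.reshape(height, width)

mask = rle_decode("2 3 9 2", height=3, width=4)
print(mask)
```

For the toy string "2 3 9 2" on a 3x4 image, the first run covers three pixels of the first row and the second run covers two pixels at the start of the third row. One such mask would be decoded per organ (large bowel, small bowel, stomach) for each slice.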

Fig 3. Ground Truth Mask Generation (a) Original Image, (b) RLE Encoding for Large Bowel, (c) RLE Encoding for Small Bowel, and (d) RLE Encoding for Stomach.

https://doi.org/10.1371/journal.pone.0302880.g003

Table 2 provides a breakdown of annotations for the different anatomical regions, namely large bowel, small bowel, and stomach, together with a category labeled Blanks. The dataset is partitioned into training (80%), testing (10%), and validation (10%) subsets. In the training set, there are 11,269 annotations for the large bowel, 8,961 for the small bowel, 6,903 for the stomach, and 65,261 for the Blanks category. The testing and validation sets each contain 1,408 annotations for the large bowel, 1,120 for the small bowel, 862 for the stomach, and 8,157 for the Blanks category. These subsets are used for training and evaluating the proposed technique for segmentation of GI tract organs.

Table 2. Dataset splitting in training, testing and validation.

https://doi.org/10.1371/journal.pone.0302880.t002

C. Data augmentation

The dataset is unbalanced, with 14,085 large bowel cases, 11,201 small bowel cases, and 8,627 stomach cases. The dataset is balanced by replicating stomach cases, increasing their number from 8,627 to 10,783. Data augmentation is also applied to enlarge the data and make it more suitable for the model. It increases the diversity of images and acts as a regularizer. It alters the images while preserving the class label. The augmentations employed on this dataset are horizontal flipping, vertical flipping, and rotation by 80°. Fig 4 displays original and augmented images derived from the dataset. Fig 4A and 4E display the original images, (b) and (f) show the images after a horizontal flip, (c) and (g) show the images after a vertical flip, and (d) and (h) show the images after a rotation.
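
The flips and the replication step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names are hypothetical, replication is assumed to duplicate randomly chosen stomach cases, and the rotation is omitted because it needs an interpolating image library. The key point is that every geometric transform is applied identically to the image and its mask so that the labels stay aligned.

```python
import numpy as np

def flip_pair(image, mask, axis):
    """Flip an image and its mask along the same axis."""
    return np.flip(image, axis=axis), np.flip(mask, axis=axis)

def augment(image, mask):
    """Yield the original pair, a horizontal flip, and a vertical flip."""
    yield image, mask
    yield flip_pair(image, mask, axis=1)   # horizontal: reverse columns
    yield flip_pair(image, mask, axis=0)   # vertical: reverse rows

def replicate_indices(n_samples, target, seed=0):
    """Indices that grow a class from n_samples to target by duplication.

    Assumption: the extra cases are drawn at random without repetition.
    """
    rng = np.random.default_rng(seed)
    extra = rng.choice(n_samples, size=target - n_samples, replace=False)
    return np.concatenate([np.arange(n_samples), extra])

image = np.arange(6).reshape(2, 3)
mask = image % 2
pairs = list(augment(image, mask))
stomach_idx = replicate_indices(8627, 10783)   # counts taken from the text
print(len(pairs), len(stomach_idx))
```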

Fig 4. Sample Images After Applying the Augmentation Techniques; (a) & (e) Original Images, (b) & (f) Horizontal Flip, (c) & (g) Vertical Flip, and (d) & (h) Rotation.

https://doi.org/10.1371/journal.pone.0302880.g004

D. UMobileNetV2 model

The proposed model is a fusion of the UNet and MobileNet V2 models [31] for semantic segmentation of the large intestine, small intestine, and stomach in the GI tract on MRI data for cancer treatment. U-Net is a CNN model developed by Olaf Ronneberger et al. [32,33] for segmentation. U-Nets go beyond traditional image classification and object recognition methods by assigning a class label to each pixel in an image. U-Net extends the conventional CNN architecture by adding an expansion path (decoder) that provides a high-resolution semantic prediction. Fig 5 shows a block diagram of the model. The first path is the contraction path (encoder), which captures the image's features. The contraction path is built from convolution and max pooling layers. The expansion path (decoder) enables accurate localization by employing transposed convolutional layers; the network does not include a dense layer and can therefore process images of any size. U-Net takes its name from its two paths, which together resemble the letter U.

Fig 5. Block diagram of UMobileNetV2 for segmentation.

https://doi.org/10.1371/journal.pone.0302880.g005

Instead of a plain CNN encoder, the proposed model uses the pre-trained transfer learning model MobileNet V2. MobileNet V2 [34] is a CNN design intended to improve performance on mobile devices. It is based on an inverted residual structure, in which the bottleneck layers are connected via residual connections. The intermediate expansion layer uses lightweight depthwise convolutions to filter features as a source of non-linearity. The convolutions in MobileNet V2 are depthwise separable, which reduces the number of parameters compared to a network constructed using ordinary convolutions. Consequently, compact deep neural networks are obtained.

In place of a single 3x3 convolution layer followed by batch normalization and ReLU, the MobileNet design splits the convolution into a 3x3 depthwise convolution and a 1x1 pointwise convolution, each followed by batch normalization. 13 downsampling blocks are used, each with its own configuration of convolution, batch normalization, and ReLU layers. As the image passes through these blocks, it loses spatial resolution but gains depth through an increasing number of feature maps. MobileNet V2 is chosen for downsampling in the proposed U-Net architecture because of its many benefits, such as its small size and low processing time. Table 3 gives a detailed description of the different layers of the proposed model.

Table 3. Detailed description of layers of the UMobileNet model.

https://doi.org/10.1371/journal.pone.0302880.t003

The computation cost of the UMobileNet V2 model is also measured in form of FLOPs that involves multiplying the number of operations per parameter by the total trainable parameters, the batch size, and the number of training iterations. The formula for FLOPs can be expressed as:

FLOPs = 2×Number of Operations per Parameter×Trainable Parameters×Batch Size×Number of Iterations

Here number of operations per parameter is assumed to be 2, Trainable Parameters are equal to 409,059 and Batch Size is taken as 16. So number of FLOPs used in the UMobileNet V2 model is 2×2×409,059×16×1 = 13,097,152.

So, for one iteration with a batch size of 16, the computation cost is approximately 13,097,152 FLOPs. The computational cost of the proposed model is comparatively less than other encoders because here MobileNet V2 is used as encoder. MobileNetV2 employs depthwise separable convolutions, a technique that divides a conventional convolution into two distinct operations: a depthwise convolution and a pointwise convolution. This minimises the quantity of parameters and calculations compared to traditional convolutions, leading to lower computational cost.
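
The parameter saving from depthwise separable convolutions can be checked with a short calculation. The sketch below counts weights only (biases and batch normalization parameters are left out), and the layer size used, a 3x3 kernel with 32 input and 64 output channels, is an arbitrary illustrative choice rather than a layer taken from the paper.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in an ordinary convolution: one kernel x kernel x c_in filter per output channel."""
    return kernel * kernel * c_in * c_out

def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise separable convolution."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 convolution that mixes channels
    return depthwise + pointwise

standard = standard_conv_params(3, 32, 64)
separable = separable_conv_params(3, 32, 64)
print(standard, separable, round(standard / separable, 2))
```

For this example layer the ordinary convolution needs 18,432 weights and the separable version 2,336, roughly an eightfold reduction.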

E. Simulations parameters

In addition to the model's structure, it is essential to describe how the network was trained and evaluated. During training of the deep neural model, several parameter choices were made. The MobileNet V2 model was used to build the proposed network, which was compared with three different transfer learning models, namely Xception, ResNet 101, and NASNet Mobile. The model's weights were set through Glorot initialization [35]. The loss function used for the simulation is the Tversky loss. It is commonly used as a loss function in image segmentation tasks, especially in medical image analysis. The formula for calculating the Tversky loss is:

Tversky Loss = 1 − TP / (TP + α×FP + β×FN)

where TP is the number of true positives, FP is the number of false positives, FN is the number of false negatives, and α and β are weight parameters that adjust the balance between false positives and false negatives.
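
A minimal NumPy sketch of this loss is given below, assuming the standard Tversky formulation in which the index is TP / (TP + α·FP + β·FN) and the loss is one minus the index. The paper does not state the values of α and β, so the defaults of 0.5 (which reduce the index to the Dice coefficient) and the small smoothing term `eps` are assumptions made for illustration.

```python
import numpy as np

def tversky_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """Tversky loss for binary masks or per-pixel probabilities in [0, 1].

    alpha weights false positives, beta weights false negatives.
    alpha, beta, and eps are assumed values, not taken from the paper.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    tp = np.sum(y_true * y_pred)           # soft true positives
    fp = np.sum((1.0 - y_true) * y_pred)   # soft false positives
    fn = np.sum(y_true * (1.0 - y_pred))   # soft false negatives
    index = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1.0 - index

truth = [1, 1, 0, 0]
pred = [1, 0, 0, 0]            # one organ pixel is missed
balanced = tversky_loss(truth, pred)
recall_heavy = tversky_loss(truth, pred, alpha=0.3, beta=0.7)
print(balanced, recall_heavy)
```

Raising β above α penalizes missed organ pixels more heavily than spurious ones, which is why the second call returns a larger loss for the same prediction.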

The model's performance has been evaluated using three optimizers: Adam, RMS, and SGD. The model was run with a batch size of 16 for 10 epochs and evaluated on the UW-Madison dataset. The model's learning rate is 0.0001. Python and the Keras and TensorFlow [36] packages were used to build the model. Keras is a free and simple tool for developing neural networks; it is open-source and compatible with TensorFlow and Theano. An NVIDIA Tesla P100 GPU was used for the simulation. All the simulations were carried out in a Google Colab notebook.
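
To make the difference between the three optimizers concrete, the sketch below writes out their textbook update rules and applies them to a toy one-parameter problem, f(w) = w². Only the learning rate of 0.0001 comes from the text; the decay constants, the number of steps, and the toy objective are illustrative assumptions, and library implementations such as those in Keras may differ in small details.

```python
import numpy as np

def sgd_step(w, grad, state, lr):
    """Plain gradient descent: move against the gradient."""
    return w - lr * grad, state

def rmsprop_step(w, grad, state, lr, rho=0.9, eps=1e-7):
    """Scale the step by a running average of squared gradients."""
    v = rho * state.get("v", 0.0) + (1 - rho) * grad ** 2
    return w - lr * grad / (np.sqrt(v) + eps), {"v": v}

def adam_step(w, grad, state, lr, b1=0.9, b2=0.999, eps=1e-7):
    """Combine a momentum term with RMS scaling, both bias-corrected."""
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * grad
    v = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), {"t": t, "m": m, "v": v}

def minimise(step, w0=1.0, lr=1e-4, n_steps=100):
    """Run n_steps updates on f(w) = w**2, whose gradient is 2*w."""
    w, state = w0, {}
    for _ in range(n_steps):
        w, state = step(w, 2.0 * w, state, lr)
    return w

results = {fn.__name__: minimise(fn) for fn in (sgd_step, rmsprop_step, adam_step)}
print(results)
```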

F. Different encoders used for UNet model

Transfer learning is a method that reuses a network trained for one task as the starting point of a model for a second, related task. The idea is to transfer knowledge gained from solving one problem to another related problem so that a more accurate or efficient model can be trained with less data. This is especially useful when labelled data for the new task is scarce.

a. Xception model.

Xception [37] is a deep network designed to overcome the limitations of Inception models for image classification tasks. Xception uses depthwise separable convolutions, which can significantly reduce computational complexity and improve model performance. This architecture allows models to learn more efficiently by reducing the number of parameters while preserving strong expressive power. Xception models are used for various computer vision tasks such as recognition, segmentation, and fine-grained image classification.

b. ResNet 101 model.

ResNet-101 is a deep CNN designed for image classification tasks. It was introduced in the 2016 article "Deep Residual Learning for Image Recognition" by Microsoft researchers He, Zhang, Ren, and Sun [38]. This model is an extension of the ResNet 50 model, a variation of the traditional CNN architecture. The core concept behind the ResNet network is the introduction of residual connections, which allow the network to learn the identity function in addition to the traditional convolution and pooling layers. The number "101" in the model name denotes the number of layers in the model, which is much deeper than other CNN architectures such as VGG-16 and AlexNet. As a result, ResNet-101 can learn more powerful and complex feature representations from the input data, improving performance on image classification tasks.
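
The role of the residual connection can be shown with a toy block. The sketch below uses two small dense layers instead of convolutions purely to keep the example short, so it is a simplified stand-in for a real ResNet block and the function name is hypothetical. When the learned weights are zero, the block returns its input unchanged, which is the identity behaviour described above.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute y = x + F(x), where F is a two-layer transform with ReLU."""
    hidden = np.maximum(0.0, x @ w1)   # first layer followed by ReLU
    return x + hidden @ w2             # skip connection adds the input back

x = np.array([[1.0, -2.0, 3.0]])
zero_w1 = np.zeros((3, 4))
zero_w2 = np.zeros((4, 3))
print(residual_block(x, zero_w1, zero_w2))   # equals x: the block acts as the identity
```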

c. NASNet model.

NASNet (Neural Architecture Search Network) Mobile is a deep CNN developed for image recognition tasks and designed to be deployed on mobile and embedded devices with limited computational resources. The model was introduced in the paper "Learning Transferable Architectures for Scalable Image Recognition" by Google researchers Zoph and Le in 2017 [39]. NASNet is based on the Neural Architecture Search (NAS) method, which uses reinforcement learning to automatically find the optimal network architecture. This method learns to identify the best building blocks for the model and their placement.

G. Performance metrics

Intersection over union (IoU) and the Dice coefficient are commonly used metrics for evaluating the performance of segmentation methods.

a. IoU: Also known as the Jaccard index, this is one of the most commonly used metrics for segmentation.

IoU is calculated by dividing the area of overlap between the predicted and ground-truth segmentation by the area of their union, that is, IoU = Area of Overlap / Area of Union. The values range from 0 to 1, where a value of 0 indicates no overlap and a value of 1 indicates perfect overlap.

b. Dice: The Dice coefficient is also referred to as the F1 score. It is calculated by multiplying the area of overlap between the two segmentations by two and then dividing by the total number of pixels in both, that is, Dice = 2 × Area of Overlap / (Pixels in Prediction + Pixels in Ground Truth).

There is a positive correlation between the Dice coefficient and the IoU coefficient. Both ranges span from 0 to 1, where a value of 1 indicates the highest degree of similarity between the predicted and actual outcomes, while a value of 0 indicates the lowest level of resemblance.
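
Both metrics can be computed from a pair of binary masks in a few lines. The sketch below is illustrative only: the function names are hypothetical, and returning 1.0 when both masks are empty is an assumed convention for blank slices rather than something stated in the paper.

```python
import numpy as np

def iou_score(y_true, y_pred):
    """Intersection over Union (Jaccard index) of two binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    union = np.logical_or(y_true, y_pred).sum()
    if union == 0:
        return 1.0   # assumed convention: two empty masks match perfectly
    return float(np.logical_and(y_true, y_pred).sum() / union)

def dice_score(y_true, y_pred):
    """Dice coefficient: twice the overlap divided by the total mask size."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    total = y_true.sum() + y_pred.sum()
    if total == 0:
        return 1.0   # same convention for empty masks
    return float(2.0 * np.logical_and(y_true, y_pred).sum() / total)

truth = np.array([[1, 1, 0], [0, 1, 0]])
pred = np.array([[1, 0, 0], [0, 1, 1]])
iou = iou_score(truth, pred)
dice = dice_score(truth, pred)
print(iou, dice)
```

In this toy case the masks share 2 pixels out of a union of 4, so the IoU is 0.5 and the Dice coefficient is 4/6. For a single pair of masks the two are linked by Dice = 2·IoU / (1 + IoU), which is why they rise and fall together.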

4. Results & discussions

The following sections show the results of the UMobileNet V2 model and of the UNet model with three other encoders, each trained with three different optimizers, for segmentation of the GI tract. The results were obtained using four encoders, namely MobileNet V2, Xception, ResNet 101, and NASNet Mobile, with three optimizers: Adam [40], RMS [41], and SGD [42].

A. Results for Adam optimizer

This section shows the results of different encoders obtained using Adam optimizer.

a. Loss analysis.

The UNet models with different encoders were assessed using loss, dice, and IoU. Fig 6A displays the loss plot of the Xception network, Fig 6B shows the loss plot for the ResNet 101 model, Fig 6C displays the loss curve of the NASNet network, and Fig 6D shows the results of the UMobileNet V2 model using the Adam optimizer. Fig 6 shows that the UMobileNet V2 model obtains the lowest loss in comparison with the other encoder networks.

Fig 6. Loss analysis for different encoders using adam optimizer.

(a) Xception, (b) ResNet 101, (c) NASNet Mobile, and (d) UMobileNet V2 Model.

https://doi.org/10.1371/journal.pone.0302880.g006

b. Dice coefficient analysis.

The UNet models with transfer learning encoders were assessed using the dice coefficient. Fig 7A shows the dice curve of the Xception model, Fig 7B shows the dice coefficient plot for the ResNet 101 model, Fig 7C shows the dice coefficient curve of the NASNet model, and Fig 7D shows the results of the UMobileNet V2 model using the Adam optimizer. Fig 7 demonstrates that, compared to the other models, the UMobileNet V2 model yields the highest dice coefficient value.

Fig 7. Dice coefficient analysis for different encoders using adam optimizer.

(a) Xception, (b) ResNet 101, (c) NASNet Mobile, and (d) UMobileNet V2 Model.

https://doi.org/10.1371/journal.pone.0302880.g007

c. IoU analysis.

The IoU coefficient was used to compare the UMobileNet V2 model to all other encoder models. Fig 8A displays the IoU curve for the Xception model, Fig 8B shows the IoU curve for the ResNet 101 model, Fig 8C depicts the IoU curve for the NASNet model, and Fig 8D displays the plot of the UMobileNet V2 model using the Adam optimizer. In terms of the IoU coefficient, Fig 8 shows that the UMobileNet V2 model performs better than every other transfer learning model.

Fig 8. IoU Analysis for Different Encoders using Adam Optimizer (a) Xception, (b) ResNet 101, (c) NASNet Mobile, and (d) UMobileNet V2 Model.

https://doi.org/10.1371/journal.pone.0302880.g008

Fig 9 compares the outcomes for the Adam optimizer for each model in terms of loss, dice coefficient, and IoU. The figure shows that the UMobileNet V2 model performed better than the other transfer learning models. Using the Adam optimizer, the presented model produced the highest dice coefficient with a value of 0.8904, the lowest loss value of 0.1310, and the best IoU value of 0.8697.

Fig 9. Results comparison of UMobileNet V2 model with different TL models with adam optimizer using test dataset.

https://doi.org/10.1371/journal.pone.0302880.g009

B. Results for RMS optimizer

The models' loss, dice, and IoU are also evaluated using the RMS optimizer. The following section shows the loss, dice, and IoU plots for the different models using the RMS optimizer.