Improved performance and consistency of deep learning 3D liver segmentation with heterogeneous cancer stages in magnetic resonance imaging

  • Moritz Gross,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America, Charité Center for Diagnostic and Interventional Radiology, Charité—Universitätsmedizin Berlin, Berlin, Germany

  • Michael Spektor,

    Roles Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America

  • Ariel Jaffe,

    Roles Data curation, Validation, Writing – review & editing

    Affiliation Department of Internal Medicine, Yale University School of Medicine, New Haven, Connecticut, United States of America

  • Ahmet S. Kucukkaya,

    Roles Data curation, Writing – review & editing

    Affiliations Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America, Charité Center for Diagnostic and Interventional Radiology, Charité—Universitätsmedizin Berlin, Berlin, Germany

  • Simon Iseke,

    Roles Data curation, Writing – review & editing

    Affiliations Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America, Department of Diagnostic and Interventional Radiology, Pediatric Radiology and Neuroradiology, Rostock University Medical Center, Rostock, Germany

  • Stefan P. Haider,

    Roles Validation, Writing – review & editing

    Affiliations Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America, Department of Otorhinolaryngology, University Hospital of Ludwig Maximilians Universität München, Munich, Germany

  • Mario Strazzabosco,

    Roles Supervision, Writing – review & editing

    Affiliation Department of Internal Medicine, Yale University School of Medicine, New Haven, Connecticut, United States of America

  • Julius Chapiro,

    Roles Conceptualization, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America

  • John A. Onofrey

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    john.onofrey@yale.edu

    Affiliations Department of Radiology and Biomedical Imaging, Yale University School of Medicine, New Haven, Connecticut, United States of America, Department of Urology, Yale University School of Medicine, New Haven, Connecticut, United States of America, Department of Biomedical Engineering, Yale University, New Haven, Connecticut, United States of America

Abstract

Purpose

Accurate liver segmentation is key for volumetry assessment to guide treatment decisions. Moreover, it is an important pre-processing step for cancer detection algorithms. Liver segmentation can be especially challenging in patients with cancer-related tissue changes and shape deformation. The aim of this study was to assess the ability of state-of-the-art deep learning 3D liver segmentation algorithms to generalize across all different Barcelona Clinic Liver Cancer (BCLC) liver cancer stages.

Methods

This retrospective study included patients from an institutional database who had arterial-phase T1-weighted magnetic resonance images with corresponding manual liver segmentations. The data was split 70/15/15% into training/validation/testing sets, each proportionally equal across BCLC stages. Two 3D convolutional neural networks were trained using identical U-net-derived architectures with equally sized training datasets: one spanning all BCLC stages (“All-Stage-Net”: AS-Net), and one limited to early and intermediate BCLC stages (“Early-Intermediate-Stage-Net”: EIS-Net). Segmentation accuracy was evaluated by the Dice Similarity Coefficient (DSC) on a dataset spanning all BCLC stages, and a Wilcoxon signed-rank test was used for pairwise comparisons.

Results

219 subjects met the inclusion criteria (170 males, 49 females, 62.8±9.1 years) from all BCLC stages. Both networks were trained using 129 subjects: AS-Net training comprised 19, 74, 18, 8, and 10 BCLC 0, A, B, C, and D patients, respectively; EIS-Net training comprised 21, 86, and 22 BCLC 0, A, and B patients, respectively. DSCs (mean±SD) were 0.954±0.018 and 0.946±0.032 for AS-Net and EIS-Net (p<0.001), respectively. The AS-Net (0.956±0.014) significantly outperformed the EIS-Net (0.941±0.038) on advanced BCLC stages (p<0.001) and yielded similarly good segmentation performance on early and intermediate stages (AS-Net: 0.952±0.021; EIS-Net: 0.949±0.027; p = 0.107).

Conclusion

To ensure robust segmentation performance across cancer stages that is independent of liver shape deformation and tumor burden, it is critical to train deep learning models on heterogeneous imaging data spanning all BCLC stages.

Introduction

Liver cancer is the third most common cause of cancer-related death worldwide [1] and both incidence rates and mortality are rising [2, 3]. Hepatocellular carcinoma (HCC) is the most prevalent form of primary liver cancer, accounting for 70–85% of liver cancers globally [4]. Magnetic resonance (MR) imaging offers high tissue contrast and, with the use of contrast agents and multiphasic imaging, HCC can be detected and diagnosed reliably without the need for an invasive biopsy in a majority of cases [5]. Multiple staging systems have been developed to assess the stage of HCC and to provide guidance regarding optimal therapeutic management [6–10]. In particular, the Barcelona Clinic Liver Cancer (BCLC) staging classification [6] is widely accepted and the most commonly used in Western cohorts. The BCLC classification utilizes three clinical elements: tumor burden, functional status as measured by the Eastern Cooperative Oncology Group (ECOG) Performance Status [11], and underlying liver function measured by the Child-Pugh class [12] to stratify patients into five staging categories: very early stage (BCLC-0), early stage (BCLC-A), intermediate stage (BCLC-B), advanced stage (BCLC-C), and terminal stage (BCLC-D).

Accurate organ segmentation plays an important role in medical image analysis tasks. Liver segmentation is key for volumetry prior to therapeutic interventions [13–17] and as a pre-processing step for subsequent cancer detection algorithms [18, 19]. Accurate volumetry assessment is imperative to understanding the risk of hepatic decompensation associated with various treatment approaches and plays a critical role in management decisions. It has been shown that the critical residual liver volume necessary to prevent post-hepatectomy liver failure in non-cirrhotic patients is 20–30%, compared to at least 40% residual volume in cirrhotic patients. Thus, possible curative therapies again rely heavily on accurate volume assessment in patients with liver cancer [20]. However, manual segmentation is time-consuming and dependent on the rater’s level of experience, which leads to a lack of reproducibility and inter-observer variability [21]. Heterogeneity in terms of disease stage and imaging appearance further complicates segmentation. Liver segmentation can be especially challenging in patients with abnormal liver function and significant disease complexity. Various morphologic changes occur in cirrhotic patients, including left lobe hypertrophy, increased nodularity of the liver surface, portal hypertension often manifesting with significant ascites, and changes in vasculature. Together with cancer-related tissue changes that alter the liver contour, these contribute to substantial variations in the imaging morphology [22, 23]. In this paper, we use the BCLC classification as a marker for liver function, severity of HCC, disease complexity, and overall imaging heterogeneity.

To improve liver segmentation reproducibility, automated methods based on image analysis and machine learning have been developed and have shown promising results [24–27]. Current state-of-the-art methods utilize deep learning based on convolutional neural networks (CNNs) [28]. Such CNNs have demonstrated superior segmentation results across a wide variety of medical image segmentation applications [29] and also have the advantage of processing times in the order of seconds. In particular, these algorithms have been applied to segment the liver on computed tomography (CT) and MRI data [30–42]. However, machine learning algorithms, and in particular high-dimensional and non-linear deep learning algorithms, are prone to over-fitting, which results in models that are not robust to data that varies substantially from their training data [43]. This is a problem of distributional shift, or dataset shift, where a mismatch between distributions of training data and testing data exists [44]. Software development specifications aimed at ensuring quality in the development and the use of AI modules identify distributional shift as one of the major risks to robust application of AI [45]. To avoid distributional shifts caused by sample selection bias, it is critical that algorithms be trained on data representative of the test set.

Therefore, deep learning liver segmentation algorithms trained only on early and intermediate HCC stages will be tuned to this specific patient population and thus fail to generalize to more advanced stages due to their heterogeneous imaging morphology. The aim of this study was to assess the ability of state-of-the-art deep learning 3D liver segmentation algorithms to generalize across the clinical distribution of all BCLC liver cancer stages.

Materials and methods

Inclusion of patients

This HIPAA-compliant, retrospective, single-institution study was IRB-approved with full waiver of consent and included all patients from an institutional database with T1-weighted arterial-phase MR images and a corresponding manual liver segmentation available for processing. All patients were >18 years old and had treatment-naïve HCC that was either imaging- or histopathologically-proven. Patient data was collected from the hospital’s electronic health record and all patients were retrospectively staged according to the BCLC staging system.

Magnetic resonance imaging data

MR images were acquired between the years 2008 and 2019. Images were downloaded from the Picture Archiving and Communication System (PACS) server, de-identified using in-house software and subsequently converted to the Neuroimaging Informatics Technology Initiative (NIfTI) format. All patients underwent a standard institutional imaging protocol for triphasic MR image acquisition. Arterial phase images were used for liver segmentation because most HCC lesions display arterial phase hyperenhancement (APHE), which is reflected in the current LI-RADS criteria [46]. Tumors with APHE exhibit good contrast and high signal-to-noise ratio, which facilitates tumor delineation. Late arterial-phase T1-weighted breath-hold sequences were acquired 12–18 seconds (s) post-contrast injection with several gadolinium-based contrast agents. Images were acquired on a variety of scanners with different field strengths (1.16T, 1.5T, and 3T). Full details of the imaging parameters can be found in S1 Table. Briefly, the median repetition time (TR) and median echo time (TE) were 4.39 ms and 2 ms, respectively. The median slice thickness was 3 mm, the median bandwidth 445 Hz, and the image matrix ranged from 1406×138 to 3206×247. All liver segmentations were done by a medical student (M.G., over 2.5 years of image analysis training) under the supervision of a board-certified abdominal radiologist (M.S., 10 years of experience) using 3D Slicer (v4.10.2) [47].

Data partition

Early and intermediate BCLC stage (i.e., BCLC-0, BCLC-A, BCLC-B) patients were randomly split into training, validation, and testing sets containing 70%, 15%, and 15% of the subjects, respectively. Due to the relatively lower number of data samples of late BCLC stages (i.e., BCLC-C, BCLC-D), these subjects were split among the training and testing sets to each contain 50% of the subjects, respectively. The sampled subjects from the set of training data were then used to create two equally sized training subsets.
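The partitioning rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_by_stage`, the stage labels, the rounding rule for odd group sizes, and the seed are all hypothetical.

```python
import random
from collections import defaultdict

# Stage labels are illustrative: "0", "A", "B" are early/intermediate, "C", "D" are late.
EARLY_INTERMEDIATE = {"0", "A", "B"}


def split_by_stage(stages, seed=0):
    """Split subjects per BCLC stage.

    stages: dict mapping subject id -> stage label.
    Early/intermediate stages: roughly 70/15/15 train/val/test.
    Late stages: roughly 50/50 train/test, no validation subjects.
    """
    rng = random.Random(seed)
    by_stage = defaultdict(list)
    for subject, stage in sorted(stages.items()):
        by_stage[stage].append(subject)

    splits = {"train": [], "val": [], "test": []}
    for stage, subjects in sorted(by_stage.items()):
        rng.shuffle(subjects)
        n = len(subjects)
        if stage in EARLY_INTERMEDIATE:
            n_train = round(0.70 * n)
            n_val = round(0.15 * n)
        else:
            n_train = (n + 1) // 2  # assumed rounding for odd group sizes
            n_val = 0
        splits["train"] += subjects[:n_train]
        splits["val"] += subjects[n_train:n_train + n_val]
        splits["test"] += subjects[n_train + n_val:]
    return splits
```

The sketch stops at the shared training pool; drawing the two equally sized training subsets from that pool is a further sampling step that is not shown here.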

Model development

Two deep neural networks were trained in a supervised manner to automatically segment the liver from 3D arterial-phase MR images. Both models have an identical fully-convolutional encoder-decoder architecture [48] based on the U-net [49] that includes residual units [50] and uses 3D convolution operations (see S1 File for details). The only difference between the two algorithms was the dataset used for training, which was composed of different combinations of BCLC stages. The first model, “Early-Intermediate-Stage-Net” (EIS-Net), was trained on early and intermediate BCLC stages. The second model, “All-Stage-Net” (AS-Net), was trained using a dataset comprising all five BCLC stages. Both models used the same validation set and were tested on the same test set. The manual liver segmentations were used as ground-truth.

The input MR images were standardized to have isotropic voxel spacing of 2 mm³ and intensities were scaled so that the 25th and 75th percentiles mapped to -0.5 and +0.5, respectively [51]. For model training, random 3D image patches (64×64×32 voxels) were extracted in a 3:1 ratio centered on the liver mask compared to the background image to focus model training on the liver. Both models were trained over 2000 epochs using mini-batches of 64 patches and the Dice similarity loss function [52] using the Adam optimizer [53] with a fixed learning rate of 0.0001. The Dice loss was chosen because it directly reflects the metric used to evaluate the segmentation task. The framework for model training and evaluation is depicted in Fig 1.
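The pre-processing and loss described in this paragraph can be illustrated with a short NumPy sketch. It is not the published implementation (which uses PyTorch and MONAI); the function names, the reading of the percentile scaling as a linear map of the 25th/75th percentiles to -0.5/+0.5, and the particular soft Dice formulation are assumptions made for illustration.

```python
import numpy as np


def scale_intensity_iqr(image):
    """Linearly rescale so the 25th percentile maps to -0.5 and the 75th to +0.5."""
    image = np.asarray(image, dtype=np.float64)
    p25, p75 = np.percentile(image, [25, 75])
    iqr = p75 - p25
    if iqr == 0:
        return np.zeros_like(image)  # constant image: nothing to scale
    return (image - p25) / iqr - 0.5


def sample_patch_center(liver_mask, rng, p_liver=0.75):
    """Choose a patch centre on the liver with probability p_liver (a 3:1 ratio)."""
    on_liver = rng.random() < p_liver
    candidates = np.argwhere((liver_mask > 0) if on_liver else (liver_mask == 0))
    return tuple(int(c) for c in candidates[rng.integers(len(candidates))])


def soft_dice_loss(prob, target, eps=1e-6):
    """One common soft Dice loss: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    prob = np.asarray(prob, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(prob * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
```

Cropping the 64×64×32 patch around the sampled centre, including padding near the image border, is left out of the sketch.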

Fig 1. Overview of the training and evaluation framework for the automated 3D liver segmentation method.

Training input consists of 3D arterial-phase magnetic resonance image (MRI) volumes with corresponding manually annotated ground-truth liver segmentation masks. To evaluate model performance in an independent test set, the output liver segmentations were compared to annotated ground-truth.

https://doi.org/10.1371/journal.pone.0260630.g001

Models were implemented in Python (v3.7) using PyTorch (v1.5.1) and the open-source Medical Open Network for AI (MONAI) (v0.3.0) framework. Model training and evaluation was performed on a Linux workstation using an NVIDIA RTX 2080 Ti GPU. All code is publicly available under https://github.com/OnofreyLab/liver-segm.

Model evaluation and statistical analysis

The two algorithms’ 3D liver segmentations were assessed qualitatively and compared quantitatively against the manual segmentations. To quantify segmentation performance, the Dice Similarity Coefficient (DSC) was calculated to measure overlap with the ground-truth. The worst-case segmentation surface accuracy of the algorithms’ liver segmentation to the ground-truth was evaluated by means of a Modified Hausdorff Distance (MHD). Here, the MHD was defined as the 95th percentile of the original Hausdorff Distance (HD) since HD was shown to be sensitive to outliers [54]. To assess average segmentation surface accuracy, the Mean Absolute Distance (MAD) of the output liver segmentation mask to the ground-truth was calculated. MHD and MAD are reported in voxels (for images with 2 mm³ voxel spacing). Equations for the segmentation metrics can be found in the S1 File.
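To make the three metrics concrete, the following NumPy sketch computes them on binary masks. The exact definitions used in the study are in the S1 File; here the surface extraction (6-neighbourhood), the pooling of distances from both directions, and all function names are assumptions. The brute-force distance matrix is only suitable for small examples, and both masks must be non-empty.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """Overlap between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty
    return 2.0 * np.logical_and(pred, truth).sum() / total


def surface_voxels(mask):
    """Coordinates of foreground voxels with at least one background 6-neighbour."""
    mask = np.asarray(mask, bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    core = tuple(slice(1, -1) for _ in range(mask.ndim))
    interior = mask.copy()
    for axis in range(mask.ndim):
        for shift in (-1, 1):
            interior &= np.roll(padded, shift, axis=axis)[core]
    return np.argwhere(mask & ~interior)


def surface_distances(pred, truth):
    """Nearest-surface distances (in voxels), pooled over both directions."""
    a, b = surface_voxels(pred), surface_voxels(truth)
    pairwise = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.concatenate([pairwise.min(axis=1), pairwise.min(axis=0)])


def modified_hausdorff_95(pred, truth):
    """95th percentile of the pooled surface distances."""
    return float(np.percentile(surface_distances(pred, truth), 95))


def mean_absolute_distance(pred, truth):
    """Mean of the pooled surface distances."""
    return float(surface_distances(pred, truth).mean())
```

For full-size volumes, a distance transform or a KD-tree would replace the pairwise matrix, which grows with the product of the two surface sizes.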

Descriptive statistics were calculated using the Python library SciPy (v1.5.2) and were reported as absolute and relative frequencies (n and %) for categorical variables, mean and standard deviation (SD) for normally distributed variables, or median and interquartile range (IQR) for not normally distributed variables. A Wilcoxon signed-rank test was used for statistical pairwise comparisons between the algorithms and a p-value <0.05 was considered significant.
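A paired comparison of this kind can be run with `scipy.stats.wilcoxon`. The snippet below is only a sketch: the per-subject DSC values are synthetic numbers made up for illustration, not results from this study, and the variable names are hypothetical.

```python
from scipy.stats import wilcoxon

# Synthetic per-subject DSC values for two models evaluated on the same subjects.
dsc_model_a = [0.950 + 0.001 * i for i in range(12)]
dsc_model_b = [a - 0.005 - 0.001 * i for i, a in enumerate(dsc_model_a)]

# Paired, two-sided Wilcoxon signed-rank test on the per-subject differences.
result = wilcoxon(dsc_model_a, dsc_model_b)
significant = result.pvalue < 0.05  # threshold stated in the text
print(f"statistic={result.statistic}, p={result.pvalue:.4g}, significant={significant}")
```

The test is paired because both models segment the same test subjects, so each subject contributes one difference.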

Compliance with ethical standards

This HIPAA-compliant retrospective, single-institution study was conducted in accordance with the Declaration of Helsinki, and approval was granted by the Institutional Review Board of the Yale University School of Medicine with waiver of informed consent.

Results

Study population

From an institutional database of 629 HCC subjects, 219 subjects met the defined inclusion criteria. Population sample statistics are summarized in Table 1 and MR imaging parameters are summarized in S1 Table. Briefly, the study population comprised 170 male (77.6%) and 49 female (22.4%) subjects with treatment-naïve HCC and an age distribution of 62.8±9.1 (mean±SD) years. Thirty (13.7%) patients were staged as BCLC-0, 122 (55.7%) as BCLC-A, 32 (14.6%) as BCLC-B, 15 (6.8%) as BCLC-C, and 20 (9.1%) as BCLC-D.

Table 1. Demographic, radiological, and cancer staging sample statistics of the training, validation, and testing cohorts from 219 HCC patients included in this study.

https://doi.org/10.1371/journal.pone.0260630.t001

Data split

Each of the two training sets consisted of 129 patients. For the "Early-Intermediate-Stage-Net" (EIS-Net), the training set comprised 21 (16.2%) BCLC-0, 86 (66.6%) BCLC-A, and 22 (17.1%) BCLC-B patients; the training set for the "All-Stage-Net" (AS-Net) comprised 19 (14.7%) BCLC-0, 74 (57.3%) BCLC-A, 18 (14.0%) BCLC-B, 8 (6.2%) BCLC-C, and 10 (7.7%) BCLC-D patients. Both algorithms shared the same validation set of 28 patients with the following BCLC stages: 4 (14.3%) BCLC-0, 19 (67.8%) BCLC-A, and 5 (17.9%) BCLC-B patients. Both were evaluated on the same test set of 44 patients with the following BCLC stages: 5 (11.4%) BCLC-0, 17 (38.6%) BCLC-A, 5 (11.4%) BCLC-B, 7 (15.9%) BCLC-C, and 10 (22.7%) BCLC-D patients. Full details on sampling of the data sets can be found in the flowchart in Fig 2.

Fig 2. Inclusion and exclusion criteria, and partitioning of the dataset for model training and evaluation.

From an institutional database, 219 HCC patients that had arterial-phase MR images and a manual liver segmentation available for processing were included. Subjects from each BCLC stage were then allocated to the test set and patients were selected for shared validation and testing sets. From the overall training pool, subjects were sampled to create two training data subsets for the Early-Intermediate-Stage-Net (EIS-Net) and the All-Stage-Net (AS-Net).

https://doi.org/10.1371/journal.pone.0260630.g002

Model performance

Both the EIS-Net and AS-Net models were trained for 2000 epochs, at which time the loss function of the two models converged on both the training and validation datasets. The DSC (mean±SD) performance on the training datasets was 0.952±0.042 and 0.951±0.035 and on the validation dataset 0.928±0.093 and 0.928±0.093 for the EIS-Net and AS-Net, respectively. Segmentation of the validation and test set data was performed on the whole image using a large patch (224×224×128) in order to avoid stitching artifacts from smaller, overlapping patches. Segmentation times (median [IQR]) for both the EIS- and AS-Net were 0.73 [0.33] seconds and 0.70 [0.27] seconds on the validation and test set, respectively.

Qualitative assessment of the algorithms’ segmentation outputs on the test set across different BCLC stages showed that both the EIS-Net and the AS-Net performed well on early and intermediate BCLC stages (i.e., BCLC-0, BCLC-A, BCLC-B). However, the AS-Net outperformed the EIS-Net on more advanced stages (i.e., BCLC-C and BCLC-D). Examples of representative liver segmentations across BCLC stages are shown in Fig 3.

Fig 3. Example liver segmentation results across Barcelona Clinic Liver Cancer (BCLC) stages.

Rows from top to bottom show axial, sagittal and coronal arterial-phase magnetic resonance images of different subjects across BCLC stages (from left to right). The last row displays the liver segmentations as 3D renderings. The liver segmentation masks of the Early-Intermediate-Stage-Net (blue) and All-Stage-Net (orange), as well as the ground-truth (yellow) are overlaid on the images. While the Early-Intermediate-Stage-Net was trained only on patients with BCLC stages 0, A and B, the All-Stage-Net was trained on a training set spanning all BCLC cancer stages.

https://doi.org/10.1371/journal.pone.0260630.g003

Detailed assessments of the segmentation results showed that the EIS-Net failed on some advanced BCLC cancer stages with big HCC tumors, where large areas of hypointense necrotic tumor tissue were not classified as liver tissue. The AS-Net, by contrast, correctly classified those regions as liver tissue. In other cases, the EIS-Net incorrectly classified structures around the liver, such as parts of the small intestine or colon, as part of the liver, while the AS-Net correctly delineated the anatomical liver contour in those scans. Furthermore, in some patients, regions of large ascites surrounding the liver were classified as liver parenchyma by the EIS-Net, leading to large over-segmentation of the liver, whereas the AS-Net did not consider these areas to be part of the liver. Representative examples of better liver segmentation results of the AS-Net against the EIS-Net are shown in Fig 4.

Fig 4. Examples of the superior liver segmentation performance of the All-Stage-Net over the Early-Intermediate-Stage-Net.

Columns show results from five different subjects. Rows from top to bottom show axial, sagittal and coronal arterial-phase magnetic resonance images on which the All-Stage-Net (overlaid in orange) outperformed the Early-Intermediate-Stage-Net (overlaid in blue). Expert ground-truth liver segmentations are overlaid in yellow. The last row displays the liver segmentations as 3D renderings. White arrows point to areas of liver segmentation failure of the Early-Intermediate-Stage-Net. While the Early-Intermediate-Stage-Net was trained only on patients with Barcelona Clinic Liver Cancer (BCLC) stages 0, A and B, the All-Stage-Net was trained on a training set spanning all BCLC cancer stages.

https://doi.org/10.1371/journal.pone.0260630.g004

Quantitative comparison against the expert ground-truth yielded Dice Similarity Coefficients (DSC) (mean±SD) of 0.946±0.032 and 0.954±0.018 for the EIS-Net and the AS-Net, respectively (p<0.0001). The Modified Hausdorff Distance (MHD) (mean±SD), measuring the closeness of the algorithms’ liver segmentation to the manual ground-truth, was 5.812±8.822 and 3.500±4.033 for the EIS-Net and AS-Net, respectively (p = 0.005). The Mean Absolute Distance (MAD) (mean±SD) for the liver segmentations compared with the expert segmentations was 1.243±1.901 for the EIS-Net and 0.750±0.370 for the AS-Net (p = 0.005). Further radiological assessment showed that algorithm segmentations with a DSC of 0.95 corresponded well with the ground-truth.

When the two models’ liver segmentation performances were compared across different BCLC stages, they did not differ significantly for the early and intermediate BCLC stages (DSC: p = 0.107, MHD: p = 0.413, MAD: p = 0.428). However, the AS-Net performed significantly better on advanced HCC stages (DSC: p<0.0001, MHD: p = 0.003, MAD: p<0.0001). Pairwise comparisons between the EIS-Net and AS-Net for each BCLC stage are shown in Table 2. Boxplots in Fig 5 show that the AS-Net had lower performance variance, better mean performance, fewer outliers and better worst-case performance than the EIS-Net on all BCLC stages across all quantitative segmentation metrics (DSC, MHD, MAD), indicating a more consistent and robust segmentation performance.

Fig 5. Liver segmentation method performance across different Barcelona Clinic Liver Cancer (BCLC) cancer stages.

The automatic liver segmentations of the Early-Intermediate-Stage-Net (EIS-Net) and All-Stage-Net (AS-Net) were compared quantitatively against the experts’ manual segmentations by means of the Dice Similarity Coefficient (DSC), Modified Hausdorff Distance (MHD), and Mean Absolute Distance (MAD). AS-Net showed better mean performance, fewer outliers and better worst-case performance across all segmentation metrics indicating a more robust segmentation performance. A Wilcoxon signed-rank test was used for pairwise comparisons between the liver segmentation algorithms and a p-value <0.05 was considered statistically significant (denoted with *, ns denotes no significant differences).

https://doi.org/10.1371/journal.pone.0260630.g005

Table 2. Liver segmentation performance (Dice Similarity Coefficient (DSC), Modified Hausdorff Distance (MHD), and Mean Absolute Distance (MAD)) of the EIS-Net and AS-Net methods compared to manual ground-truth across different Barcelona Clinic Liver Cancer (BCLC) cancer stages.

https://doi.org/10.1371/journal.pone.0260630.t002

In livers where HCC involved <50% of the parenchyma, the AS-Net outperformed the EIS-Net significantly with all performance measures (DSC: p = 0.005, MHD: p = 0.007, MAD: p = 0.046). In livers where ≥50% of the parenchyma was involved by tumor tissue, the AS-Net had significantly better results when the performances were compared by the DSC and MAD (p = 0.023 and p = 0.039, respectively). However, no statistical significance was found between the two algorithms for the MHD (p = 0.225).

When compared specifically for the extent of cumulative tumor diameter, the AS-Net and EIS-Net did not yield statistically significantly different results for tumors <3 cm (DSC: p = 0.090, MHD: p = 0.385, MAD: p = 0.142). However, the AS-Net showed significantly better results than the EIS-Net for tumors ≥3 cm (DSC: p = 0.002, MHD: p = 0.003, MAD: p = 0.018). Comprehensive pairwise comparisons between the two segmentation models for a range of different patient features can be found in S2–S4 Tables.

Discussion

Accurate and robust whole liver segmentation is key for volumetry assessment to guide treatment decisions, such as determining whether liver resection, radioembolization or portal vein embolization is safe [13, 15, 55, 56]. Moreover, liver segmentation is an important pre-processing step for subsequent cancer detection algorithms. Segmentation can be especially challenging in patients with cancer-related tissue changes and liver shape deformity as morphology can be substantially altered. To improve automated segmentation performance on MR images in patients with heterogeneous imaging characteristics across the full spectrum of primary liver cancer, a deep learning algorithm was trained using imaging data spanning the full distribution of BCLC staging.

In this study, we demonstrated that training across the distribution of BCLC stages significantly improved the ability of deep learning liver segmentation algorithms to generalize across cancer stages. Models trained using data across all BCLC stages yielded better and more consistent segmentation performance when compared to models trained only on early and intermediate cancer stages. Both the “Early-Intermediate-Stage-Net” (EIS-Net) and the “All-Stage-Net” (AS-Net) showed good segmentation results on livers with early and intermediate BCLC stages. However, the EIS-Net failed on the segmentation of some advanced BCLC stage patients on which the AS-Net showed robust segmentation results. Overall, training with diverse data reduced the variance in segmentation performance, making deep learning algorithms more robust and able to achieve greater performance consistency across a heterogeneous cohort of imaging data that is typically encountered in clinical practice.

Advanced liver cancer leads to heterogeneous liver tissue and significantly altered liver shapes [22, 23]. Moreover, multifocal and large tumors displaying voluminous areas of contrast-enhancement, tumor necrosis, infiltrative disease, perfusion abnormalities or tumor thrombi considerably change liver tissue morphology on MR images and therefore make it difficult for deep neural networks to correctly classify those areas as liver tissue. Additionally, the liver contour can be altered by a more cirrhotic configuration displayed as a more nodular surface, and with progressing liver failure and the development of portal hypertension, further alterations including large volume ascites [22, 57]. All these factors substantially change the liver morphology on MR images and make the segmentation task challenging.

We hypothesized that the AS-Net showed better performance on advanced BCLC stage patients since it had already seen much bigger tumors, heterogeneous liver tissue, and severe ascites in its training data. Interestingly, the AS-Net did not perform worse on earlier BCLC stages, even with fewer training data of those stages. Moreover, the diversity in the AS-Net’s training data helped the model generalize better on various HCC stages and showed less variance across all performance measures, indicating that the heterogeneity of cancer stages in the training data also helped to improve consistency by reducing distributional shift between the training and testing data [44]. The model also had better worst-case performance, indicating that the diversity of BCLC stages led to more robust segmentation performance.

Many current state-of-the-art deep learning segmentation algorithms use encoder-decoder network architectures, and many practical improvements in segmentation performance can be realized through innovations in pre-processing [51], data augmentation and loss functions [29]. Previous liver segmentation studies have used the U-net architecture [19, 29, 32, 36, 37, 39] or its variants [34, 38]. The method of Bousabarah et al. [19] trained on 121 triphasic MR scans and tested on a set of 26 patients yielded a mean DSC of 0.91 (±0.01). The proposed fully convolutional neural network of Zeng et al. [34] used T2-weighted MR images and showed a DSC (mean±SD) of 0.952±0.01 on 51 validation patients. Wang et al.’s 2D U-net CNN for liver segmentation yielded a mean DSC of 0.95±0.03 with their method trained using 330 MRI and CT scans and tested on 100 T1-weighted MR images [32]. While our study’s goal was to determine the relative effect of different training data cohorts on segmentation model performance and not to focus on obtaining peak segmentation performance by exhaustively optimizing the network architecture, both of our models demonstrated segmentation performance comparable to that of previously published studies. Further performance gains may be realized with additional network tuning and model training strategies, and future work will involve accounting for distributional shifts during the model training process [58].

Our study has several limitations. First, the staging data for the patients in our database were collected retrospectively from the hospital’s electronic health record, and most patients in our database are distributed among earlier BCLC stages. Nevertheless, this distribution accurately reflects the clinical population at this site: most patients who undergo contrast-enhanced MR image acquisitions that require breath-holding have earlier BCLC stages, whereas patients with more advanced disease and resultant poor performance status are unable to follow the instructions required for adequate MR image acquisition. Additionally, our data were limited to treatment-naïve HCC patients and did not include patients with other types of hepatic pathologies. Therefore, we were not able to investigate how treatment-associated changes of the liver parenchyma would affect the models’ segmentation performance. Future work will assess the performance of the algorithm on patients who underwent treatment, include a prospective evaluation of AS-Net using data from multiple sites, and verify that our results hold across different network architectures.

Conclusion

In this paper, we demonstrate the training and validation of a fully automated 3D liver segmentation method using deep learning across the full spectrum of BCLC cancer stages. Our results show that diversity in the training data across all BCLC stages significantly improves the performance of robust whole liver MRI segmentation algorithms compared to the same algorithm trained with images representative of a limited subset of BCLC stages. To avoid problems caused by distributional shift and to ensure robust segmentation performance that is independent of liver shape deformation and tumor burden and generalizable across BCLC cancer stages, it is critical to train deep learning models on heterogeneous imaging data spanning all cancer stages and a diverse spectrum of diagnostic features. Moreover, we demonstrate the importance of model validation on datasets that are composed of a spectrum of cancer stages that exhibit heterogeneous diagnostic findings encountered in clinical practice.

Supporting information

S1 Table. Magnetic resonance imaging parameters.

Magnetic resonance imaging parameters of the training, validation, and testing cohorts from 219 HCC patients included in this study.

https://doi.org/10.1371/journal.pone.0260630.s001

(DOCX)

S2 Table. Dice Similarity Coefficient (DSC) results.

Dice Similarity Coefficient (DSC) results for the Early-Intermediate-Stage-Net (EIS-Net) and All-Stage-Net (AS-Net) compared against the experts’ manual segmentations.

https://doi.org/10.1371/journal.pone.0260630.s002

(DOCX)

S3 Table. Modified Hausdorff Distance (MHD) results.

Modified Hausdorff Distance (MHD) (in voxels) results for the Early-Intermediate-Stage-Net (EIS-Net) and All-Stage-Net (AS-Net) compared against the experts’ manual segmentations.

https://doi.org/10.1371/journal.pone.0260630.s003

(DOCX)

S4 Table. Mean Absolute Distance (MAD) results.

Mean Absolute Distance (MAD) (in voxels) results for the Early-Intermediate-Stage-Net (EIS-Net) and All-Stage-Net (AS-Net) compared against the experts’ manual segmentations.

https://doi.org/10.1371/journal.pone.0260630.s004

(DOCX)

References

  1. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA Cancer J Clin. 2021;71(3):209–49. pmid:33538338
  2. Siegel RL, Miller KD, Jemal A. Cancer statistics, 2019. CA Cancer J Clin. 2019;69(1):7–34. pmid:30620402
  3. White DL, Thrift AP, Kanwal F, Davila J, El-Serag HB. Incidence of Hepatocellular Carcinoma in All 50 United States, From 2000 Through 2012. Gastroenterology. 2017;152(4):812–20.e5. pmid:27889576
  4. Perz JF, Armstrong GL, Farrington LA, Hutin YJ, Bell BP. The contributions of hepatitis B virus and hepatitis C virus infections to cirrhosis and primary liver cancer worldwide. J Hepatol. 2006;45(4):529–38. pmid:16879891
  5. Hamer OW, Schlottmann K, Sirlin CB, Feuerbach S. Technology insight: advances in liver imaging. Nat Clin Pract Gastroenterol Hepatol. 2007;4(4):215–28. pmid:17404589
  6. Llovet JM, Brú C, Bruix J. Prognosis of hepatocellular carcinoma: the BCLC staging classification. Semin Liver Dis. 1999;19(3):329–38. pmid:10518312
  7. Yau T, Tang VY, Yao TJ, Fan ST, Lo CM, Poon RT. Development of Hong Kong Liver Cancer staging system with treatment stratification for patients with hepatocellular carcinoma. Gastroenterology. 2014;146(7):1691–700. pmid:24583061
  8. Okuda K, Ohtsuki T, Obata H, Tomimatsu M, Okazaki N, Hasegawa H, et al. Natural history of hepatocellular carcinoma and prognosis in relation to treatment. Study of 850 patients. Cancer. 1985;56(4):918–28. pmid:2990661
  9. Leung TW, Tang AM, Zee B, Lau WY, Lai PB, Leung KL, et al. Construction of the Chinese University Prognostic Index for hepatocellular carcinoma and comparison with the TNM staging system, the Okuda staging system, and the Cancer of the Liver Italian Program staging system: a study based on 926 patients. Cancer. 2002;94(6):1760–9. pmid:11920539
  10. Kudo M, Chung H, Osaki Y. Prognostic staging system for hepatocellular carcinoma (CLIP score): its value and limitations, and a proposal for a new staging system, the Japan Integrated Staging Score (JIS score). J Gastroenterol. 2003;38(3):207–15. pmid:12673442
  11. Oken MM, Creech RH, Tormey DC, Horton J, Davis TE, McFadden ET, et al. Toxicity and response criteria of the Eastern Cooperative Oncology Group. Am J Clin Oncol. 1982;5(6):649–55. pmid:7165009
  12. Child CG, Turcotte JG. Surgery and portal hypertension. Major Probl Clin Surg. 1964;1:1–85. pmid:4950264
  13. Ribero D, Chun YS, Vauthey JN. Standardized liver volumetry for portal vein embolization. Semin Intervent Radiol. 2008;25(2):104–9. pmid:21326551
  14. Mayer P, Grozinger M, Mokry T, Schemmer P, Waldburger N, Kauczor HU, et al. Semi-automated computed tomography Volumetry can predict hemihepatectomy specimens’ volumes in patients with hepatic malignancy. BMC medical imaging. 2019;19(1):20. pmid:30808320
  15. Taner CB, Dayangac M, Akin B, Balci D, Uraz S, Duran C, et al. Donor safety and remnant liver volume in living donor liver transplantation. Liver Transpl. 2008;14(8):1174–9. pmid:18668669
  16. Yamanaka J, Saito S, Fujimoto J. Impact of preoperative planning using virtual segmental volumetry on liver resection for hepatocellular carcinoma. World J Surg. 2007;31(6):1249–55. pmid:17440774
  17. Abdalla EK, Adam R, Bilchik AJ, Jaeck D, Vauthey JN, Mahvi D. Improving resectability of hepatic colorectal metastases: expert consensus statement. Ann Surg Oncol. 2006;13(10):1271–80. pmid:16955381
  18. Gruber N, Antholzer S, Jaschke W, Kremser C, Haltmeier M. A Joint Deep Learning Approach for Automated Liver and Tumor Segmentation. 13th International conference on Sampling Theory and Applications (SampTA). 2019:1–5.
  19. Bousabarah K, Letzen B, Tefera J, Savic L, Schobert I, Schlachter T, et al. Automated detection and delineation of hepatocellular carcinoma on multiphasic contrast-enhanced MRI using deep learning. Abdom Radiol (NY). 2021;46(1):216–25. pmid:32500237
  20. Guglielmi A, Ruzzenente A, Conci S, Valdegamberi A, Iacono C. How much remnant is enough in liver resection? Dig Surg. 2012;29(1):6–17. pmid:22441614
  21. Gotra A, Sivakumaran L, Chartrand G, Vu KN, Vandenbroucke-Menu F, Kauffmann C, et al. Liver segmentation: indications, techniques and future directions. Insights Imaging. 2017;8(4):377–92. pmid:28616760
  22. Huber A, Ebner L, Heverhagen JT, Christe A. State-of-the-art imaging of liver fibrosis and cirrhosis: A comprehensive review of current applications and future perspectives. European Journal of Radiology Open. 2015;2:90–100. pmid:26937441
  23. Dodd GD 3rd, Baron RL, Oliver JH 3rd, Federle MP. Spectrum of imaging findings of the liver in end-stage cirrhosis: part I, gross morphology and diffuse abnormalities. AJR American journal of roentgenology. 1999;173(4):1031–6. pmid:10511173
  24. Adams R, Bischof L. Seeded region growing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1994;16(6):641–7.
  25. Zhang X, Tian J, Xiang D, Li X, Deng K. Interactive liver tumor segmentation from ct scans using support vector classification with watershed. Annu Int Conf IEEE Eng Med Biol Soc. 2011;2011:6005–8. pmid:22255708
  26. Huynh HT, Le-Trong N, Bao PT, Oto A, Suzuki K. Fully automated MR liver volumetry using watershed segmentation coupled with active contouring. Int J Comput Assist Radiol Surg. 2017;12(2):235–43. pmid:27873147
  27. Lu F, Wu F, Hu P, Peng Z, Kong D. Automatic 3D liver location and segmentation via convolutional neural network and graph cut. Int J Comput Assist Radiol Surg. 2017;12(2):171–82. pmid:27604760
  28. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44. pmid:26017442
  29. Isensee F, Petersen J, Klein A, Zimmerer D, Jaeger PF, Kohl S, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods. 2020;18(2):203–11. pmid:33288961
  30. Tang X, Jafargholi Rangraz E, Coudyzer W, Bertels J, Robben D, Schramm G, et al. Whole liver segmentation based on deep learning and manual adjustment for clinical use in SIRT. Eur J Nucl Med Mol Imaging. 2020;47(12):2742–52. pmid:32314026
  31. Chlebus G, Meine H, Thoduka S, Abolmaali N, van Ginneken B, Hahn HK, et al. Reducing inter-observer variability and interaction time of MR liver volumetry by combining automatic CNN-based liver segmentation and manual corrections. PloS one. 2019;14(5). pmid:31107915
  32. Wang K, Mamidipalli A, Retson T, Bahrami N, Hasenstab K, Blansit K, et al. Automated CT and MRI Liver Segmentation and Biometry Using a Generalized Convolutional Neural Network. Radiol Artif Intell. 2019;1(2). pmid:32582883
  33. Jansen MJA, Kuijf HJ, Niekel M, Veldhuis WB, Wessels FJ, Viergever MA, et al. Liver segmentation and metastases detection in MR images using convolutional neural networks. J Med Imaging (Bellingham). 2019;6(4):044003. pmid:31620549
  34. Zeng Q, Karimi D, Pang EHT, Mohammed S, Schneider C, Honarvar M, et al. Liver Segmentation in Magnetic Resonance Imaging via Mean Shape Fitting with Fully Convolutional Neural Networks. MICCAI 2019 Lecture Notes in Computer Science. 2019;11765:246–54.
  35. Takenaga T, Hanaoka S, Nomura Y, Nemoto M, Murata M, Nakao T, et al. Four-dimensional fully convolutional residual network-based liver segmentation in Gd-EOB-DTPA-enhanced MRI. Int J Comput Assist Radiol Surg. 2019;14(8):1259–66. pmid:30929130
  36. Elghazy HL, Fakhr MW. Multi-Modal Multi-Stream UNET Model for Liver Segmentation. 2021 IEEE World AI IoT Congress (AIIoT). 2021:28–33.
  37. Winther H, Hundt C, Ringe KI, Wacker FK, Schmidt B, Jurgens J, et al. A 3D Deep Neural Network for Liver Volumetry in 3T Contrast-Enhanced MRI. RoFo: Fortschritte auf dem Gebiete der Rontgenstrahlen und der Nuklearmedizin. 2021;193(3):305–14. pmid:32882724
  38. Villarini B, Asaturyan H, Kurugol S, Afacan O, Bell JD, Thomas EL. 3D Deep Learning for Anatomical Structure Segmentation in Multiple Imaging Modalities. 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS). 2021:166–71.
  39. Liu M, Vanguri R, Mutasa S, Ha R, Liu YC, Button T, et al. Channel width optimized neural networks for liver and vessel segmentation in liver iron quantification. Comput Biol Med. 2020;122:103798. pmid:32658724
  40. Jimenez-Pastor A, Alberich-Bayarri A, Lopez-Gonzalez R, Marti-Aguado D, Franca M, Bachmann RSM, et al. Precise whole liver automatic segmentation and quantification of PDFF and R2* on MR images. European radiology. 2021. pmid:33768292
  41. Heidari M, Taghizadeh M, Masoumi H, Valizadeh M. Liver Segmentation in MRI Images using an Adaptive Water Flow Model. Journal of Biomedical Physics and Engineering. 2021;11(4):527–34. pmid:34458200
  42. Guerra J, Mustafa M, Pandeva T, Pinto F, Matthies P, Brosch-Lenz J, et al. Performance of automatic Liver Volumetry for Selective Internal Radiotherapy. Nuklearmedizin. 2021;60(02):V71.
  43. Yamashita R, Nishio M, Do RKG, Togashi K. Convolutional neural networks: an overview and application in radiology. Insights Imaging. 2018;9(4):611–29. pmid:29934920
  44. Castro DC, Walker I, Glocker B. Causality matters in medical imaging. Nat Commun. 2020;11(1):3673. pmid:32699250
  45. Deussen P, Assion F, Abel B, Abrecht S, Benecke AG, Besold TR, et al. DIN SPEC 92001–1; Artificial Intelligence - Life Cycle Processes and Quality Requirements - Part 1: Quality Meta Model. Beuth Verlag GmbH. 2019.
  46. Chernyak V, Fowler KJ, Kamaya A, Kielar AZ, Elsayes KM, Bashir MR, et al. Liver Imaging Reporting and Data System (LI-RADS) Version 2018: Imaging of Hepatocellular Carcinoma in At-Risk Patients. Radiology. 2018;289(3):816–30. pmid:30251931
  47. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin J-C, Pujol S, et al. 3D Slicer as an Image Computing Platform for the Quantitative Imaging Network. Magn Reson Imaging. 2012;30(9):1323–41. pmid:22770690
  48. Kerfoot E, Clough J, Oksuz I, Lee J, King AP, Schnabel JA. Left-Ventricle Quantification Using Residual U-Net. Statistical Atlases and Computational Models of the Heart Atrial Segmentation and LV Quantification Challenges STACOM 2018 Lecture Notes in Computer Science. 2019;11395:371–80.
  49. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI 2015 Lecture Notes in Computer Science. 2015;9351:234–41.
  50. He K, Zhang X, Ren S, Sun J. Identity mappings in deep residual networks. European conference on computer vision 2016 Oct 8. 2016;9908:630–45.
  51. Onofrey JA, Casetti-Dinescu DI, Lauritzen AD, Sarkar S, Venkataraman R, Fan RE, et al. Generalizable Multi-Site Training and Testing Of Deep Neural Networks Using Image Normalization. Proc IEEE Int Symp Biomed Imaging. 2019:348–51. pmid:32874427
  52. Milletari F, Navab N, Ahmadi S. V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation. 2016 Fourth International Conference on 3D Vision (3DV). 2016:565–71.
  53. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980. 2014.
  54. Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC medical imaging. 2015;15:29. pmid:26263899
  55. Vouche M, Lewandowski RJ, Atassi R, Memon K, Gates VL, Ryu RK, et al. Radiation lobectomy: time-dependent analysis of future liver remnant volume in unresectable liver cancer as a bridge to resection. J Hepatol. 2013;59(5):1029–36. pmid:23811303
  56. Theysohn JM, Ertle J, Muller S, Schlaak JF, Nensa F, Sipilae S, et al. Hepatic volume changes after lobar selective internal radiation therapy (SIRT) of hepatocellular carcinoma. Clin Radiol. 2014;69(2):172–8. pmid:24209871
  57. Tonan T, Fujimoto K, Qayyum A. Chronic Hepatitis and Cirrhosis on MR Imaging. Magnetic Resonance Imaging Clinics of North America. 2010;18(3):383–402. pmid:21094446
  58. Subbaswamy A, Adams R, Saria S. Evaluating Model Robustness and Stability to Dataset Shift. Proceedings of The 24th International Conference on Artificial Intelligence and Statistics. 2021;130:2611–9.