
Lung tumor segmentation methods: Impact on the uncertainty of radiomics features for non-small cell lung cancer

  • Constance A. Owens ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    caowens@mdanderson.org

    Affiliations Department of Radiation Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, The University of Texas Graduate School of Biomedical Sciences at Houston, Houston, Texas, United States of America

  • Christine B. Peterson,

    Roles Conceptualization, Formal analysis, Writing – original draft, Writing – review & editing

    Affiliations The University of Texas Graduate School of Biomedical Sciences at Houston, Houston, Texas, United States of America, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Chad Tang,

    Roles Data curation, Resources, Writing – review & editing

    Affiliation Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Eugene J. Koay,

    Roles Data curation, Writing – review & editing

    Affiliation Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Wen Yu,

    Roles Data curation, Resources, Writing – review & editing

    Affiliation Department of Radiation Oncology, Shanghai Chest Hospital, Shanghai Jiao Tong University, Shanghai, China

  • Dennis S. Mackin,

    Roles Visualization, Writing – review & editing

    Affiliations Department of Radiation Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, The University of Texas Graduate School of Biomedical Sciences at Houston, Houston, Texas, United States of America

  • Jing Li,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Radiation Oncology, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Mohammad R. Salehpour,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Radiation Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • David T. Fuentes,

    Roles Conceptualization, Writing – review & editing

    Affiliation Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Laurence E. Court,

    Roles Conceptualization, Formal analysis, Funding acquisition, Resources, Supervision, Writing – original draft, Writing – review & editing

    Affiliations Department of Radiation Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, The University of Texas Graduate School of Biomedical Sciences at Houston, Houston, Texas, United States of America, Department of Imaging Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America

  • Jinzhong Yang

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Project administration, Resources, Software, Supervision, Visualization, Writing – original draft, Writing – review & editing

    Affiliations Department of Radiation Physics, The University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America, The University of Texas Graduate School of Biomedical Sciences at Houston, Houston, Texas, United States of America

Abstract

Purpose

To evaluate the uncertainty of radiomics features extracted from contrast-enhanced breath-hold helical CT scans of non-small cell lung cancer, for both manual and semi-automatic segmentation, due to intra-observer, inter-observer, and inter-software variability.

Methods

Three radiation oncologists manually delineated lung tumors twice from 10 CT scans using two software tools (3D-Slicer and MIM Maestro). Additionally, three observers without formal clinical training were instructed to use two semi-automatic segmentation tools, Lesion Sizing Toolkit (LSTK) and GrowCut, to delineate the same tumor volumes. The accuracy of the semi-automatic contours was assessed by comparison with physician manual contours using Dice similarity coefficients and Hausdorff distances. Eighty-three radiomics features were calculated for each delineated tumor contour. Informative features were identified based on their dynamic range and correlation to other features. Feature reliability was then evaluated using intra-class correlation coefficients (ICC). Feature range was used to evaluate the uncertainty of the segmentation methods.

Results

From the initial set of 83 features, 40 radiomics features were found to be informative, and these 40 features were used in the subsequent analyses. For both intra-observer and inter-observer reliability, LSTK had higher reliability than GrowCut and the two manual segmentation tools. All observers achieved consistently high ICC values when using LSTK, but the ICC value varied greatly for each observer when using GrowCut and the manual segmentation tools. For inter-software reliability, features were not reproducible across the software tools for either manual or semi-automatic segmentation methods. Additionally, no feature category was found to be more reproducible than another feature category. Feature ranges of LSTK contours were smaller than those of manual contours for all features.

Conclusion

Radiomics features extracted from LSTK contours were highly reliable across and among observers. With semi-automatic segmentation tools, observers without formal clinical training were comparable to physicians in evaluating tumor segmentation.

Introduction

Precision medicine aims to customize cancer treatment for an individual patient by considering combined knowledge (i.e., conventional factors such as age and sex, genetics, proteins, and others) [1,2]. Precision medicine seeks to completely characterize the tumor to determine optimal treatment based on patient-specific characteristics. In recent years, studies have shown that radiomics features have the potential to significantly improve our ability to stratify patients according to likely treatment response beyond conventional prognostic factors, thereby leading to truly personalized cancer care [3–7].

The generic workflow of radiomics studies includes four steps: (1) image acquisition, (2) tumor delineation, (3) feature extraction, and (4) feature analysis [8,9]. The tumor can be delineated manually or with a semi-automatic tool. Once the tumor delineation has been established, radiomics features are extracted from the tumor region of the image. Thousands of radiomics features can be calculated for one tumor, each characterizing the tumor in a different way. For example, roundness is a radiomics feature that characterizes the tumor shape and can be used to predict how the tumor may spread to nearby locations. Lastly, features are evaluated for correlation with prognostic or predictive factors. Features that are shown to be predictive are then used to build outcome models that help predict how a patient will respond to a treatment. For different diseases, different radiomics features can be selected for outcome modeling to predict likely treatment response.

Before radiomics features can be clinically useful, it is necessary to investigate and understand their uncertainties. One major source of uncertainty is the tumor delineation. Precisely delineating a tumor by hand is, in general, difficult. Tumors often lie adjacent to other organs that share similar image characteristics, making it difficult to distinguish the true tumor boundary. Additionally, medical images are far from perfect, as they have limited resolution (limiting our ability to see very small objects) and can contain artifacts (features in an image that do not represent a real aspect of the imaged object). Physicians may interpret the tumor differently, depending on their training and experience [10]. In addition, the different software tools that physicians use to draw tumor contours may also affect the results, depending on user familiarity with the tool. Because radiomics features are calculated from the delineated tumor, uncertainty in tumor delineation can propagate to the radiomics features.

Recent advances in computer-aided automatic and semi-automatic segmentation approaches have been shown to reduce the burden of manual delineation and lessen the inconsistency in tumor delineation [11,12]. To date, a small number of studies have been performed to relate this reduced uncertainty in tumor delineation to the quality and reproducibility of radiomics features [13–17].

In the current study, we examined three specific factors that can influence the uncertainty of radiomics features for both manual and semi-automatic segmentation methods: (1) intra-observer, (2) inter-observer, and (3) inter-software variability. Manual contours were generated by three independent physicians using MIM Maestro™ (MIM Software Inc., Cleveland, Ohio, USA) and 3D-Slicer [18]. Semi-automatic contours were generated by three trained observers using the GrowCut algorithm from 3D-Slicer [11] and the Lesion Sizing Toolkit (LSTK) [19]. While the segmentation accuracy of LSTK has been evaluated [19,20], to our knowledge the reliability of radiomics features extracted from LSTK-generated contours has not been studied. Additionally, we evaluated whether manual and semi-automatic software tools can be used interchangeably for generating contours for feature extraction. The purpose of this study can be summarized in two main objectives. The first was to identify a segmentation tool that produces lung tumor segmentations yielding reliable and robust radiomics features for the same observer, across multiple observers, and across multiple software tools. The second was to identify a group of reliable radiomics features for non-small cell lung cancer (NSCLC) primary tumors.

Materials and methods

Patient data and CT image acquisition

For this study, we retrospectively obtained patient data for 10 patients with histologically verified NSCLC. The Institutional Review Board (IRB) at The University of Texas MD Anderson Cancer Center approved the present retrospective study, and the requirement for informed consent was waived. The lung tumors included in this study had volumes ranging from 1.15 cm3 to 10.53 cm3. For each patient, breath-hold helical computed tomography (CT) scans were acquired with intravenous contrast. The CT scans were acquired on General Electric Healthcare CT scanners with a peak tube voltage of 120 kVp and tube currents ranging from 320 mAs to 570 mAs. Each scan was reconstructed with a slice thickness of 2.5 mm and pixel spacing between 0.635 mm and 0.977 mm. Fig 1 shows a coronal slice of each tumor to display the variety of tumor presentations and locations in this patient cohort.

Fig 1. Tumor presentations and locations.

A central slice of each tumor in the coronal view is displayed to show the variety in tumor locations, shapes and appearances of the patients used in this study. A single physician contour is displayed (red) to identify the tumor in each patient scan.

https://doi.org/10.1371/journal.pone.0205003.g001

Manual segmentation

Manual segmentations were performed by three radiation oncologists using two different software tools: MIM Maestro™ (MIM Software Inc., Cleveland, Ohio) and 3D-Slicer (a free open-source software platform) [18]. Each physician manually segmented each of the 10 tumors using both manual software tools, following the RTOG 1106 contouring guideline [21,22]. This guideline recommends contouring the primary tumor volume on CT images using a standard lung window/level for distinguishing lung borders and using a mediastinal window/level for distinguishing borders adjacent to the mediastinum. This process was performed twice, at two different times, yielding two sets of contours (Fig 2). The time interval between the two sets of contours was approximately 1 year for the first two physicians and 1 month for the third physician. In total, 120 manual tumor contours were generated (2 software tools × 3 observers × 2 contours × 10 tumors). For both manual software tools, tumors were contoured using a paintbrush tool (thresholding in 3D-Slicer) in a slice-by-slice fashion in the transverse plane. Physicians could observe and edit the tumor in the coronal and sagittal planes as well, when desired.

Fig 2. Schematic of the collection of manual and semi-automatic contours.

Each circle and triangle represent a single tumor contour. The time interval between contour set 1 and contour set 2 was 1 year for the contours represented by circles and 1 month for the contours represented by triangles.

https://doi.org/10.1371/journal.pone.0205003.g002

Semi-automatic tumor segmentation

Semi-automatic segmentations were generated using two different software tools: LSTK (a level-set algorithm available from an open-source toolkit) and GrowCut (a region-growing algorithm implemented in 3D-Slicer). For the semi-automatic segmentations, three observers without formal clinical training were instructed to use the two semi-automatic tools to generate tumor segmentations. Verbal step-by-step instructions on using each software tool were given to each observer, after which the observers practiced each tool on three lung tumors outside the study. The entire process took less than 15 minutes, with instruction lasting 5 minutes and practice lasting less than 10 minutes. Once observers felt comfortable with a software tool, the segmentations for this study were collected. The contouring process used for the manual contours was repeated for the semi-automatic contours for the same 10 tumors (Fig 2). The time interval between the two sets was 1 to 2 months for each observer to lessen memory effects; other studies have shown that 3 weeks between contouring runs is enough to mitigate the effects of memory [23].

For GrowCut, observers labeled foreground and background pixels with two clicks (Fig 3) in each view, totaling at least six clicks per tumor case. If the tumor was attached to the chest wall or mediastinum, additional clicks at appropriate locations were needed to help the algorithm differentiate the tumor from the chest wall or mediastinum. Once labels were established, the GrowCut algorithm was run, followed by manual editing of the GrowCut-generated contours. The editing process took up to 2 minutes for some tumor cases.

Fig 3. User inputs for initializing semi-automatic segmentation tools.

(A) LSTK requires the user to select a seed within the tumor (red) to initiate the segmentation algorithm. Defining the maximum tumor radius generates a 3D bounding box (green) centered about the seed, within which the segmentation result is confined. (B) GrowCut requires the user to label foreground (blue) and background (yellow) pixels to initiate the segmentation algorithm. Once labels were established, the GrowCut algorithm was run, followed by manual editing of the GrowCut-generated contours. Note that only the transverse view is shown here; observers also labeled foreground and background pixels in the coronal and sagittal planes for each tumor case.

https://doi.org/10.1371/journal.pone.0205003.g003

For LSTK, the only required interaction was to pick a seed, a user-selected voxel within the tumor (Fig 3). Defining the maximum tumor radius was optional; however, an appropriate maximum tumor radius might save computation time when running LSTK. The LSTK algorithm has several preset parameters that can affect the segmentation result. We used the initial physician manual contours to guide our selection of these parameters. Detailed discussions of the GrowCut and LSTK algorithms can be found in other publications [19,20].

Validating tumor segmentation accuracy

We validated the accuracy of each semi-automatic segmentation. A group-consensus contour was generated as the ground truth, taken to be the intersecting tumor volume shared by a majority of experts [23–25]. In this study, the group-consensus contour consisted of the tumor region where at least four of the initial six manual physician contours overlapped. To assess the accuracy of each tumor segmentation, the Dice similarity coefficient (DSC) and Hausdorff distance (HD) were calculated between the group-consensus contour and each individual semi-automatic contour. The DSC quantifies the spatial overlap between two contours, while the HD quantifies the maximum distance from a point on one contour boundary to the nearest point on the other. While the DSC can detect incorrectly labeled voxels, the HD is better at detecting deviations (sharp spikes or tiny holes) that significantly alter the contour shape but do not substantially alter the volume.
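These two metrics can be sketched in a few lines of NumPy/SciPy. This is an illustrative implementation rather than the study's actual code; the function and variable names are our own, and the Hausdorff distance here operates on boundary point sets expressed in physical coordinates:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_coefficient(mask_a, mask_b):
    """Dice similarity coefficient between two boolean masks:
    2|A intersect B| / (|A| + |B|)."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    return 2.0 * intersection / (mask_a.sum() + mask_b.sum())

def hausdorff_distance(points_a, points_b):
    """Symmetric Hausdorff distance between two (n, d) arrays of
    boundary points: the larger of the two directed distances."""
    return max(directed_hausdorff(points_a, points_b)[0],
               directed_hausdorff(points_b, points_a)[0])
```

A DSC of 1 indicates perfect overlap and 0 indicates none, while an HD of 0 indicates identical boundaries.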

Feature extraction

Features were calculated for all 240 tumor segmentations (120 manual + 120 semi-automatic). Feature extraction was performed using the open-source Imaging Biomarker Explorer (IBEX) software [26]. A total of 83 features were calculated. We stratified the features into three main categories: geometric shape (SHP), intensity histogram (HIS), and texture (TXT). Co-occurrence matrix features (a subcategory of texture features) were calculated in four directions (0, 45, 90, and 135 degrees), and the final value was taken as the average over these four directions to avoid directional bias [27]. A common pre-processing step used to refine contours before feature extraction is to remove voxels with intensity values of normal lung tissue, bone, or air that might fall inside the tumor contour. Because the purpose of this study is to investigate the effect of segmentation uncertainty on radiomics features, we omitted this step to adhere to the original segmentations. We also did not correct for pixel size [28] or perform smoothing [29], to avoid introducing other uncertainties into this study.

Feature reduction

One common approach for narrowing the feature set is to apply a combination of different methods in a sequential manner [9,14,15,30,31] to remove features that are non-informative or redundant. In the current study, we applied two steps to reduce the initial feature set of 83 features to 40 informative and non-redundant features. The first step was to remove features that did not vary across different patients. For a feature to be informative, it must exhibit a range of values across different patients [9,14]; in other words, it must have a wide dynamic range to differentiate patients. Because multiple contours were generated for each patient, the average feature value was calculated for each patient. Before calculating the normalized dynamic range (NDR) for each feature, the per-patient average values of each feature were rescaled across patients to a mean of 0 and a standard deviation of 1 using z-score normalization, so that features with values on different scales could be compared. The NDR for each feature f was calculated as

NDR_f = z̄_f,max − z̄_f,min,

where z̄_f,max is the maximum normalized average feature value across all patients and z̄_f,min is the minimum normalized average feature value across all patients. Once the NDR was calculated for each feature, a cutoff value was chosen to remove the least informative features. In general, the cutoff value is chosen somewhat arbitrarily and may be set higher or lower [9,15]. For the second step, highly correlated features were removed. It is well known that many radiomics features are highly correlated [9]. To address this, we computed a correlation matrix to identify highly correlated features; in this step, Spearman correlation coefficients were computed between all pairs of features.
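As a rough sketch (assuming the per-patient average feature values are already computed; this is not the study's code), the NDR step reduces to a z-score followed by a max-minus-min per feature:

```python
import numpy as np

def normalized_dynamic_range(feature_matrix):
    """NDR per feature: z-score the per-patient average feature
    values across patients, then take max minus min.

    feature_matrix: (n_patients, n_features) array of per-patient
    average feature values.
    """
    z = (feature_matrix - feature_matrix.mean(axis=0)) / feature_matrix.std(axis=0)
    return z.max(axis=0) - z.min(axis=0)
```

Because the z-scored values have unit standard deviation, the NDR is a unitless spread that can be compared across features of very different scales.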

Feature reliability analysis

In this study, we examined three specific factors that can influence feature reliability: intra-observer, inter-observer, and inter-software (Table 1). Intra-observer agreement is a reliability measure of repeatability, while inter-observer and inter-software agreement are reliability measures of reproducibility [32]. To assess feature reliability, intraclass correlation coefficients (ICCs) were calculated for each feature. There are ten different forms of the ICC [33], and selecting the appropriate form depends on the experimental setup. To assess intra-observer reliability, we used a one-way random-effects model in which the tumor cases are a random effect. To assess inter-observer and inter-software reliability, we used a two-way mixed-effects model in which the tumor cases are a random effect and the observers (for inter-observer) or the software tools (for inter-software) are a fixed effect. The specific ICC form used to assess each reliability relationship is shown in Table 1. The ICC values, which can range from −1 to 1, were stratified into four classifications: values less than 0.4, between 0.4 and 0.6, between 0.6 and 0.75, and greater than 0.75 represented poor, fair, good, and excellent reliability, respectively [23].
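As an illustration of the one-way random-effects form used here for intra-observer reliability, ICC(1,1) can be computed from the between-case and within-case mean squares. This sketch assumes a balanced table of tumor cases by repeated contour runs and is not the study's code (the two-way mixed-effects forms in Table 1 would require an additional rater term):

```python
import numpy as np

def icc_one_way(ratings):
    """ICC(1,1): one-way random-effects, single measurement.

    ratings: (n_cases, k_runs) array; rows are tumor cases and
    columns are repeated measurements (e.g., contour runs).
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    case_means = ratings.mean(axis=1)
    # Between-cases mean square
    ms_between = k * ((case_means - grand_mean) ** 2).sum() / (n - 1)
    # Within-case (residual) mean square
    ms_within = ((ratings - case_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

With identical columns the ICC is 1; as within-case scatter grows relative to between-case scatter, the ICC falls toward (and can dip below) 0.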

Table 1. ICC formulas used to assess feature reliability.

https://doi.org/10.1371/journal.pone.0205003.t001

Correlation between ICC and CCC.

Concordance correlation coefficients (CCCs) were also calculated because other feature reliability studies have used the CCC metric in their analysis [14,29,34,35]. Spearman rank correlation coefficients and pairwise scatterplots were computed between the ICC and CCC estimates for each reliability relationship.

Identifying reliable feature categories.

For this part of the analysis, we wanted to determine whether a specific feature category (shape, histogram, texture) was significantly more reproducible than another feature category. For this determination, Wilcoxon rank sum tests (also known as Mann–Whitney U tests) were computed between each pair of feature categories (e.g., shape versus histogram) for each ICC relationship.

Feature range analysis

For segmentations from each software tool, we calculated the feature range (inter-patient variability) across observers for each radiomics feature. First, we normalized each feature using z-score normalization, which allowed us to more easily compare and plot features on different scales. Each normalized feature value, ẑ_p,i, was calculated as

ẑ_p,i = (f_p,i − μ_p,f) / σ_p,f,

where f_p,i is the value of feature f for contour i from patient p, μ_p,f is the mean value of feature f over all contours from patient p, and σ_p,f is the standard deviation of feature f over all contours from patient p. We then recorded the minimum and maximum normalized feature values for each segmentation method to assess its feature range.
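This per-patient normalization and range extraction can be sketched as follows; the function and variable names are illustrative, not the study's code:

```python
import numpy as np

def feature_range_by_method(values, patient_ids, method_ids):
    """Z-score each feature value within its patient (over all of
    that patient's contours), then report the min and max normalized
    value for each segmentation method.

    values, patient_ids, method_ids: parallel sequences, one entry
    per contour, for a single radiomics feature.
    """
    values = np.asarray(values, dtype=float)
    patient_ids = np.asarray(patient_ids)
    method_ids = np.asarray(method_ids)
    z = np.empty_like(values)
    for p in np.unique(patient_ids):
        mask = patient_ids == p
        z[mask] = (values[mask] - values[mask].mean()) / values[mask].std()
    return {m: (z[method_ids == m].min(), z[method_ids == m].max())
            for m in np.unique(method_ids)}
```

A method whose contours cluster tightly within each patient yields a narrow (min, max) interval, i.e., a small feature range.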

Results

Validating tumor segmentation accuracy

For the semi-automatic tools, the mean DSCs were 0.88 ± 0.06 and 0.88 ± 0.08 for LSTK and GrowCut, respectively (Fig 4), and the mean HD values were 0.48 ± 0.17 cm and 0.43 ± 0.20 cm for LSTK and GrowCut, respectively. The DSC and HD results show that trained observers using these semi-automatic tools can achieve contours comparable to the group-consensus physician contour, and hence these semi-automatically generated contours can be used for feature extraction.

Fig 4. Validating segmentation accuracy of semi-automatic contours.

Box plot of the Dice similarity coefficients and Hausdorff distances by software tool displays the segmentation accuracy for each software tool.

https://doi.org/10.1371/journal.pone.0205003.g004

Feature reduction

To identify non-informative features, the NDR was calculated for each feature. A histogram showing the number of features within each range of NDR values is shown in Fig 5. All features had an NDR value greater than 2.4, and hence all features were considered to exhibit large enough inter-patient variability to remain in the feature set. To evaluate the correlation between all features, pair-wise Spearman correlation coefficients were computed (Fig 6). Pair-wise correlation coefficients with an absolute value larger than 0.95 were regarded as highly redundant [15]. For each highly correlated pair, the feature with the larger mean absolute correlation was removed, reducing the feature set to 40 non-redundant features (Fig 7).
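The redundancy-removal step can be sketched as a greedy elimination in the spirit of R caret's findCorrelation; the tie-breaking details here are our own assumptions, not necessarily the study's exact procedure:

```python
import numpy as np
from scipy.stats import spearmanr

def drop_correlated(features, names, cutoff=0.95):
    """Repeatedly find the most-correlated surviving pair and drop
    the member with the larger mean absolute correlation, until no
    pair exceeds the cutoff.

    features: (n_samples, n_features) array; names: feature labels.
    """
    corr = np.abs(np.atleast_2d(spearmanr(features)[0]))
    keep = list(range(features.shape[1]))
    while True:
        sub = corr[np.ix_(keep, keep)]
        np.fill_diagonal(sub, 0.0)
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        if sub[i, j] <= cutoff:
            return [names[k] for k in keep]
        # Drop whichever member of the pair is, on average, more redundant
        keep.pop(i if sub[i].mean() >= sub[j].mean() else j)
```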

Fig 5. Histogram distribution of the normalized dynamic range for all 83 radiomics features.

The histogram distribution shows the number of features within a range of NDR values where each bin has a width of 0.05.

https://doi.org/10.1371/journal.pone.0205003.g005

Fig 6. Spearman correlation coefficient heat map including all initial 83 features.

Spearman correlation coefficients were computed for 83 radiomics features. Green, white, and red denote positive, random, and negative correlations, respectively. A large number of features were highly correlated.

https://doi.org/10.1371/journal.pone.0205003.g006

Fig 7. Spearman correlation coefficient heat map including 40 non-redundant features.

Feature pairs with absolute Spearman correlation coefficients less than 0.95 are shown. Features in pairs with absolute coefficients larger than 0.95 were regarded as highly redundant and were eliminated from the initial feature set, reducing it to 40 non-redundant features. Green, white, and red denote positive, random, and negative correlations, respectively. Correlation coefficients marked with an x are not significant.

https://doi.org/10.1371/journal.pone.0205003.g007

Feature reliability analysis

Correlation between ICC and CCC.

For all reliability relationships, the Spearman rank correlation coefficients between the CCC and ICC values showed a strong, statistically significant positive correlation (ρ>0.965, p<0.0001), indicating that the feature reliability ranking was nearly the same for the two metrics. In the pairwise scatterplots, all reliability relationships were well fit by a positive linear regression line (R2>0.982, p<0.0001). These results indicate that the ICC and CCC metrics yield similar results for this analysis.

Feature repeatability: Intra-observer.

For intra-observer reliability, we wanted to evaluate whether features could be extracted reliably from tumor contours generated by a single observer using a single software tool at multiple time points. For each feature, ICC values were calculated between the features generated from the first and second contour runs for each user and software tool combination. The results showed that intra-observer reliability was highly observer dependent (Fig 8, Table 2). For the manual tools, the average ICC values were much lower for physicians 1 and 2 (MIM: 0.63 and 0.17; 3D-Slicer: 0.72 and 0.83) than for physician 3 (MIM: 0.96; 3D-Slicer: 0.96). This is likely because the time between contour runs was 1 year for physicians 1 and 2 but only 1 month for physician 3. For the semi-automatic tools, all observers achieved higher average ICC values with LSTK (0.97, 0.98, 0.85) than with GrowCut (0.94, 0.85, 0.75). This shows that LSTK can minimize the effect of intra-observer variability compared with GrowCut, as was shown with observer 3, whose average ICC value improved substantially from 0.75 (GrowCut) to 0.95 (LSTK). LSTK requires less user interaction than GrowCut, which typically requires manual editing after segmentation, and thus leads to more consistent feature values.

Fig 8. Intra-observer reliability.

Box plot of ICCs for each intra-observer relationship. ICC values were computed between contour run 1 and contour run 2 for each feature. Each physician/observer and software tool combination is plotted along the x-axis. Intra-observer reliability was observer-dependent. All observers achieved excellent feature reliability with LSTK.

https://doi.org/10.1371/journal.pone.0205003.g008

Feature reproducibility: Inter-observer.

For inter-observer variability, we wanted to evaluate whether features could be extracted reliably from tumor contours generated by multiple observers using a single software tool. For each feature, ICC values were calculated between the features generated by multiple users for each contour run and software tool combination. For both manual tools, the average ICC was less than 0.79 for both contour runs (Fig 9, Table 2). For the semi-automatic tools, GrowCut (0.70, 0.85) had inferior feature reliability compared with LSTK (0.98, 0.96). Moreover, LSTK had average ICC values that fell within the excellent ICC classification for contour run 1 and contour run 2. This shows that LSTK has superior feature reliability across observers compared with the other software tools used in this study.

Fig 9. Inter-observer reliability.

Box plot of ICCs for each inter-observer relationship. The ICC values were computed between all physician/observer contours for each feature. Each contour run and software tool combination is plotted along the x-axis. Inter-observer reliability was superior with LSTK compared with all other software tools.

https://doi.org/10.1371/journal.pone.0205003.g009

Feature reproducibility: Inter-software.

For inter-software reliability, we sought to evaluate whether features could be extracted reliably from tumor contours generated by a single observer using multiple software tools. For each feature, ICC values were calculated between the features generated by multiple software tools for each user. For both manual and semi-automatic methods, the average ICC was less than 0.78 for all physicians and observers (Fig 10, Table 2). Although 0.78 falls within the good reproducibility bounds, it is important to note that the confidence intervals for these results are very large (which could be attributable to the small sample size used in this study) and that for many features the lower bound of the confidence interval overlaps with the bounds of the ICC classification for poor reproducibility. These results indicate that different software tools do not yield reproducible features and should not be used interchangeably. This has also been concluded by other studies looking specifically at lung nodule volumes [36,37].

Fig 10. Inter-software reliability.

Box plot of ICCs for each inter-software relationship. The ICC values were computed between contours generated by two different software tools for each feature. Each contour run and segmentation method combination is plotted along the x-axis. Inter-software reliability was relatively low for all inter-software relationships, with the ICC values for many features falling within the poor classification.

https://doi.org/10.1371/journal.pone.0205003.g010

Because the boxplots (Figs 8–10) show only the spread of ICC values for each ICC relationship, Fig 11 shows the ICC classification of each feature for each ICC relationship. ICC values were sorted into their respective classifications based on the lower bound of the 95% confidence interval of the ICC value (Fig 11). Koo et al. recommend using the 95% confidence interval, rather than the ICC estimate, to evaluate the level of reliability, as the ICC estimate is merely an expected value of the true ICC [38]. The results in Fig 11 further support the finding that LSTK has superior feature reproducibility, with 31 of the 40 features having lower-bound values in the excellent classification for all intra-observer and inter-observer relationships. These results showed that LSTK helps to improve feature reliability for many features across observers and for repeat measures performed by a single observer. Additionally, most features, irrespective of the segmentation method, contour run, or physician/observer, fell within the poor classification for all inter-software relationships.

thumbnail
Fig 11. ICC classification of each radiomics feature for each ICC relationship.

Red, orange, yellow, and green cells denote the ICC classifications of poor (ICC < 0.40), fair (0.40 ≤ ICC < 0.60), good (0.60 ≤ ICC < 0.75), and excellent (ICC ≥ 0.75) reproducibility, respectively [33].

https://doi.org/10.1371/journal.pone.0205003.g011
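The sorting of features into reproducibility classes by the lower bound of the 95% confidence interval can be written as a small helper; the cutoffs follow the classification used in Fig 11 [33], and the function name is ours:

```python
def icc_class(ci_lower):
    """Classify reproducibility from the lower bound of the ICC's
    95% confidence interval, using the cutoffs of Fig 11 [33]."""
    if ci_lower < 0.40:
        return "poor"
    if ci_lower < 0.60:
        return "fair"
    if ci_lower < 0.75:
        return "good"
    return "excellent"
```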

Identifying reliable feature categories.

In this part of the analysis, we evaluated whether any feature category was more reproducible than the others. The Wilcoxon rank sum tests showed that for all ICC relationships, the reproducibility of shape features did not significantly differ from that of histogram features, and the reproducibility of histogram features did not significantly differ from that of texture features (Fig 12). When comparing shape features with texture features, only four ICC relationships had shape features that were significantly more reproducible than texture features, whereas three ICC relationships had shape features that were significantly less reproducible than texture features. Overall, no feature category was found to be consistently more reproducible than another.

thumbnail
Fig 12. Wilcoxon rank sum results between intraclass correlation coefficients for different feature categories.

Asterisks indicate that the median ICC was significantly different (p<0.05) between the two feature categories being compared. Blue cells indicate that the reproducibility of texture features was significantly less than the reproducibility of shape features. Red cells indicate that the reproducibility of texture features was significantly greater than the reproducibility of shape features.

https://doi.org/10.1371/journal.pone.0205003.g012
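The per-category comparison above can be reproduced with a standard Wilcoxon rank sum test. The ICC values below are hypothetical and serve only to show the mechanics:

```python
from scipy.stats import ranksums

# Hypothetical ICC values for two feature categories (illustration only).
shape_iccs = [0.91, 0.88, 0.95, 0.90, 0.93, 0.89, 0.96, 0.92]
texture_iccs = [0.42, 0.55, 0.38, 0.61, 0.47, 0.50, 0.44, 0.58]

# Two-sided rank sum test; p < 0.05 is the threshold used in Fig 12.
stat, p = ranksums(shape_iccs, texture_iccs)
significant = p < 0.05
```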

Feature range analysis

To assess the range of each feature, we plotted the minimum and maximum normalized feature values for each segmentation method (Fig 13). The semi-automatic contours had smaller feature ranges than the manual delineations for most features (33 of the 40). Furthermore, for LSTK, all features had smaller ranges across observers than the manual delineations. Additionally, all but four features had ranges that overlapped with the manual ranges.

thumbnail
Fig 13. Normalized feature range.

Comparison of normalized feature range between manual and semi-automatic methods using z-score normalization. The minimum and maximum values are plotted for each feature and segmentation method.

https://doi.org/10.1371/journal.pone.0205003.g013
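The normalized range plotted in Fig 13 can be sketched as follows, using the z-score normalization stated in the caption. Whether normalization was performed per segmentation method or over the pooled set of contours is not specified here, so the input grouping is an assumption:

```python
import numpy as np

def normalized_range(values):
    """Z-score normalize one feature's values, then return the
    (min, max) span plotted for that feature in Fig 13."""
    v = np.asarray(values, dtype=float)
    z = (v - v.mean()) / v.std()  # population std (ddof=0)
    return z.min(), z.max()
```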

Discussion

Tumor delineation is an important aspect of the radiomics workflow. Variation in contouring can affect the extracted feature values, which would undoubtedly influence subsequent steps in the radiomics workflow. Identifying contouring software tools that improve feature reliability helps to mitigate feature uncertainties that arise from inconsistent contouring. In this study, we evaluated the uncertainty of radiomics features from both manual and semi-automatic segmentation due to intra-observer, inter-observer, and inter-software variability. We found that, using a semi-automatic segmentation tool such as LSTK, observers without formal clinical training can generate contours comparable to those manually drawn by formally trained physicians (Fig 4).

In terms of intra-observer reliability, we found that features extracted from LSTK contours were more reliable than those extracted from contours generated with the other software tools for all observers (Fig 8, Table 2). Of the two semi-automatic segmentation tools, LSTK showed better intra-observer reliability than GrowCut because less human interaction was needed to generate contours with LSTK, as exemplified by the improvement in intra-observer reliability for observer 3 (Table 2). For inter-observer reliability, we found that features extracted from LSTK contours were more reliable across observers than features extracted with all other software tools (Fig 9). Regarding inter-software reliability, we found that different software tools do not yield reproducible features, even when the same observer uses the two tools (Fig 10). In other words, segmentation tools cannot be used interchangeably if the contours will be used in subsequent radiomics studies. We also found that the feature range across observers was smaller for all features generated from LSTK contours than for contours generated with the other tools (Fig 13), implying less uncertainty when the contours were generated with less human interaction. Thus, to minimize uncertainty in radiomics studies, one should adhere to a single contouring approach and automate the contouring process as much as possible. Finally, no feature category was found to be consistently more reproducible than another (Fig 12).

Our findings agree with a previous study, which found that features were less reliable when extracted from segmentations generated with different algorithms (similar to our inter-software relationship) than when extracted from repeat runs of the same algorithm (similar to our intra-observer relationship) [17]. The difference between our study and that of Kalpathy-Cramer et al. is that we also examined the effect of different observers using the same segmentation tool. This interaction is important to assess because different observers, depending on their training and familiarity with the segmentation tool, may use the same tool differently, which can affect the final segmentation.

There are three main limitations of this study. The first limitation is that a small patient population was used. Sample size is an important factor to consider when using inferential statistics such as the ICC. Small sample sizes lack power and can result in large confidence intervals [39]. The negative ICC values observed in this study could be caused by the insufficient sample size as well. Future studies with larger sample sizes may help to reduce wide confidence intervals. Despite the small sample size, however, the width of the confidence intervals was narrower for all features extracted from LSTK contours compared with the other software tools for all intra-observer and inter-observer relationships.

The second limitation is that the ICC (as is the case for any reliability measure) depends on the heterogeneity of the tumors in the study population [40,41]. Populations that are more heterogeneous (where the between-subject standard deviation is larger) will yield higher ICC values than more homogeneous populations. Because of this dependence, we reported confidence intervals of the ICC averages (Table 2), as well as the tumor volume range (1.15 cm3 to 10.53 cm3) for this patient population, to give a sense of the between-patient tumor heterogeneity.

The third limitation is that we tested only the most popular radiomics features instead of an exhaustive list of radiomics features. One group of radiomics features worth mentioning is the edge sharpness features [42]. On the basis of their construction, we expect edge sharpness features to be highly correlated with the shape features tested here. For example, the shape features sphericity and compactness would be influenced by the smoothness of the tumor's boundary, with smoother boundaries yielding larger feature values and rougher boundaries yielding smaller ones. Because both shape and edge features are calculated from the tumor boundary, we believe that edge features may exhibit feature variability due to segmentation differences similar to that observed for shape features.

Although we showed that LSTK improves feature reliability (within and across observers), its effect on outcome modeling has not been evaluated. Radiomics features alone are not very meaningful. After feature extraction, features are often evaluated to see if they correlate with prognostic or predictive factors. An important future study would be to evaluate the effect that contouring can have on building outcome models. This study and others have shown that semi-automatic tools improve feature reliability [13–16]; however, to the best of our knowledge the effects of these tools on building outcome models have yet to be studied. Also, semi-automatic tools that yield accurate segmentations and improve segmentation consistency within and across observers are not only helpful for feature reliability studies but also can help with subsequent studies that utilize tumor contours in their analysis. Examples of such studies include but are not limited to longitudinal radiomics studies (delta-radiomics) and longitudinal clinical studies [7,43] that assess tumor response, where contours may be generated by different observers or at different time points by a given observer.

Conclusion

Our findings showed that radiomics features computed from semi-automatically segmented volumes have better reproducibility and reliability than those computed from manually segmented volumes. Of the semi-automatic segmentation tools, the one requiring less human interaction (i.e., LSTK) also resulted in better feature reliability. Our results also showed that with semi-automatic segmentation tools, observers without formal clinical training generated tumor segmentations comparable to those of physicians. Our findings suggest the need to develop fully automatic segmentation tools (requiring no user input) for radiomics studies in order to minimize the impact of contouring uncertainty and to improve feature reproducibility and repeatability for subsequent analyses such as radiomics outcome studies or longitudinal clinical studies that assess tumor response.

Acknowledgments

The authors would like to thank Michael Worley and the Department of Scientific Publications at MD Anderson Cancer Center for scientific editing.

References

  1. Mirnezami R, Nicholson J, Darzi A. Preparing for Precision Medicine. N Engl J Med. 2012;366: 489–491. pmid:22256780
  2. Jackson SE, Chester JD. Personalised cancer medicine. Int J Cancer. 2015;137: 262–266. pmid:24789362
  3. Yip SSF, Aerts HJWL. Applications and limitations of radiomics. Phys Med Biol. 2016;61: R150–66. pmid:27269645
  4. Aerts HJWL, Velazquez ER, Leijenaar RTH, Parmar C, Haibe-Kains B, Grossmann P, et al. Decoding tumor phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5. pmid:24892406
  5. Huang Y, Liu Z, He L, Chen X, Pan D, Ma Z, et al. Radiomics Signature: A Potential Biomarker for the Prediction of Disease-Free Survival in Early-Stage (I or II) Non-Small Cell Lung Cancer. Radiology. 2016;281: 947–957. pmid:27347764
  6. Aerts HJWL, Grossmann P, Tan Y, Oxnard GG, Rizvi N, Schwartz LH, et al. Defining a Radiomic Response Phenotype: A Pilot Study using targeted therapy in NSCLC. Sci Rep. 2016;6. pmid:27645803
  7. Fave X, Zhang L, Yang J, Mackin D, Balter P, Gomez D, et al. Delta-radiomics features for the prediction of patient outcomes in non-small cell lung cancer. Sci Rep. 2017;7.
  8. Court LE, Fave X, Mackin D, Lee J, Yang J, Zhang L. Computational resources for radiomics. Transl Cancer Res. 2016;5: 340–348.
  9. Kumar V, Gu Y, Basu S, Berglund A, Eschrich SA, Schabath MB, et al. Radiomics: The process and the challenges. Magn Reson Imaging. 2012;30: 1234–1248.
  10. Njeh CF. Tumor delineation: The weakest link in the search for accuracy in radiotherapy. J Med Phys. 2008;33: 136–140. pmid:19893706
  11. Velazquez ER, Parmar C, Jermoumi M, Mak RH, Van Baardwijk A, Fennessy FM, et al. Volumetric CT-based segmentation of NSCLC using 3D-Slicer. Sci Rep. 2013;3: 1–7. pmid:24346241
  12. Gu Y, Kumar V, Hall LO, Goldgof DB, Li CY, Korn R, et al. Automated delineation of lung tumors from CT images using a single click ensemble segmentation approach. Pattern Recognit. 2013;46: 692–702. pmid:23459617
  13. Parmar C, Velazquez ER, Leijenaar R, Jermoumi M, Carvalho S, Mak RH, et al. Robust radiomics feature quantification using semiautomatic volumetric segmentation. PLoS One. 2014;9: 1–8. pmid:25025374
  14. Balagurunathan Y, Gu Y, Wang H, Kumar V, Grove O, Hawkins S, et al. Reproducibility and Prognosis of Quantitative Features Extracted from CT Images. Transl Oncol. 2014;7: 72–87. pmid:24772210
  15. Lee M, Woo B, Kuo MD, Jamshidi N, Kim JH. Quality of radiomic features in glioblastoma multiforme: Impact of semi-automated tumor segmentation software. Korean J Radiol. 2017;18: 498–509. pmid:28458602
  16. Qiu Q, Duan J, Gong G, Lu Y, Li D, Lu J, et al. Reproducibility of radiomic features with GrowCut and GraphCut semiautomatic tumor segmentation in hepatocellular carcinoma. Transl Cancer Res. 2017;6: 940–948.
  17. Kalpathy-Cramer J, Mamomov A, Zhao B, Lu L, Cherezov D, Napel S, et al. Radiomics of Lung Nodules: A Multi-Institutional Study of Robustness and Agreement of Quantitative Imaging Features. Tomography. 2016;2: 430–437. pmid:28149958
  18. Fedorov A, Beichel R, Kalpathy-Cramer J, Finet J, Fillion-Robin J-C, Pujol S, et al. 3D Slicer as an image computing platform for the Quantitative Imaging Network. Magn Reson Imaging. 2012;30: 1323–1341. pmid:22770690
  19. Krishnan K, Ibanez L, Turner WD, Jomier J, Avila RS. An open-source toolkit for the volumetric measurement of CT lung lesions. Opt Express. 2010;18: 15256–15266. pmid:20640012
  20. Yip SSF, Parmar C, Blezek D, Estepar RSJ, Pieper S, Kim J, et al. Application of the 3D Slicer chest imaging platform segmentation algorithm for large lung nodule delineation. PLoS One. 2017;12: e0178944. pmid:28594880
  21. Stewart J, Kong S, Williams H, Al-Basheer A, Moores MA. RTOG 1106/ACRIN 6697 Randomized Phase II Trial of Individualized Adaptive Radiotherapy Using During-Treatment FDG-PET/CT and Modern Technology in Locally Advanced Non-Small Cell Lung Cancer (NSCLC). 2012.
  22. Kong F-M, Ten Haken RK, Schipper M, Frey KA, Hayman J, Gross M, et al. Effect of Midtreatment PET/CT-Adapted Radiation Therapy With Concurrent Chemotherapy in Patients With Locally Advanced Non-Small-Cell Lung Cancer. JAMA Oncol. 2017;3: 1358–1365. pmid:28570742
  23. Zhao B, Tan Y, Tsai W, Qi J, Xie C, Lu L, et al. Reproducibility of radiomics for deciphering tumor phenotype with imaging. Sci Rep. 2016;6. pmid:27009765
  24. Jameson MG, Holloway LC, Vial PJ, Vinod SK, Metcalfe PE. A review of methods of analysis in contouring studies for radiation oncology. J Med Imaging Radiat Oncol. 2010;54: 401–410. pmid:20958937
  25. Vinod SK, Min M, Jameson MG, Holloway LC. A review of interventions to reduce inter-observer variability in volume delineation in radiation oncology. J Med Imaging Radiat Oncol. 2016;60: 393–406. pmid:27170216
  26. Zhang L, Fried DV, Fave XJ, Hunter LA, Yang J, Court LE. IBEX: An open infrastructure software platform to facilitate collaborative work in radiomics. Med Phys. 2015;42: 1341–1353. pmid:25735289
  27. Fave X, Cook M, Frederick A, Zhang L, Yang J, Fried D, et al. Preliminary investigation into sources of uncertainty in quantitative imaging features. Comput Med Imaging Graph. 2015;44: 54–61. pmid:26004695
  28. Mackin D, Fave X, Zhang L, Yang J, Jones AK, Ng CS. Harmonizing the pixel size in retrospective computed tomography radiomics studies. PLoS One. 2017;12: e0178524.
  29. Hunter LA, Krafft S, Stingo F, Choi H, Martel MK, Kry SF, et al. High quality machine-robust image features: Identification in nonsmall cell lung cancer computed tomography images. Med Phys. 2013;40: 121916. pmid:24320527
  30. Balagurunathan Y, Kumar V, Gu Y, Kim J, Wang H, Liu Y, et al. Test-Retest Reproducibility Analysis of Lung CT Image Features. J Digit Imaging. 2014;27: 805–823. pmid:24990346
  31. Lu L, Ehmke RC, Schwartz LH, Zhao B. Assessing agreement between radiomic features computed for multiple CT imaging settings. PLoS One. 2016;11: 1–12. pmid:28033372
  32. Watson PF, Petrie A. Method agreement analysis: A review of correct methodology. Theriogenology. 2010;73: 1167–1179. pmid:20138353
  33. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1: 30–46.
  34. Cicchetti DV, Showalter D, Tyrer PJ. The Effect of Number of Rating Scale Categories on Levels of Interrater Reliability: A Monte Carlo Investigation. Appl Psychol Meas. 1985;9: 31–36.
  35. Yang J, Zhang L, Fave XJ, Fried DV, Stingo FC, Ng CS, et al. Uncertainty analysis of quantitative imaging features extracted from contrast-enhanced CT in lung tumors. Comput Med Imaging Graph. 2016;48: 1–8. pmid:26745258
  36. Ashraf H, De Hoop B, Shaker SB, Dirksen A, Bach KS, Hansen H, et al. Lung nodule volumetry: Segmentation algorithms within the same software package cannot be used interchangeably. Eur Radiol. 2010;20: 1878–1885. pmid:20306082
  37. Kalpathy-Cramer J, Zhao B, Goldgof D, Gu Y, Wang X, Yang H, et al. A Comparison of Lung Nodule Segmentation Algorithms: Methods and Results from a Multi-institutional Study. J Digit Imaging. 2016;29: 476–487. pmid:26847203
  38. Koo TK, Li MY. A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med. 2016;15: 155–163. pmid:27330520
  39. Bonett DG. Sample size requirements for estimating intraclass correlations with desired precision. Stat Med. 2002;21: 1331–1335. pmid:12111881
  40. Hodkinson DJ, Krause K, Khawaja N, Renton TF, Huggins JP, Vennart W, et al. Quantifying the test-retest reliability of cerebral blood flow measurements in a clinical model of on-going post-surgical pain: A study using pseudo-continuous arterial spin labelling. NeuroImage Clin. 2013;3: 301–310. pmid:24143296
  41. Bartlett JW, Frost C. Reliability, repeatability and reproducibility: Analysis of measurement errors in continuous variables. Ultrasound Obstet Gynecol. 2008;31: 466–475. pmid:18306169
  42. Echegaray S, Bakr S, Rubin DL, Napel S. Quantitative Image Feature Engine (QIFE): An Open-Source, Modular Engine for 3D Quantitative Feature Extraction from Volumetric Medical Images. J Digit Imaging. 2017; 1–12.
  43. Zhang Z, Yang J, Ho A, Jiang W, Logan J, Wang X, et al. A predictive model for distinguishing radiation necrosis from tumour progression after gamma knife radiosurgery based on radiomic features from MR images. Eur Radiol. 2018; pmid:29178031