Automated clear cell renal carcinoma grade classification with prognostic significance

We developed an automated 2-tiered Fuhrman’s grading system for clear cell renal cell carcinoma (ccRCC). Whole slide images (WSI) and clinical data were retrieved for 395 The Cancer Genome Atlas (TCGA) ccRCC cases. Pathologist 1 reviewed and selected regions of interest (ROIs). Nuclear segmentation was performed. Quantitative morphological, intensity, and texture features (n = 72) were extracted. Features associated with grade were identified by constructing a Lasso model using data from cases with concordant 2-tiered Fuhrman’s grades between TCGA and Pathologist 1 (training set n = 235; held-out test set n = 42). Discordant cases (n = 118) were additionally reviewed by Pathologist 2. A Cox proportional hazards model evaluated the prognostic efficacy of the predicted grades in an extended test set, which was created by combining the test set and discordant cases (n = 160). The Lasso model consisted of 26 features and predicted grade with 84.6% sensitivity and 81.3% specificity in the test set. In the extended test set, predicted grade was significantly associated with overall survival after adjusting for age and gender (Hazard Ratio 2.05; 95% CI 1.21–3.47); manual grades were not prognostic. Future work can adapt our computational system to predict WHO/ISUP grades and validate this system on other ccRCC cohorts.


Introduction
Clear cell renal cell carcinoma (ccRCC) is the most common malignant tumor of epithelial origin in the kidney [1]. For over 30 years, ccRCC was graded using the 4-tiered Fuhrman nuclear grading system, which incorporates nuclear size, nucleolar prominence, and nuclear membrane irregularities. Diagnostic challenges can occur with the presence of other morphological features such as sarcomatoid or spindle cell pattern, when higher grade ccRCC shows more eosinophilic staining in the cytoplasm, or when other renal cancer histologic types (e.g. papillary RCC type 1 and chromophobe RCC) exhibit clear cytoplasm [2,3]. The correct classification of ccRCC grade and stage is important for guiding clinical management, molecular-based therapies, and prognosis [4,5]. Fuhrman grade is widely accepted as a prognostic factor despite mediocre inter-observer agreement [6,7]. To improve inter-observer agreement, simplified 2- or 3-tiered grading systems have been proposed. These simplified systems appear to retain prognostic ability similar to that of 4-tiered systems [8,9]. Recently, a new nuclear/nucleolar grading system, known as the World Health Organization (WHO)/International Society of Urological Pathology (ISUP) Grading Classification for RCC, was introduced [10]. Technological advances have enabled computational pathology to discover novel histomics features from whole slide images (WSIs) that may add diagnostic and/or prognostic information [11][12][13]. Computational pathology techniques can analyze cancer WSIs [14][15][16], including the detection of malignant RCC cells [17]. In this study, we developed an automated grading system to predict 2-tiered Fuhrman grade using ccRCC WSIs from The Cancer Genome Atlas (TCGA). Our specific aims were to establish a computational pipeline to extract nuclei histomics features, develop a model to predict 2-tiered ccRCC grade, and evaluate the prognostic efficacy of computer-predicted grades.

Cases and grade assignment
TCGA ccRCC clinical data, including Fuhrman's grade (accessed June 2017), and hematoxylin and eosin (H&E) WSIs were retrieved for 395 cases [18,19]. TCGA ccRCC cases were contributed by seven participating medical centers. The TCGA Fuhrman's grade for each case is the consensus of at least two pathologists from the case's medical center. In order to identify tumor areas on each diagnostic WSI (i.e., regions of interest (ROIs)) for this computational pathology study, Pathologist 1 reviewed each WSI, identified an average of five ROIs for each case (Fig 1), and assigned a Fuhrman grade of 1 to 4 for each ROI. The highest grade among all the ROIs was the designated grade. Thus, each patient had two assigned grades: "TCGA grade" and "Grade by Pathologist 1". TCGA and Pathologist 1 grades were re-stratified into the 2-tiered grading system: low (grades 1 and 2) and high (grades 3 and 4).
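The case-level grade assignment described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the function names and the example ROI grades are hypothetical.

```python
def two_tier(grade):
    """Collapse a 4-tiered Fuhrman grade into 'low' (1-2) or 'high' (3-4)."""
    if grade not in (1, 2, 3, 4):
        raise ValueError(f"unexpected Fuhrman grade: {grade}")
    return "low" if grade <= 2 else "high"


def case_grade(roi_grades):
    """The case-level grade is the highest grade among the case's ROIs."""
    return max(roi_grades)


# Hypothetical ROI grades for a single case (not taken from the study data).
roi_grades = [2, 2, 3, 1, 2]
designated = case_grade(roi_grades)
print(designated, two_tier(designated))
```

Because the highest ROI grade defines the case, a single high-grade ROI is enough to place a case in the high tier.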

Image processing and nuclei segmentation
ROIs (n = 1855) from 395 WSIs were extracted and split into 2000 pixel by 2000 pixel patches (Fig 2). Nuclei segmentation was performed using Fiji (ImageJ, National Institutes of Health) [20], following our previously published workflow [14]. H&E patches were converted from the Red, Green, and Blue (RGB) color space to the Hue, Saturation, and Value (HSV) color space (i.e., binary patches; Fig 3). A nonlinear mapping approach was applied as preprocessing to handle inconsistency in H&E staining [21]. The nuclei segmentation method consists of two steps: adaptive thresholding in each HSV color channel to separate nuclei regions from the background, and marker-controlled watershed-based nuclei segmentation to separate touching and overlapping nuclei. We further applied morphological operations to fine-tune the segmentation of nuclei. Extracted nuclei with an area of less than 200 pixels or greater than 2000 pixels were excluded to improve the specificity of nuclear detection [14].

2D histomics feature extraction
For each patch, 72 nuclei histomics features were extracted: nine morphological features, 15 intensity-based features, and 48 texture-based features. Morphological features describe the shape and size variation of nuclei. Intensity features (first order statistical features) describe the distribution of color variation in the nucleus. Three color channels were analyzed: lightness from HSV color space, lightness from Lab color space, and the Hematoxylin channel from H&E. Texture features describe the patterns and texture of pixel values. Two types of second order statistical features were computed: co-occurrence based features (n = 8) and run length based features (n = 8). Co-occurrence based features include correlation, cluster shade, cluster prominence, energy, entropy, Haralick correlation, inertia, and inverse difference moment [23]. Run length based features include gray-level non-uniformity, run-length non-uniformity, low and high gray-level run emphasis, short run low and high gray-level emphasis, and long run low and high gray-level emphasis [24]. Likewise, texture features were extracted from the three selected color channels, resulting in a total of 48 texture features (16 per channel). Feature formulas have been previously described [14,25].
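To make the co-occurrence based features concrete, the sketch below builds a gray-level co-occurrence matrix for a single pixel offset and derives four of the eight listed features using their standard definitions. It is an illustration only: the offset, the number of gray levels, the base-2 logarithm for entropy, and the toy image are assumptions, and the formulas actually used in the study are those given in [14,25].

```python
import math


def glcm(image, levels, offset=(0, 1)):
    """Normalized co-occurrence matrix for one (row, col) pixel offset."""
    dr, dc = offset
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]


def texture_features(p):
    """Four standard co-occurrence features from a normalized matrix p."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        # Base-2 logarithm is an assumption of this sketch.
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "inertia": sum((i - j) ** 2 * v for i, j, v in cells),
        "inverse_difference_moment": sum(
            v / (1 + (i - j) ** 2) for i, j, v in cells
        ),
    }


# Toy 2 x 2 image with two gray levels (hypothetical, not study data).
toy_image = [[0, 0], [1, 1]]
print(texture_features(glcm(toy_image, levels=2)))
```

A uniform region concentrates the matrix on its diagonal, which gives zero inertia and a high inverse difference moment; coarse or granular chromatin spreads mass off the diagonal.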

Data summarization and selection of representative ROI
Data extracted at the patch level were summarized to the ROI level by calculating the median and median absolute deviation (MAD) of each of the 72 features (i.e., 144 summarized features). Some cases had multiple ROIs annotated with the highest grade. Thus, one ROI among the highest grade ROIs was selected to represent the case. To do so, the median of all ROIs with the highest grade was calculated, and the ROI with the smallest Euclidean distance to the calculated median was chosen (Fig 4).
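The summarization and ROI selection steps can be sketched as follows. This is a hedged illustration, not the study's implementation: the function names are hypothetical, the MAD is computed without a scaling constant (the text does not say whether one was applied), and the example vectors have two features instead of 144.

```python
import math
import statistics


def median_and_mad(values):
    """Median and unscaled median absolute deviation of patch-level values."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return med, mad


def representative_roi(roi_features):
    """Index of the ROI nearest (Euclidean) to the feature-wise median."""
    center = [statistics.median(column) for column in zip(*roi_features)]
    distances = [math.dist(roi, center) for roi in roi_features]
    return distances.index(min(distances))


# Hypothetical patch-level values of one feature within one ROI.
print(median_and_mad([1, 2, 3, 4, 100]))

# Hypothetical ROI-level feature vectors for three highest-grade ROIs.
rois = [[1.0, 10.0], [2.0, 12.0], [9.0, 11.0]]
print(representative_roi(rois))
```

Using the median and MAD keeps a few badly segmented patches from dominating the ROI summary, and choosing the ROI nearest the median picks a typical rather than extreme region.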

Developing the machine learning model to predict grade
Cases with concordant 2-tiered grade by TCGA and Pathologist 1 (n = 277) were used to develop the automated 2-tiered grading system. Concordant cases were split into a training set (n = 235; 85%) and held-out test set (n = 42; 15%; Fig 5). The sampling package in R was used to select the 42 patients in the held-out test set based on grade, age, gender, and stage, ensuring that they were representative of the concordant cases. Histomics features were z-scored. Seven machine learning classification methods were explored to classify ccRCC cases into either low or high grade using nuclei histomics features [26,27] (Fig 5). All methods achieved similar area under the receiver-operator characteristic curves (AUC ROC; S1 Table). Lasso regression was the top performing method with a built-in feature selection capability. Lasso regression is a type of linear regression with an L1 regularization penalty, which has the effect of shrinking the regression weights of the least predictive features to 0, thereby creating simpler models that are less prone to overfitting [28]. In the Lasso model, a hyperparameter λ determines the amount of the L1 regularization penalty applied. We decided to use Lasso to build our final classification model because it is computationally efficient and more interpretable than other machine learning methods such as deep learning. Lasso regression with its optimal hyperparameter selected the final list of histomics features most associated with grade. We evaluated its performance on the held-out test set.
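The way an L1 penalty drives weak coefficients to exactly zero can be illustrated with a small coordinate descent solver. This sketch minimizes the squared-error Lasso objective on standardized features and is not the study's model, which was fitted with glmnet in R; the function names, the toy data, and the value of λ are assumptions chosen for illustration.

```python
def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimize (1/(2n)) * ||y - Xb||^2 + lam * ||b||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0
            for i in range(n):
                # Partial residual: the current fit with feature j left out.
                fit_without_j = sum(X[i][k] * b[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam) / scale
    return b


# Toy data: y depends strongly on feature 0 and only weakly on feature 1.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.05, -1.95, 1.95, -2.05]
print(lasso_coordinate_descent(X, y, lam=0.1))
```

In this toy example the weak feature's coefficient is set to exactly zero while the strong one is only slightly shrunk, which is the feature selection behavior the paragraph describes.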

Survival analyses
The Lasso model was applied to predict the grade of the previously held-out test set (n = 42) and cases with discordant grades (n = 118). These 160 cases were combined to create an extended test set to evaluate the prognostic capability (i.e., overall survival [OS]) of our predicted grade using crude and adjusted Cox proportional hazards models. The adjusted Cox models include patient age, gender, and cancer stage. TCGA treatment information was missing from 69% of the cases and thus was not included in the adjusted Cox models. Kaplan-Meier curves were plotted to visualize differences in survival between grade groups (survival package, R) [29].
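For readers unfamiliar with Kaplan-Meier curves, the product-limit estimate behind them can be sketched in a few lines. This is a plain illustration, not the survival package used in the study; the function name and the follow-up times are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    events uses 1 for an observed death and 0 for a censored follow-up.
    """
    survival = 1.0
    curve = []
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical follow-up times (years) and event indicators for four cases.
times = [1, 2, 2, 3]
events = [1, 1, 0, 1]
for t, s in kaplan_meier(times, events):
    print(t, round(s, 3))
```

Computing one such curve for cases predicted as low grade and another for cases predicted as high grade gives the visual comparison that the Cox model then quantifies as a hazard ratio.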

Additional pathological review for discordant cases
The grades provided by TCGA may be assessed from ROIs other than the representative ROIs selected in our study. To obtain a fairer comparison between manual and predicted grades among the discordant cases, the representative ROIs were additionally reviewed by Pathologist 2.

Statistical analyses
Confusion matrices were used to determine the concordance of the 2-tiered and 4-tiered grades between two raters [27]. Inter-rater reliability among three raters was evaluated using Fleiss' kappa. Boxplots were created using ggplot2 version 2.2.1. The nine morphological features were compared across 2-tiered and 4-tiered grades using the Mann-Whitney U or Kruskal-Wallis test, respectively. All tests of statistical significance were two-sided. Statistical significance was achieved when the p-value was <0.05 or when the false discovery rate (FDR) was <0.05. All analyses were conducted using R version 3.4.0.
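Fleiss' kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below follows the standard formula; the function name and the toy rating tables are assumptions for illustration.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa; counts[i][j] is the number of raters who assigned
    case i to category j, with the same number of raters for every case."""
    n_cases = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Per-case agreement: proportion of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_cases
    # Chance agreement from the overall category proportions.
    p_j = [
        sum(row[j] for row in counts) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Three raters, two categories (low, high); toy data, not study data.
print(fleiss_kappa([[3, 0], [0, 3]]))  # every rater agrees on every case
print(fleiss_kappa([[2, 1], [1, 2]]))  # every case is a 2-versus-1 split
```

The second toy table yields a kappa of about -0.33. When two of three raters are known to disagree on a binary grade, every case is necessarily a 2-versus-1 split, so negative values of this size are expected in such a subset.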

Results
The majority of TCGA ccRCC cases were white males. Most participants were between the ages of 50 to 69 and had stage I disease ( and shape (i.e., roundness, elongation, flatness and major axis of ellipse fit) were significantly larger and less spherical in higher grades (FDR<0.05; S2 Fig and S2 Table).

Lasso classification model
The final Lasso model with the optimal λ at 0.0101 had an average ROC AUC of 0.84. The model predicted 2-tiered ccRCC grade with 83.3% accuracy (95% confidence interval (CI) 0.69-0.93), 84.6% sensitivity, 81.3% specificity, 18.8% false positive rate, and 15.4% false negative rate in the test set. The agreement between predicted and manual grades was good (frequency of agreement = 0.83, Cohen's kappa = 0.65). The 18 unique histomics features associated with ccRCC 2-tiered grade are in Table 2.
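The reported test-set metrics are mutually consistent with a single 2 x 2 confusion matrix. The counts below (22 true positives, 4 false negatives, 13 true negatives, 3 false positives) are inferred from the reported percentages and the test-set size of 42; they are not stated in the text, and which grade was treated as the positive class is also an assumption of this sketch.

```python
def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity, and Cohen's kappa from counts."""
    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance from the row and column totals.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    kappa = (accuracy - p_e) / (1 - p_e)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": 1 - specificity,
        "false_negative_rate": 1 - sensitivity,
        "kappa": kappa,
    }


# Counts inferred from the reported percentages (assumption, see lead-in).
for name, value in binary_metrics(tp=22, fn=4, tn=13, fp=3).items():
    print(f"{name}: {value:.3f}")
```

With these counts the frequency of agreement equals the accuracy (35 of 42 cases), and Cohen's kappa corrects that figure for the agreement expected by chance.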

Prognostic efficacy of predicted grades
There were 65 death events out of 160 cases in the extended test set. Cases predicted as high grade had significantly poorer OS compared to low grade (Fig 6).

Comparing predicted grade with TCGA and Pathologist 1
Among the concordant cases, 2-tiered manual grades were significantly associated with OS (Fig 7A; Table 3). Predicted grade for concordant cases was not evaluated, as the majority of the concordant cases were part of the training set used to build the Lasso model. Within the discordant cases, neither the grade provided by TCGA nor that provided by Pathologist 1 was associated with OS (Fig 7B and 7C). Predicted grade was significantly associated with OS in the crude model (HR 2.01; 95% CI 1.14-3.54) and when adjusted for age and gender (HR 2.31; 95% CI 1.26-4.24). The association of predicted grade and OS among the discordant cases was attenuated when stage was included in the adjusted model (HR 1.83; 95% CI 0.98-3.41; Fig 7D; Table 3).

Additional pathological review for discordant cases
There was no effective agreement between TCGA, Pathologist 1, and Pathologist 2 among the discordant cases (4-tiered grading: Fleiss' kappa = -0.23; 2-tiered grading: Fleiss' kappa = -0.33). When comparing between TCGA and Pathologist 2, there was no effective agreement (4-tiered grading: frequency of agreement = 0.33, Cohen's kappa = -0.14; 2-tiered grading: frequency of agreement = 0.39, Cohen's kappa = -0.19). Despite assessing the same representative ROIs, the agreement between Pathologist 1 and Pathologist 2 was poor for 4-tiered grading (frequency of agreement = 0.48, Cohen's kappa = 0.11) and slightly improved for 2-tiered grading (frequency of agreement = 0.61, Cohen's kappa = 0.20). Discordant cases between Pathologist 1 and Pathologist 2 were more likely to be assigned as high grade by Pathologist 2. Contingency tables between TCGA, Pathologist 1, and Pathologist 2 are in S3 Table. Grades assigned by Pathologist 2 were not associated with OS (Table 3). Further analyses were explored to determine if the incorporation of manual grade by Pathologist 2 may improve prognostic efficacy. The grades for discordant cases were re-assigned as low or high by using the most frequent grade among TCGA, Pathologist 1, and Pathologist 2, and among Pathologist 1, Pathologist 2, and the predicted grade (i.e., integrating manual and computer). Re-assigned grades were not associated with OS (p>0.05; S4 Table). Next, these cases were further divided into cases that did and did not agree between Pathologist 1 and Pathologist 2. Manual grades were not associated with OS in cases that did and did not agree between Pathologist 1 and Pathologist 2 (p>0.05; Table 4). Predicted grade was only associated with OS in cases that agreed between Pathologist 1 and Pathologist 2 (Table 4). S1 File contains the manual and predicted grades of these ccRCC cases.

Discussion
This study utilized the large and diverse TCGA ccRCC dataset to extract quantitative histomics features from ROIs and applied a Lasso regression model to develop an automated 2-tiered grading system using 18 unique features (26 total features), which achieved an ROC AUC of 0.84. Using discordant cases as an independent validation set, our data-driven system stratified ccRCC cases into low and high grades that were significantly associated with OS. The prognostic efficacy of predicted grades in the discordant cases outperformed the manual grades assessed by TCGA, Pathologist 1, and Pathologist 2. This proof-of-concept study demonstrated the potential of computational pathology to predict ccRCC grades via a more objective and quantitative pipeline, and addressed the issue of grade disagreement commonly encountered between pathologists. The grading of ccRCC is highly challenging and subjective, but the accurate assignment of ccRCC grade is important for clinical care and follow-up. Research groups, specifically Yeh and colleagues [30], Kruk and colleagues [31], and Holdbrook and colleagues [32], have been actively developing computational pathology systems to provide objectivity and/or automate ccRCC grading. Each computational system is highly unique, with differences in image processing, feature extraction, classification method, and whether 2- or 4-tiered grades are predicted. We utilized an unbiased data-driven approach where we extracted a set of high dimensional nuclear features (n = 144), and used Lasso, a machine learning-based method, to build our final predictive model. This is different from Yeh et al. [30], who only evaluated 1 feature (i.e., maximum nuclei size) to predict 2-tiered grade; Kruk et al. [31], who pre-selected features (out of 31 features) prior to building the final model to predict 4-tiered grade; and Holdbrook et al. [32], who used up to 4 concatenated feature vectors to calculate fraction value scores prior to classification into low or high grade.
In addition, our Lasso regression allowed us to identify the 18 unique histomics biomarkers in our final predictive model, while the features in the models by Kruk et al. [31] and Holdbrook et al. [32] are unknown. Our 18 features provided information about the nucleus, the uneven distribution of nucleus staining, and the granularity of chromatin and nucleoli, highlighting that the addition of computer-extracted texture and intensity-related features to traditional pathology morphological features can improve the ability to predict ccRCC grade. Both our study and Holdbrook et al. [32] demonstrated that predicted grade had prognostic significance, whilst the studies by Kruk et al. [31] and Yeh et al. [30] did not report whether their grade was associated with prognosis. Lastly, our system was trained using a much larger and more diverse dataset of 277 cases from seven TCGA participating institutions, and we validated our system using 160 cases. This is in contrast to those three studies, which used small numbers for training (n = 38 to 70) and validation (n = 6 to 62) and obtained their cases from a single institution. Collectively, our work and others are substantial efforts to improve ccRCC grading. Each computational method will require further refinement and validation before its clinical utility can be determined.
Fig 6. Cases predicted as high grade have significantly poorer overall survival rates compared to cases predicted as low grade in the extended test set (hazard ratio 2.07, 95% confidence interval of 1.25-3.43, p<0.01; 65 death events among 160 cases). The shaded areas reflect the 95% confidence interval for high or low grade.
Each TCGA grade is the consensus of at least two pathologists. One reason for grade disagreement between TCGA and Pathologist 1 may be that TCGA pathologists assessed different ROIs than the representative ROIs selected in our study. However, even when reviewing the same ROIs for discordant cases, there was very poor agreement between Pathologist 1 and Pathologist 2, reiterating the challenges of ccRCC grading. These discordant cases could be more diagnostically challenging or ambiguous. Since manual grades for concordant cases were significantly associated with OS, it could be argued that concordant cases were diagnostically less challenging where the tumors were overwhelmingly of a low or high grade, and that our model was trained using more homogeneous ROIs. Predicted grades for discordant cases were significantly associated with survival, in contrast to manual assessments or using the most frequent manual grade. Therefore, our automated system has the ability to diagnose a range of ccRCC cases with consistency and objectivity. In practical application, such a computational system could be useful as a tool to provide a second opinion for pathologists in diagnostically ambiguous cases.
Fig 7. See Table 3 for hazard ratios and 95% confidence intervals for each analysis. The shaded areas reflect the 95% confidence interval for high or low grade. https://doi.org/10.1371/journal.pone.0222641.g007
Our study has some limitations. We did not use the WHO/ISUP grading system because the TCGA participating medical centers used the Fuhrman's system. However, since our computer system was constructed based on computer-extracted nuclear features, it can be adapted in the future to predict WHO/ISUP grades, which also utilize nuclei/nucleoli features. There are inherent limitations of reviewing cases using WSIs. Accurate grading may be hindered by the quality of WSIs and the lack of the Z-axis [33]. Our study reviewed diagnostic WSIs and analyzed manually selected ROIs that may not be representative of the entire tumor. For future work, automating ROI detection and grade prediction will allow the review of multiple tumor sections more efficiently. Lastly, our nuclei segmentation relied on conventional image analysis techniques. While qualitative evaluation of the segmentation results revealed that our image processing pipeline produced reasonably good results, the nuclei segmentation may not be optimal in more challenging cases. A solution is to employ deep learning based techniques to improve nuclei segmentation in future studies [30,34,35].

Conclusions
We developed an automated 2-tiered Fuhrman's grading system with prognostic significance. Our system demonstrated the potential of computational pathology to improve the reproducibility in the diagnosis and grading of ccRCC, and to aid the clinical management of ccRCC patients. Future work may include adapting our computational system to predict WHO/ISUP grades; validating our system on other ccRCC cohorts; using deep learning methods to detect ROIs, segment nuclei and predict grade; and exploring whether histomics features can predict prognosis independently of grade. This work is one step toward developing an artificial intelligence system for diagnostic pathology.
Supporting information
S1 Table. The average area under the receiver-operator characteristic curves (AUC ROC) for each machine learning method using the training set after 100 iterations of random 10% hold out. These methods were implemented using the glmnet and caret packages in R.