Association of visual and quantitative heterogeneity of 18F-FDG PET images with treatment response in locally advanced rectal cancer: A feasibility study

Background and purpose Few tools are available to predict tumor response to treatment. This retrospective study assesses visual and automatic heterogeneity from 18F-FDG PET images as predictors of response in locally advanced rectal cancer. Methods This study included 37 LARC patients who underwent an 18F-FDG PET before their neoadjuvant therapy. One expert segmented the tumor from the PET images. Blinded to the patient´s outcome, two experts established by consensus a visual score for tumor heterogeneity. Metabolic and texture parameters were extracted from the tumor area. Multivariate binary logistic regression with cross-validation was used to estimate the clinical relevance of these features. Area under the ROC Curve (AUC) of each model was evaluated. Histopathological tumor regression grade was the ground-truth. Results Standard metabolic parameters could discriminate 50.1% of responders (AUC = 0.685). Visual heterogeneity classification showed correct assessment of the response in 75.4% of the sample (AUC = 0.759). Automatic quantitative evaluation of heterogeneity achieved a similar predictive capacity (73.1%, AUC = 0.815). Conclusion A response prediction model in LARC based on tumor heterogeneity (assessed either visually or with automatic texture measurement) shows that texture features may complement the information provided by the metabolic parameters and increase prediction accuracy.


Introduction
Advances in disease diagnosis and treatment have improved the outcome of Locally Advanced Rectal Cancer (LARC). Nonetheless, most therapeutic decisions are still based on the Tumor, Node and Metastasis staging system (TNM), together with the distal and circumferential resection margin [1][2][3][4][5][6]. LARC tumors are a highly diverse group of lesions that may exhibit different responses to the same treatment, even in the same stage [7]. Therefore, early identification of responders to neoadjuvant treatment (NAT) could facilitate the development of tailored cancer therapies [4].
Medical imaging tools such as metabolic 18 F-Fluorodeoxyglucose (FDG) PET imaging have become crucial in oncology for staging and treatment evaluation [8,9]. Over the past decades, 18 F-FDG PET semi-quantitative metabolic activity descriptors derived from the Standardized Uptake Value (SUV), such as SUVmean and SUVpeak have been clinically used due to their prognostic ability [10][11][12][13][14][15]. More recent research recalls the interest of other parameters such as Total Lesion Glycolysis (TLG) and Metabolic Tumor Volume (MTV). These metrics provide information about metabolic activity in the whole volume. Indeed, TLG and MTV prognosis accuracy has been reported to be significantly higher than that of SUV values [16][17][18].
Nevertheless, the prognostic capacity of these metabolic features, even when combined with volume descriptors, is very limited. Over the last years, tumor heterogeneity has shown to be an additional source of information related with both prognosis and survival. It can be hypothesized that heterogenous phenotypes in the macroscopic scale can be related to underlying tumor pathophysiology and thus capture tumor aggressiveness [19].
Recently, radiomics has emerged as a way of quantifying tumor heterogeneity captured by radiological scans. The limited size of the available datasets for these purposes presents a limitation for deep learning approaches. Indeed, the approach of radiomics is still primarily based on handcrafted features used for regular machine learning predictive modelling [20]. Different radiomic based Texture Analysis (TA) approaches have attempted to objectively capture heterogeneity information from 18 F-FDG PET imaging studies [21][22][23][24][25][26]. Several flaws affect the corroboration of texture analysis as a valuable predictor of therapeutic response, as most of the published studies up to date did not perform multivariate analysis of the texture analysis features nor performed a solid cross-validation [27]. Moreover, the quantitative texture features obtained are complicatedly related to both the pathophysiological tumor processes and the visual appearance of the images. This impairs a straightforward clinical usage of heterogeneity features.
The aim of this multi-disciplinary study was to retrospectively evaluate the predictive capacity of visually-and quantitatively assessed texture features in comparison with standard metabolic parameters. A visual assessment scale of tumor texture and an open-source and carefully revised workflow to automatize the texture analysis were introduced. Moreover, a multivariate analysis combined with cross-validation was applied to generate robust results and their clinical value was assessed.

Patients
Thirty-seven LARC patients, either cT3-4 or cN+ according to the American Joint Committee on Cancer (AJCC), were selected. The inclusion criteria, staging and follow-up have been reported elsewhere [28]. Patients underwent an 18 F-FDG PET/CT study before their treatment.
The study followed the recommendations of the Helsinki declaration and was approved by the Institutional Ethics Committee from Gregorio Marañon hospital. Signed informed consent from all patients was obtained and all images were anonymized.

Treatment
All the patients followed the following treatment regime: Neoadjuvant chemotherapy. Consisted in two FOLFOX cycles every two weeks. Each cycle consisted in Oxaliplatin 85mg/m 2 on day one, intravenous Leucovorin 200mg/m 2 on days one and two and intravenous 5-FU 400mg/m 2 on days one and two.
Chemoradiotherapy. Two weeks after both cycles of chemotherapy, patients had five to six weeks of chemoradiotherapy (CRT). Pelvic radiotherapy was performed at a cumulative dose of 45-50.4 Gy (1.8 Gy daily fractions). Oral chemotherapy consisted in Tegafur at 1,200 mg/day on days one to four. Radiotherapy conformal three-dimensional plans followed the International Commission on Radiation Units and Measurements (ICRU) specifications and were delivered with 15 MV photon beams.
Surgery. Six weeks after CRT, resection was performed. Six senior surgeons participated. No strict criteria for surgical procedure was present but appropriateness of the safe distal margin distances and total mesorectal excision was mandatory.
Postoperative chemotherapy. Adjuvant chemotherapy was selected consisting in either two FOLFOX cycles every two weeks or four to six cycles every four weeks of an intravenous 5-FU-370-425 mg/m 2 and Leucovorin 20-25mg/m 2 /day in days one to five.

Evaluation of treatment outcome
One pathologist examined all the resected specimens after NACT, CRT and surgery and evaluated the changes suffered after treatment following recommendations by Quirke et al. [30,31]. Specimens were staged according to the sixth edition of AJCC classification (ypTNM). The response to NAT was classified according to the tumor regression grade (TRG) scale [32]: TRG 0, no response; TRG 1, residual cancer outgrowing fibrosis; TRG 2, fibrosis outgrowing residual cancer cells; TRG 3, presence of residual cancer cells; and TRG 4, complete histopathological response. Applying this method, tumors were classified into NAT responders (TRG 3-4) or non-responders (TRG 0-2).

PET/CT image acquisition protocol
Patients underwent PET/CT imaging before any of their treatments started. All the PET studies were obtained in the Nuclear Medicine department from Clinical La Luz de Madrid using a dedicated Philips Gemini TF model (standard bore, 70 cm) PET/CT simulator with an axial field of view = 18 cm (reconstructed field of view: 25,6,57.6 or 67,6 cm), and spatial resolution = 4.7 mm full-width half maximum. The scanner was equipped with a high light output scintillator (LYSO) that has high sensitivity, improved energy resolution and achieves a faster system timing resolution of approximately 600 ps, enabling better time of flight measurement.
3D PET/CT scans with 16-slice CT (slice thickness 3mm, reconstruction slice thickness 1mm and interval 1.5 mm) were acquired through the pelvis from the anal verge to the iliac crests in all patients. CT data was not acquired using a low-dose protocol. A rectal cancer CT scan protocol was used for volumetric analysis [33]. The contrast agent was not administered in these CT acquisitions. There are no differences in CT acquisitions among patients.
Whole-body PET emission images were acquired 45 min after intravenous injection of 5 MBq of FDG per kilogram of body weight. After radiotracer injection, patients rested and were orally hydrated (>0.5 L of water). Patient preparation included fasting for at least 6 h before the scan. In the morning of the scan day, patients were given a cleansing enema. PET data were normalized (to correct the system response) and corrected for attenuation, scatter radiation, random coincidences, dead time and decay. PET studies were normalized with respect to the blood glucose level measured before FDG administration [22,23]. Reconstruction was performed using weighted ordered subsets expectation maximization (2 iterations and 16 subsets) followed by the application of a smoothing filter (0.5 Hanning) and trilinear interpolation. The PET scans had a voxel size of 4x4x4 mm and a matrix size of 144x144x87 voxels.

PET data analysis
The processing workflow of this study is summarized in Fig 1. One experienced nuclear medicine specialist, blinded to the pathological status of the patients, obtained a Volume of Interest (VOI) by segmenting the tumor with a threshold of 40% of the maximum activity. Features related to tumor metabolism were calculated using 3-D Slicer open-source software Version 4.0.0. Harvard University, Cambridge, (MA) [34] and the PET-indiC module (Ethan Ulrich, University of Iowa). For verification, a second nuclear medicine specialist examined the VOIs based on the abovementioned scale and agreed on the classification performed. Table 1 presents the complete list of metabolic parameters measured, classified into two sets: 1) standard clinical metabolic features (SUV Max , SUV Peak , Metabolic Tumor Volume  Table 1; 3a. The texture features-First order, local (Gray Level Co-occurrence Matrix (GLCM)), regional (Gray Level Run Length Matrix (GLRLM)).
(MTV) and Total Lesion Glycolysis (TLG)), referred to as CLINICAL, and 2) the complete set of metabolic variables, referred to as ALLMET. Characteristics referring to quartiles in glycolysis are the lesion glycolysis calculated from the respective quarter of the grayscale range of the tumour region.

Heterogeneity visual assessment
The same nuclear medicine specialist also classified all tumors in PET images according to two visual scales by examining the whole tumour volume (Fig 2). The 'heterogeneity' visual scale defined a zero (0) score when tumors were homogeneous in appearance or a one score (1) otherwise. The 'pattern' visual scale assigned a zero (0) score to nodular tumors and one (1) to multinodular or cavitated lesions. A second nuclear medicine expert examined the visual scoring and both found consensus on the final classification.

Heterogeneity automatic assessment
Tumor VOIs were discretized to 64 gray levels as a way of normalizing the data, allowing interpatient comparison using the following equation [23]: where V is the intensity of the resampled image, I represents the intensity of the original image and O is the set of voxels inside the VOI. The range of 64 gray values has been previously identified as a tradeoff between noise removal and information loss [23]. Different image texture definitions were used to obtain the heterogeneity automatic analysis ( Table 1). The complete set of automatic texture descriptors will be referred to as TEXTURE parameters. In a first step, global texture features were extracted. They consist on a set of first and higher order statistics extracted from the gray-level histogram allowing the quantification of overall global changes in intensity within the VOI.
Secondly, intensity variations were studied with second-order or local texture features by using the gray level co-occurrence matrix (GLCM) [35]. Six statistics explaining local intensity variations were selected from the 21 originally described [35], based on previous literature to define the smallest set of GLCM features able to capture texture information [35][36][37][38]. To capture changes in local intensity beyond direct neighbors and reduce noise in the measurements [38,39], these features were calculated in a patch-wise manner using square kernels of different sizes, selected after examining the images and estimating the distance between voxels that characterized the texture pattern: 1,3,5,7 and 9 pixels.
Finally, intensity changes were studied using third-order or regional texture features using the Gray Level Run Length Matrix. Ten different statistics capturing regional texture measures were obtained from this matrix.
All texture features were calculated for the whole tumour volume. The first-order and third-order texture features were calculated using the Heterogeneity-CAD module (Narayan, V. et al, Harvard Medical School), from 3D Slicer [40,41]. We used in-house developed software to obtain GLCM texture metrics with respect to tumor volume, according to previous guidelines [35]. The Python software is available upon request.

Immunohistochemistry staining and evaluation
A representative biopsy sample from each of the 37 patients was obtained for immunohistochemistry (IHC) procedures previous to the start of treatment. The standardization, preparation and staining was automatically done with a Dako Techmate device. Tumor sections were stained with commercially available monoclonal antibodies for key molecules in cancer. Some are involved in tumor growth, progression, proliferation, metastasis capacity and suppression (Namely, Vascular Endothelial Growth Factor Receptor-2 (VEGFR-2), Ki-67 protein, cyclooxygenase-2, E-cadherin and p-53 oncogene). Others are involved in cell apoptosis and growth (Namely, B-cell lymphoma-2 (BCL-2) and c-erb b2 oncogene).

Statistical analysis
Quantitative comparisons between responders and non-responders were carried out using the Mann-Whitney's U test for continuous variables and χ 2 test for discrete variables.
Correlation between features used for modelling is presented in S1 Fig. Stepwise multivariate binary logistic regression (Forward Wald's, p< 0,05 for feature inclusion) was used to assess the predictive ability of parameters extracted from pretreatment 18F-FDG PET regarding patient´s response to NAT. To better validate and avoid over-fitting, multivariate binary logistic models were evaluated using a k-fold cross-validation (k = 5) where 80% of the dataset was used as training data and the remaining 20% was used as validation set. Mean accuracy and mean area under the ROC curve (AUC) from all the runs were used to assess the accuracy in the prediction of response, and 95% confidence intervals are reported in both cases.
The relationship between PET quantitative parameters and IHC biomarkers was assessed by means of Pearson correlation coefficient.
For evaluating the correlation between visual scales and automatically computed texture metrics, principal components were extracted from the automatically computed features for each patient scan. Contribution of each individual feature into the principal components can be found in S2 Fig. Principal components with eigenvalue greater than one were used. The correlation of the principal components with the visual scale was evaluated with the Spearman non-parametric correlation test.

Baseline patient and tumour characteristics
No significant differences were found between treatment responders and non-responders in terms of clinical (age, gender, time between first PET scan and first NACT session, distance to anal verge and clinical staging risk group) and IHC characteristics (Table 2). Additionally, Fig  3 presents the most representative slice of each patient-selected to be the one containing SUVmax of the resampled scans used for the analysis.

Visual scales for tumor response assessment
The comparison between responders and non-responders in the visual heterogeneity and pattern scales (Table 3) yielded statistically significant differences between groups (χ 2 = 11.926, p = 0.003 in the case of visual heterogeneity and χ 2 = 7,423, p = 0.013 in the case of visual pattern, degrees of freedom (dof) were 35 in both cases). The 'heterogeneity' visual scale is a Accordingly, prognostic ability of the visual heterogeneity and pattern scales were statistically significant in the univariate binary logistic regression (p = 0.003 and p = 0.015, respectively). After cross validation, accuracy of prediction was 75,437±0,881% with an AUC of 0,759±0,009 for the visual heterogeneity scale and 69,268±0,890% with an AUC of 0,691±0,008 for the visual pattern scale.
When building a multivariate model to predict response including both visual scales, only visual heterogeneity remained statistically significant (p = 0.003). Furthermore, when the

Tumor response prediction using quantitative texture features
Responders and non-responders showed statistically significant differences for several metabolic and texture features (Table 2). Afterwards, we used multivariate binary logistic regression to study the response predictive ability of the metabolic and quantitative texture features. Multivariate binary logistic regression models were fed with the statistically significant factors that appeared for each set of variables (CLINICAL, ALLMET, TEXTURE, ALLMET-TEXTURE). Their corresponding ROC curves are shown in Fig 4. When fitting a model with the set of four metabolic features with reported prognosis capacity in previous literature (CLINICAL), only Total Lesion Glycolysis (TLG) resulted statistically significant (p = 0.0488). After cross validation, the model obtained an accurate prediction in 50,149±0,293% of the cohort, with an AUC of 0,685±0,010.

Correlation of quantitative PET parameters with biomarkers expression
To study the biological meaning of the PET quantitative parameters that remained significant in the tumor response prediction models above (TLG, Glycolysis Q1, GLCM Energy at distances of three, five and nine voxels and GLCM IDMN at distance of seven voxels), their

Correlation of visual and quantitative heterogeneity measurements
To study the relationship between the visual scales proposed and automatic texture, principal components were extracted from TEXTURE database. Ten principal components were obtained with a cumulative variance explained of 90,938%.The absolute value of the contribution of each radiomic feature to each principal component is presented in S2

Discussion
This study shows that tumor heterogeneity in 18F-FDG PET images can discriminate between histopathological responders and non-responders. This information can be of great interest when selecting the best approach for managing colorectal cancer patients as the treatment can be tailored accordingly. The correct identification of non-responders allows their NAT to be intensified. Also, the response prediction could guide optimization of the surgical approach by using less-aggressive alternatives, and even mild postoperative chemotherapy could be prescribed in these cases.

Predictive capacity of metabolic (SUV related) features
Uptake parameters from PET defining the tumor metabolism (SUV, MTV, and TLG) are the only features used clinically to evaluate the tumor aggressiveness and therapy effectiveness. The prediction capacity of these clinical parameters (CLINICAL) was analyzed to establish the reference level achieved. This reference was later compared to the prediction achieved with the new variables in order to address the relevance of texture parameters. Since results obtained for CLINICAL variables showed poor ability to predict response to NAT, additional parameters related to tumor uptake (ALLMET) were introduced, but only one variable (Glycolysis Q1) remained significant in the multivariate analysis. As shown in Table 2 and Fig 3, only patient information and PET uptake parameters show a poor ability to discriminate between responders and non-responders in this cohort.

Predictive capacity of texture features
Quantitative heterogeneity features (TEXTURE) seemed to outperform accuracy of metabolic descriptors (AUC of 0,815 and 0,694, respectively). It can be noted that the ability to predict the response is increased as compared with the reference parameters (CLINICAL). When texture is combined with metabolic features (ALLMET-TEXTURE), the AUC reaches 0,768. Given the confidence intervals obtained, the difference of prediction accuracy between using texture alone or combined with metabolic parameters is not significant. These results suggest that the use of texture features may be a promising approach to predict tumor response to NAT.
In the multivariate analysis, several local texture parameters-i.e., study of PET intensity differences in different neighborhoods capturing changes in uptake values of different localities of the tumour-showed significant association with tumor response. The parameters that were significantly associated with response in our model are consistent with others reported previously [27], although there is a high variability in the results obtained with local texture. Soussan et al. [21] and Tixier et al. [23] reported how GLCM features could predict tumor response to treatment in breast and esophageal cancer. Conversely, Lemarignier et al. [42] and Nakajo et al. [24] observed no relationship in the same types of cancer. This discrepancy can be due to a GLCM analysis [21,23,24,43] performed at only one-voxel distance, which is a parameter dominated by noise rather than by real intensity differences in this type of images [41,42]. In our work, the use of different distances chosen based on visual differences in intensities, GLCM characteristics showed higher ability to predict tumor response.
No significance was found neither in with global texture metrics (i.e gray-level histogram statistics capturing intensity changes across the whole lesion assuming tumour heterogeneity is well-mixed) nor with regional texture descriptors (i.e. Gray Level Run Length Matrix) in the multivariate analysis. These parameters have been reported to be associated with response and long-term outcome in several types of cancer. Tixier et al. [23] and Nakajo et al [24] concluded that regional texture descriptors showed better prognostic capacity in esophageal cancer than SUV parameters. Bundschuh et al. [26] reported that global texture features could assess response for patients with LARC.

Biological interpretation of the quantitative texture descriptors
One of the major concerns in radiomics resides in the biological meaning of the parameters used, as the physiological processes underlying texture analysis remain unclear [44]. In this line, we decided to study the correlation between quantitative features significant for response prediction and several key molecules in cancer.
It was shown how the texture features that are able to predict tumor response are significantly correlated with VEGFR and E-Cadherin. VEGFR expression has always been related with angiogenesis and vascular permeability, which are processes characteristic of more aggressive tumors [45]. Thus, this correlation seems to be coherent as new forming blood vessels create local spots and increase heterogeneity of PET images which can be captured by computer-vision quantitative textural features. Moreover, E-cadherin is associated with invasion and metastasis due to the detaching of cancerous cells from the epithelial lining [46]. The association of texture parameters with e-cadherin reinforces the relationship of local heterogeneity in PET with processes in the tumor vessels that may negatively impact tumor prognosis.
Regarding the relationship between because of the glucose metabolic basis of PET imaging, it is not surprising to find that metabolic features correlate with biomarkers related with tumor proliferation (Ki-67) [47] and growth (COX-2) [48]. Nevertheless, it is remarkable that in our series TLG, one of the widely clinically used metabolic parameters, is outperformed by Glycolysis Q1 both in prediction and in the relation with proliferation biomarkers. Glycolysis Q1 refers to the glycolysis calculated on the lower quartile of intensity values. Therefore this might suggest that regions with lower activity concentration-therefore higher Q1 -are related with tumours with lower proliferation rates.

Visual scores: Easy approach to clinical applicability of the findings
To our knowledge, one of the obstacles to use radiomic features clinically is the complicated relationship with visual appearance of tumors. Thus, we proposed and evaluated a visual classification of heterogeneity to bridge this gap.
Visual heterogeneity and pattern category showed significant association with response to treatment. When both visual scores were introduced, heterogeneity remained significant whereas pattern category did not. Visual scores were then combined with baseline metabolic parameters (CLINICAL). In this case, only heterogeneity scale remained significant in the multivariate model, further supporting the importance of heterogeneity for clinical stratification [19,23,49].
The correlation between the visual scores and quantitative metrics suggests that they are describing similar characteristics. This may aid in the usage of texture features in the clinical procedures as the mathematical texture descriptors can be better understood through their association with the visual score. Besides, this reinforces the necessity of introducing heterogeneity in the medical guidelines for cancer staging as it has clinical significance when evaluating a treatment. Indeed, some PET-derived metrics are already used in the classification and early response assessment of diseases such as lymphomas [50] and trends in PET-imaging feature extraction suggest other types of cancer may also benefit from them [51,52].
We acknowledge several limitations of the study. First, this is a retrospective study with a relatively small sample size without holdout test set available when training the prediction of response. 5-fold cross validation was used to report the findings as a way to reduce biases so our findings suggest significant association between PET parameters and treatment response in LARC but they need to be validated in larger cohorts before claiming any robust prognostic ability. Additionally, the reproducibility of the PET feature findings may depend on the scanner and software. Future guidelines for standardizing procedures remain to be established in the future [23]. Finally, the conclusions can only be applied to patients with LARC, so replication of the study in other pathologies is warranted.

Conclusion
In this paper, heterogeneity in PET images is shown to be of clinical relevance for the prediction of response to NAT in LARC patients and to have a significant association with key molecular biomarkers in cancer. The main results of this study show how a visual classification of heterogeneity and a further automatic assessment of heterogeneity using texture analysis could become an essential element in research or practical oncology procedures.
Prospective studies are needed to validate the inclusion of these heterogeneity-based metrics as a robust component of the multi-disciplinary approach for the prediction and modelling of response in rectal cancer. This could enable the development of tailored therapies that improve patient´s outcome.