Post-chemoradiotherapy FDG PET with qualitative interpretation criteria for outcome stratification in esophageal squamous cell carcinoma

Objectives Post-chemoradiotherapy (CRT) FDG PET is a useful prognosticator for esophageal cancer. However, the diverse and still debated criteria used in previous publications preclude worldwide multicenter comparisons and a universal practice guide. We aimed to validate a simple qualitative interpretation criterion of post-CRT FDG PET for outcome stratification and to compare it with other criteria.
Methods The post-CRT FDG PET scans of 114 patients with esophageal squamous cell carcinoma (ESCC) were independently interpreted using a qualitative 4-point scale (Qual4PS) that identified focal esophageal FDG uptake greater than liver uptake as residual tumor. Cohen's κ coefficient (κ) was used to measure interobserver agreement of Qual4PS. The Kaplan-Meier method and Cox proportional hazards regression analyses were used for survival analysis. Other criteria included a different qualitative approach (QualBK), maximal standardized uptake values (SUVmax3.4, SUVmax2.5), relative change of SUVmax between pre- and post-CRT FDG PET (ΔSUVmax), mean standardized uptake values (SUVmean), metabolic volume (MV), and total lesion glycolysis (TLG).
Results Overall interobserver agreement on the Qual4PS criterion was excellent (κ: 0.95). Except for QualBK, SUVmax2.5, and TLG, all criteria were significant predictors of overall survival (OS). Multivariable analysis showed that only Qual4PS (HR: 15.41; P = 0.005) and AJCC stage (HR: 2.47; P = 0.007) were significant independent variables. The 2-year OS rates of Qual4PS(-) patients undergoing CRT alone (68.4%) and trimodality therapy (62.5%) were not significantly different, but the 2-year OS rate of Qual4PS(+) patients undergoing CRT alone (10.0%) was significantly lower than that of Qual4PS(+) patients undergoing trimodality therapy (42.1%).
Conclusions The Qual4PS criterion is reproducible for assessing the response of ESCC to CRT and valuable for predicting survival. It may add value to response-adapted treatment for ESCC patients and help to decide whether surgery is warranted after CRT.


Introduction
Esophageal cancer is the sixth leading cause of cancer-related mortality worldwide, and the 5-year survival rate rarely exceeds 40% [1]. Most patients with esophageal cancer have advanced disease at the initial diagnosis and are treated with neoadjuvant chemoradiotherapy (CRT) as the standard therapy [2]. A robust stratification of patient responses to CRT based on non-invasive tools has not yet been well developed. After neoadjuvant treatment, neither clinical parameters nor endoscopic ultrasonography nor CT scans can reliably predict outcome. The post-CRT FDG PET, however, has emerged as a promising predictor of long-term survival, and it can be used to tailor individualized treatment for poor responders after neoadjuvant treatment [3]. Patients whose FDG PET results showed a complete response might not benefit from added resection, given their excellent outcomes without resection [4]. One study [5] reported that the pooled hazard ratio (HR) for a complete metabolic response (CMR) versus no response was 0.51 (95% confidence interval [CI], 0.40-0.64) for OS and 0.47 (95% CI, 0.38-0.57) for disease-free survival. Despite its utility for predicting outcomes, the lack of uniform and reliable criteria for post-CRT FDG PET interpretation appears to be the major obstacle to using the reported criteria universally. Methods to improve the predictive value of PET include a qualitative approach, e.g., comparing the tumors with healthy surrounding tissue [6], and quantitative approaches, e.g., comparing standardized uptake values (SUVs) with reported optimum SUV cut-off values, which vary from 2.5 to 4.5 [4,7,8], or comparing the relative reduction in SUV between pre- and post-CRT FDG PET (ΔSUV) with reported optimum cut-off values, which vary from 35% to 70% [9][10][11]. Wide ranges of sensitivities and specificities have been reported.
The variations appear to depend upon the different sets of criteria used for FDG PET interpretation, which are a matter of ongoing debate. Using a qualitative interpretative criterion for response assessment of FDG PET is well established and internationally recognized as the standard of care in FDG-avid lymphoma (referred to as the Deauville criteria) [12], and it is useful in other malignancies, including head and neck cancer [13], lung cancer [14], and cervical cancer [15]. Similar harmonization guidelines for interpretive criteria are needed for esophageal cancer to compare results from different studies, to perform multicenter trials, and to assist clinical practice at different sites. We developed a simple, qualitative, interpretive criterion of FDG PET to assess esophageal cancer therapy. We validated its reader reproducibility, determined its value for predicting survival, and compared it with other visual-based and quantitative SUV-based assessment criteria.

Inclusion and exclusion criteria
Inclusion criteria for the study were (a) histopathology-confirmed ESCC diagnosed between January 2011 and December 2014, and (b) a post-therapy assessment FDG PET performed at our hospital after the patient had completed CRT. Exclusion criteria were (a) prior treatment for ESCC, (b) a history of other malignancies, or (c) a post-therapy assessment FDG PET done more than 6 months after the patient had completed CRT. The Institutional Review Board approved this retrospective study (IRB #: 201700267B0).

Treatment and follow-up
The CRT consisted primarily of two cycles of 5-fluorouracil/cisplatin-based chemotherapy and thoracic radiation (42-66 Gy). Trimodality therapy included a post-CRT esophagectomy, which was usually scheduled 2-4 months after the patient had completed CRT. The 7th American Joint Committee on Cancer (AJCC) staging system was used to evaluate all patients, and all were followed up until September 2016 or until death.

FDG PET imaging and analysis
FDG PET scans (Discovery ST PET/CT system; GE Healthcare, Waukesha, WI, USA) were generally started one hour after the patients, who had fasted for at least 6 hours, had been injected with 370-555 MBq of FDG. Unenhanced CT scans were acquired first for attenuation correction and image fusion, and then PET scans (5 min/bed) from the skull to the midthigh were acquired. The PET images were reconstructed to a resolution of 5.47 × 5.47 × 3.27 mm using an ordered subsets expectation maximization algorithm. The reconstructed images were displayed in transaxial, coronal, and sagittal planes, and as a maximum intensity projection for interpretation.
For each PET dataset, the SUVmax was defined as the highest SUV within hypermetabolic tumor boundaries. The SUVmax reduction rate (ΔSUVmax), i.e., the percentage reduction of the primary tumor's SUVmax from pre-CRT FDG PET to post-CRT FDG PET, was calculated using the formula:
$$\Delta\mathrm{SUV}_{\max}\,(\%) = \frac{\mathrm{SUV}_{\max}^{\text{pre-CRT}} - \mathrm{SUV}_{\max}^{\text{post-CRT}}}{\mathrm{SUV}_{\max}^{\text{pre-CRT}}} \times 100$$
Qualitatively, the post-CRT FDG PET was scored on the esophageal tumor using the Qual4PS qualitative 4-point scale:
• Score 1: No detectable focal uptake.
• Score 2: Focal FDG uptake greater than that in the surrounding tissue or in the mediastinal blood pool, but not greater than that of the liver.
• Score 3: Diffuse FDG uptake greater than that in the mediastinal blood pool up to marginally greater than that of the liver, and suggestive of esophagitis.
• Score 4: Focal FDG uptake substantially greater than that of the liver.
Fig 1. Examples of the Qual4PS scoring system: (a) Score 1; (b) Score 2; (c) Score 3; (d) Score 4.
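The two post-CRT readouts described above, the percentage reduction of SUVmax and the dichotomized Qual4PS result, can be illustrated with a minimal Python sketch. The function names and the example values are hypothetical and are not part of the study's software; the rule that only Score 4 counts as positive for residual tumor follows the abstract and the reviewer-agreement results.

```python
def delta_suvmax_percent(pre_suvmax: float, post_suvmax: float) -> float:
    """Percentage reduction of SUVmax from pre-CRT to post-CRT FDG PET."""
    if pre_suvmax <= 0:
        raise ValueError("pre-CRT SUVmax must be positive")
    return (pre_suvmax - post_suvmax) / pre_suvmax * 100.0


def qual4ps_is_positive(score: int) -> bool:
    """Score 4 is read as positive for residual tumor; Scores 1-3 as negative."""
    if score not in (1, 2, 3, 4):
        raise ValueError("Qual4PS score must be 1, 2, 3, or 4")
    return score == 4


# Hypothetical patient: SUVmax falls from 10.0 to 2.5, tumor read as Score 2.
reduction = delta_suvmax_percent(10.0, 2.5)
residual_tumor = qual4ps_is_positive(2)
```

Keeping the two readouts as separate functions mirrors the text, where the quantitative and qualitative criteria are evaluated independently of each other.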
The FDG PET/CT images were retrieved from a picture archiving and communication system and were read by five board-certified nuclear medicine physicians at four hospitals, with 25 years (NTC), 11 years (CCH), 10 years (YCH), 8 years (CJC), and 5 years (KWH) of experience. Blinded to patient histories and outcomes, the reviewers scored the scans independently to determine whether the reporting system would be reproducible between observers across different institutions. The final consensus result of negative or positive for a residual tumor was assigned when at least 3 reviewers agreed.
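The consensus rule for the five independent reads can be sketched as a simple majority vote. This is only an illustration of the rule stated above; the function name and the default threshold argument are assumptions, not the authors' code.

```python
from typing import Sequence


def consensus_positive(reader_calls: Sequence[bool], min_agree: int = 3) -> bool:
    """Return the consensus call (True = positive for residual tumor).

    A call becomes the consensus when at least `min_agree` readers share it.
    """
    positives = sum(1 for call in reader_calls if call)
    negatives = len(reader_calls) - positives
    if positives >= min_agree:
        return True
    if negatives >= min_agree:
        return False
    raise ValueError("no call reached the required number of agreeing readers")


# Hypothetical 3:2 split among five readers, resolved as positive.
split_read = consensus_positive([True, True, True, False, False])
```

With five readers and a binary call, one of the two results always reaches three votes, so the error branch only matters if the function is reused with a different panel size or threshold.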
To compare the predictive value of each set of criteria, eight other criteria were also used. QualBK used the same 4-point scale, but the cut-off level was changed to surrounding background uptake [6], so that Scores 1 and 3 were negative and Scores 2 and 4 were positive. SUVmax3.4 [16] and SUVmax2.5 [7] used different SUVmax cut-offs for the primary tumor uptake on post-CRT FDG PET scans. ΔSUVmax71.6% and ΔSUVmax50% [11] used cut-off values for the relative reduction of SUVmax in the primary tumor between pre- and post-CRT FDG PET scans. The mean standardized uptake value (SUVmean), metabolic volume (MV), and total lesion glycolysis (TLG = SUVmean × MV) were also extracted for each primary lesion. The cut-off values of ΔSUVmax 71.6%, SUVmean 2.4, MV 2.2, and TLG 4.99 were the median values for the primary tumor in our patients. The 9 criteria stratified patients into good responders and poor responders.
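As a rough illustration of how a median-based cut-off splits a cohort, the sketch below computes TLG as SUVmean × MV and dichotomizes a list of values at their median. The function names and the toy values are hypothetical, and treating values strictly above the median as the "high" group is an assumption, because the text does not state how values equal to the cut-off were assigned.

```python
from statistics import median
from typing import List, Sequence, Tuple


def total_lesion_glycolysis(suv_mean: float, metabolic_volume: float) -> float:
    """TLG = SUVmean x MV, as defined in the text."""
    return suv_mean * metabolic_volume


def dichotomize_by_median(values: Sequence[float]) -> Tuple[float, List[bool]]:
    """Return the cohort median and a flag per value (True = above the median).

    Assumption: values equal to the median fall in the lower group.
    """
    cutoff = median(values)
    return cutoff, [value > cutoff for value in values]


# Toy TLG values for five hypothetical lesions (not study data).
toy_tlg = [total_lesion_glycolysis(s, v) for s, v in
           [(2.0, 1.0), (2.0, 2.0), (3.0, 2.0), (4.0, 2.0), (5.0, 2.0)]]
cutoff, high_group = dichotomize_by_median(toy_tlg)
```

For uptake measures such as TLG, the high group would correspond to poor responders, whereas for ΔSUVmax a larger reduction indicates a better response, so the direction of the flag has to be interpreted per criterion.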
Associations between survival outcome and FDG PET/CT findings, negative [Qual4PS(-)] or positive [Qual4PS(+)] for residual primary tumor and negative (PET-CR) or positive (PET-nonCR) for malignant disease (i.e., including metastatic lesions), were analyzed to test the ability to tailor treatment for selective surgical resection. A final consensus result of Qual4PS(+) was documented as PET-nonCR. Unexplained FDG-avid foci in lymph nodes or distant organs were reported as positive for metastases and also documented as PET-nonCR. Exceptions were mediastinal nodal tracer uptake with calcification or high attenuation (>70 Hounsfield units [HU]), and symmetric low-to-intermediate intensity FDG uptake in both pulmonary hilar regions with or without extension into the subcarinal and paratracheal nodal regions. All metastatic lesions were confirmed by pathology or by follow-up information.

Statistical analysis
Categorical variables were expressed as frequencies (%), and continuous variables as means ± SD or medians (IQR). Agreement between paired reviewers was analyzed using Cohen's κ, and overall agreement among the five reviewers was measured using Randolph's free-marginal multirater κ. OS was defined as the period from the date of pathologically verified ESCC to the date of the last follow-up or death of the patient from any cause. The Kaplan-Meier method was used for survival analysis, and the difference between survival curves was analyzed using a log-rank test. Univariable and multivariable Cox proportional hazards regression analyses were used to identify independent predictors of OS. SPSS 17 for Windows (SPSS Inc., Chicago, IL, USA) was used for all statistical analyses. Significance was set at P < 0.05.
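For readers unfamiliar with the agreement statistics, the following Python sketch shows one way to compute Cohen's κ for a reader pair and Randolph's free-marginal multirater κ (named in the reviewer-agreement results) for a whole panel. It is an illustration under stated assumptions, not the study's SPSS workflow; the function names are hypothetical, and the example cases are constructed only from the discordance pattern reported in the results (108 unanimous reads, 3 reads split 3:2, 3 reads split 4:1).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(rater1: Sequence[Hashable], rater2: Sequence[Hashable]) -> float:
    """Cohen's kappa: chance agreement is estimated from each rater's marginals."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater1)
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum(c1[cat] * c2[cat] for cat in set(c1) | set(c2)) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_obs - p_exp) / (1.0 - p_exp)


def randolph_free_marginal_kappa(cases: Sequence[Sequence[Hashable]],
                                 n_categories: int) -> float:
    """Randolph's free-marginal kappa: chance agreement is fixed at 1/k."""
    per_case = []
    for ratings in cases:
        n_raters = len(ratings)
        agreeing_pairs = sum(c * (c - 1) for c in Counter(ratings).values())
        per_case.append(agreeing_pairs / (n_raters * (n_raters - 1)))
    p_obs = sum(per_case) / len(per_case)
    p_exp = 1.0 / n_categories
    return (p_obs - p_exp) / (1.0 - p_exp)


# Five readers, binary call; pattern taken from the reported discordances.
cases = [(0,) * 5] * 108 + [(1, 1, 1, 0, 0)] * 3 + [(1, 1, 1, 1, 0)] * 3
overall_kappa = randolph_free_marginal_kappa(cases, n_categories=2)
```

Under these assumptions the overall value comes out at about 0.947, which rounds to the reported 0.95. Because the free-marginal statistic does not depend on how many unanimous reads were positive versus negative, the unanimous cases can be coded with a single label here.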

Patient characteristics
One hundred fourteen patients (mean age: 55.2 years; range: 32-80; 3 women and 111 men) were included in the study. The median follow-up was 33.2 months for living patients (range: 20.3-69.2 months). Most patients had an Eastern Cooperative Oncology Group (ECOG) performance status score of 1 (n = 98 [86.0%]), and most were in AJCC stage III (n = 93 [81.6%]). The mean SUVmax of residual tumor uptake on post-CRT FDG PET was 3.8 ± 2.4. In the 68 patients (60%) with available pre-CRT FDG PET scans, the mean SUVmax of the pre-treatment tumor was 12.8 ± 6.7, and the evaluable median ΔSUVmax was 71.6% (IQR: 47.9-82.7). Forty-three patients (37.7%) underwent trimodality therapy, and 71 (62.3%) underwent CRT alone (dCRT), including 10 with a salvage esophagectomy between 190 and 595 days post-CRT. The median interval between the date of the post-CRT FDG PET and the esophagectomy was 27 days (IQR: 21-37) in the trimodality group. The time between the injection of FDG and the acquisition of PET images was 58.7 ± 6.6 minutes (range: 43-80 minutes). The demographic features of the patients are summarized in Table 1.

Agreement among reviewers
In the 114 post-CRT FDG PET scans, the rates of residual tumors categorized as positive (Score 4) by the five reviewers ranged from 33.3% to 35.1%, and the rate was 34.2% in the final consensus. The agreement of Qual4PS between paired reviewers for negative versus positive results was "excellent" (Cohen's κ: 0.923-0.961, Table 2). The overall reviewer agreement measured using Randolph's free-marginal multirater kappa was 0.95 (95% CI: 0.91-0.99). Discordant classification occurred in only 6 patients (5.3%; 3 with a 3:2 split and 3 with a 4:1 split), and their tumor SUVmax ranged from 2.9 to 3.8. Nineteen of all study patients (16.7%) had an SUVmax between 2.9 and 3.8, and the discordant classifications accounted for 6 of these 19 tumors (31.6%) (Fig 2).
Based on the post-CRT FDG PET result [Qual4PS(-) or Qual4PS(+)] and therapeutic management (dCRT or trimodality), our patients could be divided into four distinct subgroups with different OS rates. The 2-year OS rates were 68.4% for the Qual4PS(-)/dCRT group, 62.5% for the Qual4PS(-)/trimodality group, 42.1% for the Qual4PS(+)/trimodality group, and 10.0% for the Qual4PS(+)/dCRT group. The Qual4PS(-)/dCRT and Qual4PS(-)/trimodality groups had equivalent OS rates. The Qual4PS(-)/trimodality group had a nonsignificantly higher survival rate than did the Qual4PS(+)/trimodality group. The Qual4PS(+)/dCRT group had a significantly lower survival rate than did the other three groups (Fig 5a). These data indicated that dCRT and trimodality therapy resulted in no significant difference in 2-year OS for patients with Qual4PS(-), whereas trimodality therapy might be a better choice for patients with Qual4PS(+). When patients were subgrouped by PET-CR status based on Qual4PS, the 2-year OS rates were 74.8% for the PET-CR/dCRT group, 65.2% for the PET-CR/trimodality group, 40.0% for the PET-nonCR/trimodality group, and 14.8% for the PET-nonCR/dCRT group (Fig 5b). After patients with stage IV disease had been excluded, similar differences in survival between subgroups were obtained (S1 Fig).
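The 2-year OS rates quoted above are estimates from the Kaplan-Meier method named in the Statistical analysis section. As an illustration only, the product-limit estimate at a given time point can be sketched as follows; the function name, the argument layout, and the toy follow-up data are hypothetical and do not reproduce the study's data or its SPSS procedure.

```python
from typing import Sequence


def kaplan_meier_survival(times: Sequence[float],
                          died: Sequence[bool],
                          at_time: float) -> float:
    """Product-limit estimate of the survival probability at `at_time`.

    times: follow-up time per patient (same unit as `at_time`)
    died:  True if the patient died at that time, False if censored
    """
    if len(times) != len(died):
        raise ValueError("times and died must have the same length")
    data = sorted(zip(times, died))
    survival = 1.0
    at_risk = len(data)
    i = 0
    while i < len(data) and data[i][0] <= at_time:
        current = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share this follow-up time.
        while i < len(data) and data[i][0] == current:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
        at_risk -= leaving  # deaths and censored patients leave the risk set
    return survival


# Toy subgroup of five patients, follow-up in months (not study data).
two_year_os = kaplan_meier_survival([6, 12, 18, 30, 36],
                                    [True, False, True, False, False],
                                    at_time=24)
```

In this toy example, censored patients contribute to the risk set only until their last follow-up, which is the reason the estimate differs from the simple proportion of patients known to be alive at 24 months.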

Discussion
We found that using the proposed Qual4PS qualitative interpretation criterion for post-CRT FDG PET to assess the treatment response of ESCC provided good predictive value for survival outcome and yielded excellent interobserver agreement between reviewers from different hospitals. Additionally, it might offer a guide for deciding on post-CRT surgery. FDG PET has been evaluated for monitoring the therapeutic response of ESCC and other malignancies. The widely available and easy-to-use SUV is the method of choice in most studies. However, the reliability of SUV measurement is affected by many factors, such as inter-scanner variability, calibration errors, image acquisition and reconstruction parameters, attenuation correction, scatter correction, respiratory motion, and the partial volume effect, all of which make proper comparisons between different cohorts problematic [17]. One aim of this study was to determine whether a qualitative scoring system is practical and sufficiently robust to enable standardization of reporting across different hospitals. The major requirements of a qualitative visual interpretation are suitable criteria for interpretation and reproducibility between different observers. Using liver uptake as the reference cut-off seemed appropriate for post-therapy FDG PET interpretation. After they had studied a serial example of the scoring system (Fig 1), five reviewers from four different hospitals showed excellent agreement (Randolph's kappa: 0.95). Various analytic models have been published for predicting outcome and optimally discriminating between responders and nonresponders by defining (a) a cut-off level for the residual tumor FDG uptake on the post-therapy scan, or (b) a percentage decrease of SUV level between pre- and post-therapy scans.
There is, however, no consensus about which post-therapy FDG PET cut-off criteria are the best predictors of the outcomes of ESCC patients or which most accurately identify patients who benefit from surgery. Jeong et al. [6] qualitatively defined PET-CR as a decrease in FDG uptake to a level indistinguishable from that of the surrounding normal tissue. Moreover, SUVmax levels ≤ 3.0 [4] and < 2.5 [7] have been used quantitatively for PET-CR. All of these studies verified that post-therapy PET-CR was a significant independent predictor of improved outcomes after CRT. There has been growing interest in using ΔSUV as a robust method of measuring metabolic response for predicting good and poor outcomes [18]. Many of these studies chose the median value of their cohort data as the cut-off. Thus, a wide range of SUV reduction cut-offs, from 35% to 70%, has been reported [9][10][11]. The variance suggests a lack of standardization and might be explained by factors such as the spectrum of disease severity and differences in the clinical features and therapy of each selected patient group. In this study, we also tested (a) the adjusted cut-off of SUVmax3.4, because of its optimal ability to detect post-CRT viable residual tumors in our institution [16], and (b) the adjusted cut-off of ΔSUVmax71.6%, which was the median reduction value in this cohort. As expected, the adjusted cut-off of SUVmax3.4 was better than the reported SUVmax2.5, and the adjusted cut-off of ΔSUVmax71.6% was better than the reported ΔSUVmax50% for predicting good and poor outcomes.
An optimal treatment strategy should balance improved survival with minimized therapy-related morbidity, mortality, and quality-of-life deterioration. The necessity of surgical resection after CRT remains controversial. We found equivalent survival for patients in the CRT-alone and trimodality groups, which was consistent with the results of randomized trials in which most patients had ESCC [19,20]. Post-CRT esophagectomies are associated with an approximately 50% postoperative morbidity rate and a 10% postoperative mortality rate [21][22][23]. In this cohort, the 1- and 2-month postoperative mortality rates were 9.3% and 14.0%, respectively. A non-invasive surrogate marker after CRT is needed to indicate whether additional surgery can be delayed, or even omitted, or whether it should be requested. An endoscopic biopsy for pathologic response might not be the best predictor of outcomes after CRT in esophageal cancer; the association of PET-CR with outcomes is believed to be more clinically relevant [24,25]. Retrospective studies that evaluated the potential of a PET response-adapted strategy to identify patients for whom surgery might be avoided reported that the OS of PET-CR patients treated with CRT alone was equivalent to that of patients treated with trimodality therapy [4,26]. In a prospective multicenter study of 43 patients, tailoring treatment based on post-CRT FDG PET scans for selective surgical resection showed promising efficacy [27]. In our study, the additional post-CRT esophagectomy significantly improved the OS of post-CRT Qual4PS(+) patients, but it did not significantly improve the OS of post-CRT Qual4PS(-) patients. After CRT, using FDG PET with the Qual4PS interpretation criterion might be useful for determining the need for additional surgery. Large randomized multicenter studies to further evaluate this organ-preserving approach of individualized therapy are still required, however.
We believe that the harmonized interpretation criterion "Qual4PS" presented here is suitable for post-CRT FDG PET scans in future multicenter trials to identify patients who benefit from CRT and, therefore, have a favorable outcome.
The study was retrospective and thus prone to selection bias. Our results are not sufficient to change routine clinical practice for all esophageal cancers and should be interpreted cautiously. Six of the 19 patients (31.6%) with an SUVmax between 2.9 and 3.8 were given discordant classifications by the reviewers. For tumor uptake within this relatively challenging SUVmax range, we recommend that multiple reviewers be required to reach a consensus. Prospective trials that evaluate FDG PET response based on the Qual4PS criterion to predict outcomes of esophageal cancer, and that are embedded in a randomized treatment algorithm, are necessary.

Conclusions
The proposed Qual4PS interpretation criterion of FDG PET for therapy assessment of ESCC has excellent interobserver agreement and provides good predictive value for survival outcome. It is comparable to, and in some comparisons better than, quantitative criteria with different cut-offs. It may provide important information about which patients will benefit from an esophagectomy after CRT.