Inter-Reader Reliability of Early FDG-PET/CT Response Assessment Using the Deauville Scale after 2 Cycles of Intensive Chemotherapy (OEPA) in Hodgkin’s Lymphoma

Purpose The five point Deauville (D) scale is widely used to assess interim PET metabolic response to chemotherapy in Hodgkin lymphoma (HL) patients. An International Validation Study reported good concordance among reviewers in ABVD treated advanced stage HL patients for the binary discrimination between score D1,2,3 and score D4,5. Inter-reader reliability of the whole scale is not well characterised. Methods Five international expert readers scored 100 interim PET/CT scans from paediatric HL patients. Scans were acquired in 51 European hospitals after two courses of OEPA chemotherapy (according to the EuroNet-PHL-C1 study). Images were interpreted in direct comparison with staging PET/CTs. Results The probability that two random readers concord on the five point D score of a random case is only 42% (global kappa = 0.24). Aggregating to a three point scale D1,2 vs. D3 vs. D4,5 improves concordance to 60% (kappa = 0.34). Concordance if one of two readers assigns a given score is 70% for score D1,2 only 36% for score D3 and 64% for D4,5. Concordance for the binary decisions D1,2 vs. D3,4,5 is 67% and 86% for D1,2,3 vs D4,5 (kappa = 0.36 resp. 0.56). If one reader assigns D1,2,3 concordance probability is 92%, but only 64% if D4,5 is called. Discrepancies occur mainly in mediastinum, neck and skeleton. Conclusion Inter-reader reliability of the five point D-scale is poor in this interobserver analysis of paediatric patients who underwent OEPA. Inter-reader variability is maximal in cases assigned to D2 or D3. The binary distinction D1,2,3 versus D4,5 is the most reliable criterion for clinical decision making.


Introduction
Several large treatment studies for Hodgkin lymphoma (HL) employ interim FDG-PET during chemotherapy to tailor treatment intensity-either by treatment intensification in non-adequate responders or by treatment reduction in adequate responders [1][2][3][4][5]. In general a binary decision has to be reached classifying a PET scan as adequate (= PET negative) or inadequate (= PET positive) metabolic response.
Several respective classification schemes have been developed in lymphoma during the last years [5][6][7][8]. In 2009 the Deauville 5-point scale was published as an international consensus for assessing metabolic response to chemotherapy [8]. The Deauville consensus eliminated the size criterion and introduced the liver as a second reference structure in addition to the mediastinum (D1 = no enhanced uptake, D2 = uptake < = mediastinum, D3 = uptake > mediastinum but < = liver, D4 = uptake moderately > liver, D5 = uptake strongly > liver). Current studies interpret either scores D3,4,5 (RAPID e.g.) or only scores D4,5 (RATHL e.g.) as PET positive. In 2012 the expert group agreed that in general scores D1,2,3 can be considered as complete metabolic response, so that mostly scores D4,5 should be classified as PET positive [9].
In current trials the interim PET scan is usually scored in a central reference reading by one or several expert groups. The conformity of the interpretation between experts using the same scales is a precondition for the comparability of results from different trials. Demonstrating excellent agreement among trained experts could be a first step towards handing over to a local reading process and a use of the method in clinical routine outside of studies. An International Validation Study reported good concordance among reviewers in ABVD (Doxorubicin, Bleomycin, Vinblastine, and Dacarbacine) treated advanced stage HL patients for the discrimination between score 1,2,3 and score 4,5 only. Inter-reader reliability of the full scale is not well characterised.
In addition, from real time review of images for the large European Network in Paediatric Hodgkin Lymphoma C1 study in classical HL (EuroNet-PHL-C1) starting in 2007 we had the impression that there is a substantial proportion of difficult to interpret cases after treatment with 2 cycles of OEPA (Vincristine, Etoposide, Prednisone and Doxorubicin) which is a more intensive chemotherapy than ABVD.
The aim of the present reference reading study was to evaluate the inter-observer agreement among experts using the D-scale after 2 cycles of intensive chemotherapy with OEPA.
In detail, the following questions should be answered: The PET/CT scans were obtained in 51 trial sites in 9 European countries. The patients for this retrospective analysis were randomly selected irrespective of image quality, physiological uptake pattern like enhanced uptake in brown adipose tissue, use of i.v. contrast agent or technical specifications of the PET/CT scanner.

PET/CT acquisition
The scans were obtained according to the local protocol of the individual PET centres but were in agreement with the EuroNet-PHL-C1 PET-manual and the Guidelines for 18 F-FDG PET and PET-CT imaging in paediatric oncology of the European Association of Nuclear Medicine [10]. The administered 18 F-FDG activity was 260±116 MBq, resp. 4.9±1.4 MBq/kg (mean ± SD). The interval between 18 F-FDG injection and start of image acquisition was 73 ±20 min. Images were attenuation corrected using iterative reconstruction.

Reference readers (R), allocation of PET/CT data sets and study template
The five R (MH Austria, CK and MD Germany (combined reading), BM Poland, FM France and LC Bulgaria) are nuclear medicine experts with special expertise in the interpretation of PET/CT in Hodgkin's lymphoma. LC worked 3 years as R in the EuroNet PET/CT Core lab, FM, BM and MH received on-site training at the Core lab in Leipzig in the frames of the EAHC-funded Paediatric Hodgkin Network project, CK and MD are R in the PET/CT Core lab of the German Hodgkin lymphoma study group (HD 15-19 trials) and in long-term regular clinical contact with the EuroNet Core lab.
All PET/CT scans had been transferred in DICOM format via the Paediatric Hodgkin Network to a cloud server (TeleHERMES TM , HERMES Medical Solutions AB, Sweden) for real-time central review [11]. In the Core Lab in Leipzig the 200 PET/CT scans (baseline and interim PET/CT from 100 patients) were anonymised, provided with a new case number and saved in a special folder on the cloud server. No additional clinical information was added. The Rs were provided with right of access to the study folder on the Hermes cloud server, with a template for the recording of the evaluation results and a list to encode regions of tumour involvement (26 lymph node areas and 5 extra-nodal areas-spleen, liver, left and right lung, skeleton). All reviewers used the dedicated workstation of Hermes Medical Solutions, Sweden, for the evaluation of the PET/CT datasets. For the interpretation the software Hybrid Viewer was used allowing the interpretation of the interim PET/ CT in direct comparison with the co-registered identical slices of the baseline PET/CT. The following parameters had to be filled in by the reviewers: scan quality (good-0 points, limited-1 point, poor-2 points); activation of brown adipose tissue (no-0 points, moderate-1 point, strong-2 points); artefacts (no or list of affected regions); suspicion for inflammatory lesions (region numbers); Deauville score and region number of the region with highest residual tumour uptake (in case of no residual uptake only the score D1 was filled in), and-if applicable-of the regions with second or third highest residual tumour uptake. Deauville scoring was strictly visual without supportive SUV measurements. In addition, comments could be saved in a free-text column.
The readers agreed before starting analysis of interim PET scans that new lesions (not present in the baseline PET) were not interpreted as lymphoma residuals if the patient was responding to treatment in initially involved sites and that only focal enhanced FDG uptake in initially involved areas was interpreted as residual tumour.

Statistics
By design five nuclear physicians experienced in FDG-PET based response assessment in Hodgkin lymphoma independently scored the same 100 cases on the five category D scale. As the true D scores of the cases are not ascertainable we have focused on inter-rater agreement.
The most pertinent analysis answers a simple clinical question: If my primary nuclear physician scored my patient D score d-what is the chance that an independent second opinion assigns score d too? To answer this question we estimate the conditional probability of consensus within a random pair of readers who both look independently at the same random case and at least one of the readers assigns score d. Estimation follows Uebersax [12]. Note that by conditioning on at least one reader assigning score d, these estimates become less dependent on the underlying population (similar to sensitivity and specificity in the case of a diagnostic test and known gold standard). In addition we describe the overall probability of agreement of two randomly selected readers on a random case.
In the literature, inter-reader agreement is often analysed using kappa coefficients. Therefore we also include a kappa based analysis. We use Fleiss's kappa (appropriate for multiple readers) in a version that is identical to Cohan's kappa in the case of two readers. Note that conceptually kappa is not a measure of the degree of agreement, but measures degree of agreement beyond that expected based on the observed marginal score distributions of the rater [13]. Comparing kappa values across publications is problematic since kappa depends on the underlying population of cases [14].
The influence of image quality was investigated calculating measures of concordance for subgroups with good (0 quality point), slightly reduced (1 point) or reduced quality (> 1 point). The influence of FDG uptake in brown fat tissue was explored in a similar way.

Results
For full transparency the individual scores of the five readers in the 100 cases are available as S1 Table. Differences in distribution of D scores The distribution of assigned D scores varies among the different readers (Table 1). While readers R1, R2 and R4 detected no enhanced residual FDG uptake above background in about half of the patients, readers R3 and R5 found completely negative PET (D1) in only 10 or 14% of the cases, respectively. Interpretation of reader R5 shows a clear shift from D1 to D2 in comparison to the other readers, that of reader R3 from D1 and D2 to D3 and D4.

Proportion of cases with consensus
An exact agreement among all five readers on an exact D score was reached in 12/100 cases (7x D1, 1 x D2, 0 x D3, 3 x D4, 1 x D5), an agreement of at least four of the readers in 29 cases (12 x D1, 2 x D2, 7 x D3, 6 x D4, 2 x D5). Since the decision D1 vs. D2 is of less clinical importance (always considered to be PET negative) we checked the agreement after summarizing D1 and D2 to one group. Under this condition an agreement among all 5 readers was reached in 22 cases (18 of them with score D1/2), four readers agreed in 58/100 cases (43 of them with score D1/2). An agreement of all five readers on a positive or negative PET using a threshold between D2 and D3 was reached in 33/100 cases (18 negative and 15 positive), an agreement of at least four of the readers in 71 cases (43 negative and 28 positive). Using a threshold between D3 and D4, all five readers agreed in 72 cases (65 negative and 7 positive), four readers agreed in 88 cases (76 negative and 12 positive).  Table 2 summarizes measures of agreement for the full five point D-scale as well as for derived three and two category aggregated scales. For each scale we provide estimates of the overall probability of agreement of two randomly selected readers on a random case. Also estimates of the category-specific conditional probability of agreement given that at least one in a pair of readers scores a certain category are provided. In addition, global kappa coefficients and the range of the ten pair-wise kappa coefficients among the five readers are reported. For pair-wise kappa coefficients between readers see S2 The probability that two random readers assign exactly the same D score is only 42% overall. Specific probabilities of concordance are highest for the extreme values D1 or D5 (> 50%) and lowest for D2 (25%) and D3 (36%). Aggregation of the scale to only three categories (D1,2 vs. D3 vs. D4,5) increases the overall probability of concordance to 60%. However, the probability to agree on score D3 remains low (36%). The restriction of the scale to a binary decision (PET positive or negative) results in a further increase in probability to concord, with markedly better results when the threshold is made between D3 and D4 (86%) as compared to a threshold between D2 and D3 (67%). The concordance rate in the latter case profits from the large fraction of negative cases (D1,2,3: 92%), while the probability of concordance on a positive case (D4,5) is only 64%.

Measures of agreement
Kappa for agreement on exactly the same D score is globally only 0.24 ranging from 0.14 to 0.39 in pairs of reviewers. Limiting the interpretation to a binary decision D1,2 "negative" versus D345 "positive" improves the global kappa to 0.36. Pairwise kappa values range between 0.19 and 0.51. This broad range is mainly due to the systematically higher rate of D3 in reader R3. The Kappa values of the four other readers range between 0.39 and 0.51. Using a binary

Concordance of interpretation in different regions of the body
In order to explore whether there are regions with particular problems we analysed discrepancy patterns by region (Table 3) taking the frequency of initial involvement into account. To eliminate regional discrepancies which are caused by different interpretation of region borders, the 21 lymph node areas were aggregated to 8 larger lymph node regions: upper, lower neck and supraclavicular to "neck" (left and right), infraclavicular and axillar to "axilla" (left and right), nodes in upper, middle, lower mediastinum, both sides lung hilus and recessus phrenicocostalis to "mediastinum", upper and lower paraaortic nodes and nodes at spleen and liver hilus and intestinum to "abdomen", iliac and inguinal nodes to "pelvic" (left and right). In addition, 5 extranodal regions were evaluated: left and right lung, spleen, liver and skeleton. We used the three category D1,2 vs. D3 vs D4,5 scale in order to focus on clinically relevant decisions.
Many discrepancies occurred in the mediastinum which was initially involved in 88 cases and on which consensus of all readers was achieved in 40/100 cases only. 288 of 306 (94%) pairwise reader discrepancies differed only by one category suggesting grading ambiguity. The left neck was involved in 74 patients. Unanimous reader consensus was reached in 72 of 100 cases. Among the 1000 pairwise reader comparisons 849 were concordant. Among the 151 pairwise reader discrepancies 31 (21%) were major (i.e. clearly negative D1,2 versus clearly positive D4,5).
The right neck was involved in 73 patients. Unanimous reader consensus was reached in 84 of 100 cases. Among the 1000 pairwise reader comparisons 908 of 1000 were concordant. Among the 92 pairwise reader discrepancies 36 (39%) were major.The skeleton was involved in only seven patients, but this region stands out with the highest proportion of major reader discrepancies (22 major and 8 minor pairwise discrepancies).
In the other nine regions including axillary regions, spleen, abdomen and lung unanimous reader consensus rate was above 95%. These regions were typically negative in the interim PET. In these regions neither scoring nor interpretation/detection discrepancies played a relevant role.  Influence of image quality or brown adipose tissue (BAT) uptake Good quality was attested by all five readers to 58 of the interim PET/CT. For 25 scans only one of the readers mentioned reduced quality, for 17 scans reduced quality was stated by more than one reader. The probability to concord on the D score was not different in cases with adequate or reduced quality ("good quality" 0.610 vs. "limited" one reader 0.595 vs. "limited" several readers 0.618; Uebersax method). Enhanced FDG-uptake in BAT was stated by at least one of the readers in 17 cases, in 11 cases the majority of the readers mentioned enhanced BAT uptake. The probability to concord on the D score decreased slightly with the occurrence of BAT uptake (0.611 without BAT uptake vs. 0.571 "BAT" one reader vs. 0.500 "BAT" three or more readers).

Discussion
In our study five readers with particular experience in the interpretation of PET scans in HL independently scored 100 cases randomly selected from a large clinical trial. All readers were able to directly compare interim PET/CT with the respective staging PET/CT. Inter-reader reliability of the 5 point D scale was rather poor. The estimated probability of two random readers to concord on the exact same D-score in a random case was only 42%. In particular there is massive inter-reader variability in cases assigned for D2 or D3.
Only the distinction D 1,2,3 versus D4,5 may be sufficiently reliable for clinical decision making (86% probability of agreement). This good result is driven by excellent concordance on negative cases while the conditional probability of concordance if one reader scores D4,5 was only 64%.

Possible explanations for the observed discrepancies
Ambiguity in interpreting reference uptake levels. The distribution of score values varied markedly between readers (Table 1). This suggests that different interpretations of the score level definition used in the Deauville scoring may play a role. This is especially relevant for the background definition (D1 vs. D2). The physiological uptake pattern causes broad variability of local backgrounds. Obviously some readers consequently use the local background (any visible uptake in a tumour residual is per definition >D1) while others also grade lesions with a slightly enhanced uptake in a near-zero background into group D1 (see Fig 1).
Another source of discrepancies is the definition of the mediastinal region (D2 vs. D3). While earlier scoring systems used the term "mediastinal blood pool" [6] the D scoring system uses the term "mediastinum" [8]. The inhomogeneity of the FDG-uptake in this anatomically multifarious region increases the risk of discrepancy. This problem is less relevant in the liver (D3 vs. D4) because of the large size of the organ and the only moderate inhomogeneity of the physiological FDG uptake, especially in children.
Finally, for the classification D4 vs. D5 the criterion "markedly enhanced uptake" is prone to subjective interpretation.
Ambiguity in assessing residual tumour uptake and comparing grey levels. The D criteria do not precisely prescribe which part of the residual uptake is to be used for comparison with reference regions. Is it the mean uptake in the whole residual uptake area or the mean uptake of its hottest part (how delineated?) or the maximum uptake which should be used for the visual comparison?
Inter-reader differences may also occur during the process of the visual comparison of the residual uptake with the reference regions. It can be difficult to compare grey levels across a distance in particular in case of small residuals (see Figs 2 or 4). In some cases residuum and reference structures cannot be compared directly since they are located in different planes. In addition, the comparison may be compromised by optical illusion effects like the simultaneous brightness contrast illusion or the subjective colour constancy (see Edward Adelson Chequershadow illusion 1995) [15].
Our recently published semi-automatic qPET quantification method [16] uses the average of the maximum voxel with the three highest adjacent voxels to quantify the residual uptake and a standardised VOI to determine the reference liver signal in an attempt to eliminate these sources of variability.
Enrichment of borderline cases. Concordance rates between readers and kappa coefficients depend on the proportion of clear-cut cases in the sample. In our data in 68/100 cases at least one reader scored residual uptake as D3 indicating a high proportion of borderline cases. This preponderance of borderline cases is typical in our setting (16) and particularly impairs the discrimination D2 vs. D3. This may be due to using intensive OEPA chemotherapy; after ABVD the distinction between metabolically responding and non-responding cases may be more clear-cut (see below).
Problem regions. Mainly three regions gave rise to discrepant interpretations: mediastinum, skeleton and neck lymph nodes (compare Table 3). In more than 60% of the cases readers differed in the D score of the mediastinum. Mostly minor discrepancies occurred concerning D1,2 vs D3 or less often D3 vs. D4. Here most of the sources of variability discussed above collude: a high proportion of borderline uptake in the residual lymphoma mass, inhomogeneous FDG uptake in the reference region "mediastinum", differences among readers in the way of looking on it and the discrimination of physiological structures like vessel walls or thymus from typically weak residual uptake.
Large discrepancies in the interpretation occurred in the skeleton. In nearly two thirds of the cases in whom at least one of the readers detected an active tumour residuum in the skeleton, these findings were either misinterpreted or overlooked by other readers. The risk of overlooking a residual uptake is increased in the skeleton because of the large size of the region and the low frequency of involvement.
The interpretation of the neck region is complicated due to difficulties in discriminating lymph nodes from salivary glands or other structures with physiologically enhanced FDG uptake, often small size of the residuum and large distance to the reference organs. In addition, the distinction between original lymphoma and newly appeared inflammatory lymph nodes may be intricate.
Acquisition variability, quality and brown fat. Many acquisition parameters may influence the image characteristics and the quality of the FDG PET scans [17][18]. This study evaluated 100 cases whose images were acquired in 51 different PET/CT centres in nine countries for a large European clinical trial with substantial acquisition variability. However, limited scan quality as reported by the readers was not associated with increased discrepancy rates indicating that acquisition variability is not the main cause for the observed discrepancies.
Strong FDG-uptake in brown adipose tissue (BAT) is well-known to complicate the interpretation of FDG-PET/CT scans. This is a particular issue in the post-treatment situation when small residuals have to be identified or excluded. Discrepancy rates were slightly higher in scans with high FDG uptake in BAT. However, the inter-reader agreement remained poor even in good quality scans without BAT uptake.

Comparison with other inter reader studies using Deauville criteria
Although the D scale is widely considered as well established most validation papers only report concordance data on the binary discrimination D1,2,3 versus D4,5. Only Oki at al. [19] report concordance data concerning the full five point scale after 2 ABVD cycles. Their results (pairwise kappa values: 0.13, 0.29, and 0.38) are in agreement with our own (0.14-0.39).
Inter reader concordance for the distinction D1,2,3 versus D45 depends on the proportion of border line cases in the respective population. In our data (after 2 OEPA), in 68% of the cases D3 was assigned by at least one reader; 24% of the cases can be considered D3 (i.e. the mean of the scores rounds to 3). This D3 proportion is remarkably higher than reported by Barrington (6/50; 12%) and Oki (6/229; 2.4%) after 2 ABVD.
OEPA is a more intense chemotherapy regimen than ABVD. Thus uptake in incomplete metabolic responders may be shifted from higher residual uptake intensities towards the D34 border. However the proportion of positive cases is not lower in our data than in the data of Oki or Biggi [19,20]. This on the other hand suggests a shift of clearly negative cases towards the D34 border, e.g. due to granulocyte activation or an enhanced debriding reaction.

Limitations
Inter-reader concordance of interim PET scoring may depend on the population included and possibly on the nature and cycle number of previously administered chemotherapy. As our results were obtained in a pediatric patient population we cannot exclude that physiological uptake in the thymus and activation of brown adipose tissue marginally increased the rate of discrepancies. We used a particular intensive chemotherapy (OEPA), which may have increased the proportion of difficult borderline cases as compared to less intense regimen (e.g. ABVD). Our images are from a multicenter setting; monocentric images may be easier to interpret.

Conclusion
We have shown that inter-reader reliability of the complete 5 point D scale is poor in a real world study setting. In particular there is significant inter-reader variability in cases assigned for D 2 or 3. The distinction D 1,2,3 versus 4,5 is the most reliable criterion for clinical decision making. The use of a semi-automatic algorithm for comparison of the residual uptake intensity with the reference levels (16) may help to avoid inter-reader discordances and make decisions on treatment reduction or-intensification in PET-guided trials more reliable.
Supporting Information S1